Deep Reinforcement Learning for Cloud-Edge
Dr. Rajiv Misra
Professor
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
rajivm@iitp.ac.in
Preface
Content of this Lecture:
• In this lecture, we will discuss how collaborative cloud-edge approaches can provide better performance and efficiency than traditional cloud-only or edge-only approaches.
• We will also see how resource allocation strategies can be tailored to specific use cases and can evolve over time based on user demand and network conditions.
The Collaborative Cloud-Edge Environment
Introduction:
• The "user-edge-cloud" model refers to a
distributed computing environment where
resources are allocated across user devices, edge
nodes, and cloud servers.
• Resource allocation is important for optimizing
system performance while ensuring efficient use of
resources.
• Collaborative cloud-edge approaches can be more
effective than traditional approaches that focus
solely on cloud or edge resources.
Cloud Services:
• Cloud services can be divided into private and public cloud.
• Private cloud is dedicated to a single organization and provides greater control and
security.
• Public cloud is shared by multiple organizations and provides more flexibility and
scalability.
Edge Nodes:
• Edge nodes are local computing resources that are
closer to the user than the cloud node.
• Edge nodes can provide low-latency, high-bandwidth
services to users and can offload some processing from
the cloud.
Resource Allocation Strategies:
• Resource allocation strategies can be based on various factors, such as user demand, network conditions, and available resources.
• Collaborative cloud-edge approaches can use machine learning algorithms to optimize
resource allocation over time.
• Load balancing, task offloading, and caching are some common resource allocation
techniques that can be applied to both cloud and edge resources.
Multi-Edge-Node Scenario:
• In a multi-edge-node scenario, resource allocation becomes more complex, as the cloud and edge nodes must coordinate with each other to allocate resources effectively.
• Collaborative cloud-edge approaches can use communication protocols and data sharing
to enable effective coordination.
Public vs Private Cloud
Public Cloud Environment:
• In a public cloud environment, the cloud provider offers different pricing modes for
cloud services based on demand characteristics.
• Pricing modes have different cost structures that affect resource allocation strategies.
• Cloud service providers like Amazon, Microsoft, and Alicloud provide three different
pricing modes, each with different cost structures.
• The edge node must select the appropriate pricing mode and allocate user demands
to rented VMs or its own VMs.
Private Cloud Environment:
• In a private cloud environment, the edge node has its own virtual machines (VMs) to
process user demands.
• If the number of VMs requested exceeds the edge node's capacity, the edge node can
rent VMs from the cloud node to scale up.
• The cost of private cloud changes dynamically according to its physical computing cost,
so the edge node needs to allocate resources dynamically at each time slot according to
its policy.
• After allocating resources, the computing cost of the edge node and private cloud in this
time slot can be calculated and used to receive new computing tasks in the next time
slot.
User Settings
The time is discretized into T time slots. We assume that in each time slot t, the demand submitted by the user is defined as
$$D_t = (d_t, l_t)$$
where $d_t$ is the number of VMs requested by $D_t$ and $l_t$ is the computing time duration of $D_t$.
Computing Resources and Cost of Edge Nodes:
• The total computing resources owned by the edge node are represented by E.
• As resources are allocated to users, we use $e_t$ to denote the number of remaining VMs of the edge node in time slot t.
• The number of VMs provided by the edge node is denoted $d^e_t$, and the number of VMs provided by the cloud node is denoted $d^c_t$.
• Note that if the edge node has no available resources, it hands over all arriving computing tasks to the cloud service for processing. So the number of VMs provided by the edge node in time slot t is
$$d^e_t = \begin{cases} d_t - d^c_t, & e_t > 0 \\ 0, & e_t = 0 \end{cases}$$
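As a quick illustration of this split rule, the following minimal Python sketch computes $d^e_t$ and $d^c_t$ from a demand and the edge node's remaining capacity; the names are assumptions made for this example, and it assumes (only for illustration) that the edge serves as many VMs as it can:

```python
# Minimal sketch of the edge/cloud split rule above (names are illustrative).
def split_demand(demand_vms: int, remaining_vms: int) -> tuple[int, int]:
    """Return (d_e, d_c): VMs served by the edge node and by the cloud node."""
    if remaining_vms == 0:
        # No edge capacity left: hand the whole task over to the cloud.
        return 0, demand_vms
    d_e = min(demand_vms, remaining_vms)   # edge serves as much as it can
    d_c = demand_vms - d_e                 # the rest is rented from the cloud
    return d_e, d_c

# Example: edge has 3 free VMs and the user asks for 5 -> edge 3, cloud 2.
print(split_demand(demand_vms=5, remaining_vms=3))   # (3, 2)
```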
• When the resource allocation is successfully performed on the edge node,
each demand processed by the edge node will generate an allocation record.
$$h_t = (d^e_t, l_t)$$
• When a new demand arrives and resource allocation is completed, an
allocation record will be generated and added to an allocation record list:
$$H = \langle h_1, h_2, \ldots, h_m \rangle$$
At the end of each time slot, the following actions are taken:
• The edge node traverses the allocation record list and subtracts one from the
remaining computing time of each record.
• If a record's remaining computing time reaches 0, it means that the demand
has been completed. The edge node releases the corresponding VMs and
deletes the allocation record from the list.
• The number of VMs waiting to be released at the end of time slot t is denoted as $\eta_t$:
$$\eta_t = \sum_{i=1}^{m} d^e_i \quad \text{s.t. } l_i = 0,\ h_i \in H$$
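A minimal sketch of this end-of-slot bookkeeping is shown below; representing each record as a (d_e, l) tuple and the function name are assumptions for illustration:

```python
# Sketch of the end-of-slot bookkeeping: decrement remaining times, release
# finished demands and compute eta_t (record format (d_e, l) is assumed).
def end_of_slot(records: list[tuple[int, int]]) -> tuple[int, list[tuple[int, int]]]:
    """Return (eta_t, surviving allocation records)."""
    eta_t = 0
    surviving = []
    for d_e, l in records:
        l -= 1                      # one time slot has elapsed
        if l == 0:
            eta_t += d_e            # demand completed: these edge VMs are released
        else:
            surviving.append((d_e, l))
    return eta_t, surviving

eta_t, H = end_of_slot([(2, 1), (3, 4)])
print(eta_t, H)   # 2 [(3, 3)]
```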
• The number of remaining VMs at time slot t+1 is calculated from the number of remaining VMs at the beginning of time slot t, the quantity allocated during time slot t, and the quantity released due to the completion of computing tasks in time slot t:
$$e_{t+1} = e_t - d^e_t + \eta_t$$
• The cost of the edge node in time slot t is the sum of the standby cost $e_t\,p_s$ and the computing cost $(E - e_t)\,p_c$, where $p_s$ and $p_c$ are the unit standby and computing costs of an edge VM:
$$C^e_t = e_t\,p_s + (E - e_t)\,p_c$$
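Putting the two formulas together, a small sketch of the per-slot edge update might look as follows; E, p_s and p_c are illustrative values, not figures from the lecture:

```python
# Sketch of the edge-node update: next-slot remaining VMs and the edge cost.
E, p_s, p_c = 10, 0.1, 0.5    # assumed total VMs, standby and computing unit costs

def edge_step(e_t: int, d_e_t: int, eta_t: int) -> tuple[int, float]:
    """Return (e_{t+1}, C_t^e) for one time slot."""
    cost_edge = e_t * p_s + (E - e_t) * p_c    # standby cost + computing cost
    e_next = e_t - d_e_t + eta_t               # allocate d_e_t, release eta_t
    return e_next, cost_edge

print(edge_step(e_t=6, d_e_t=3, eta_t=2))      # (5, 2.6)
```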
Cost of Collaborative Cloud-Edge Computing
Cost in Private Cloud:
• In time slot t, the cost of collaborative cloud-edge computing in the private cloud environment is
$$C^{pri}_t = d^c_t\,p_t + C^e_t$$
where $d^c_t$ is the number of VMs provided by the cloud node, $p_t$ is the unit cost of VMs in the private cloud, and $C^e_t$ is the cost of the edge node.
Cost in Public Cloud:
• In time slot t, the cost of collaborative cloud-edge computing in the public cloud environment includes the computing cost of the cloud node and the cost of the edge node:
$$C^{pub}_t = X_1\,p_{od}\,d^c_t + X_2\,p_{reserve} + X_3\,p_{re}\,d^c_t + X_4\,p_t\,d^c_t + C^e_t$$
$$X_i = \begin{cases} 1, & \text{the service is used} \\ 0, & \text{the service is not used} \end{cases}$$
where
$X_1\,p_{od}\,d^c_t$ is the cost of the on-demand instance,
$X_2\,p_{reserve} + X_3\,p_{re}\,d^c_t$ is the cost of the reserved instance, and
$X_4\,p_t\,d^c_t$ is the cost of the spot instance, with $p_t$ the unit cost of a spot instance in time slot t.
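The two cost models can be sketched as small helper functions; the concrete prices and the per-mode breakdown below are assumptions chosen only to mirror the formulas above:

```python
# Sketch of the private- and public-cloud cost models (illustrative prices).
def cost_private(d_c: int, p_t: float, cost_edge: float) -> float:
    """C_t^pri = d_c * p_t + C_t^e."""
    return d_c * p_t + cost_edge

def cost_public(d_c: int, cost_edge: float, mode: str,
                p_od: float = 0.8, p_reserve: float = 5.0,
                p_re: float = 0.2, p_spot: float = 0.3) -> float:
    """C_t^pub for one selected pricing mode: 'on_demand', 'reserved' or 'spot'."""
    if mode == "on_demand":
        cloud = p_od * d_c                   # X1 * p_od * d_c
    elif mode == "reserved":
        cloud = p_reserve + p_re * d_c       # X2 * p_reserve + X3 * p_re * d_c
    elif mode == "spot":
        cloud = p_spot * d_c                 # X4 * p_t * d_c (spot price)
    else:
        raise ValueError(f"unknown pricing mode: {mode}")
    return cloud + cost_edge

print(cost_private(d_c=4, p_t=0.4, cost_edge=2.6))         # 4.2
print(cost_public(d_c=4, cost_edge=2.6, mode="reserved"))  # 8.4
```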
Goal
• The time is divided into T time slots, and at the beginning of each time
slot t, the user submits its demand to the edge node.
• The edge node allocates the demands to either cloud VMs or its own VMs
based on its resource allocation strategy.
• In a public cloud environment, the edge node determines the type of cloud
service to be used based on the allocation and the price of the corresponding
cloud service set by the cloud service provider.
• The cost of the current time slot t, denoted as $C_t$, is calculated based on the allocation and the price of the corresponding cloud service set by the cloud service provider.
• The long-term cost of the system is minimized over the T time slots by minimizing the sum of the costs over all time slots, i.e.
$$\min \sum_{t=1}^{T} C_t$$
Resource Allocation Algorithms: 1. Markov Decision Process
• The resource allocation problem is a sequential decision-making problem, so it can be modeled as a Markov decision process (MDP).
• An MDP is a tuple (S, A, P, r, γ), where S is the finite set of states, A is the finite set of actions, P is the state transition probability, and r and γ are the immediate reward and the discount factor, respectively.
• $s_t = (e_t, \eta_{t-1}, D_t, p_t) \in S$ describes the state of the edge node at the beginning of each time slot, where
$e_t$: number of remaining VMs of the edge node in t,
$\eta_{t-1}$: number of VMs returned in the previous time slot,
$D_t$: user's demand information in t,
$p_t$: unit cost of VMs in the private cloud in t.
• $a_t = (x_e, x_k) \in A$, where
$x_e$: ratio of the number of VMs provided by the edge node to the total number of VMs,
$x_k$: ratio of the number of VMs provided by the cloud node to the total number of VMs.
• $r_t = -C^{pri}_t$ is the reward in each time slot.
Note:
We want to minimize the long-term operating cost, so the reward function is set to the negative of the cost; maximizing the return $R = \sum_{t=1}^{T} r(s_t, a_t)$ then minimizes the total cost.
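To make the MDP concrete, here is a hedged sketch of a one-step environment with the state, action and reward defined above; the class name, price ranges and the simple demand generator are assumptions for illustration only, not the lecture's setup:

```python
# Illustrative sketch of the private-cloud MDP: observe s_t, apply a_t = (x_e, x_k),
# receive r_t = -C_t^pri. All numbers and names are assumptions.
import random

class PrivateCloudEdgeEnv:
    def __init__(self, total_vms: int = 10, p_s: float = 0.1, p_c: float = 0.5):
        self.E, self.p_s, self.p_c = total_vms, p_s, p_c
        self.e_t, self.eta_prev = total_vms, 0

    def observe(self):
        d_t, l_t = random.randint(1, 8), random.randint(1, 4)    # demand D_t = (d_t, l_t)
        p_t = random.uniform(0.2, 0.6)                           # private-cloud unit price
        return (self.e_t, self.eta_prev, (d_t, l_t), p_t)        # state s_t

    def step(self, state, x_e: float):
        """Apply a_t = (x_e, x_k) with x_k = 1 - x_e; return the reward r_t."""
        e_t, _, (d_t, _), p_t = state
        d_e = min(round(x_e * d_t), e_t)          # VMs kept on the edge
        d_c = d_t - d_e                           # VMs rented from the private cloud
        cost_edge = e_t * self.p_s + (self.E - e_t) * self.p_c
        self.e_t = e_t - d_e                      # releases (eta_t) omitted for brevity
        return -(d_c * p_t + cost_edge)           # r_t = -C_t^pri

env = PrivateCloudEdgeEnv()
s = env.observe()
print(s, env.step(s, x_e=0.5))
```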
2. Parameterized Action Markov Decision Process
• In the public cloud environment, first, the edge node needs to select the pricing mode of
cloud service to be used and then determine the resource segmentation between the
edge node and the cloud node in each time slot t.
• The resource allocation action can be described by a parameterized action.
• To describe this sequential decision with parameterized actions, the parameterized action Markov decision process (PAMDP) is used.
• Similar to the Markov decision process, a PAMDP is a tuple (S, A, P, r, γ); the difference is that A is now a finite set of parameterized actions.
• The specific modeling is as follows.
• $s_t = (e_t, \eta_{t-1}, D_t, p_t, \xi_t) \in S$, where $p_t$ is the unit cost of a spot instance in t and $\xi_t$ is the remaining usage time of the reserved instance. If the edge node does not use this type of cloud service, or it has expired, $\xi_t = 0$.
• $a_t = (x_e, (k, x_k)) \in A$, where $K = \{k_1, k_2, k_3\}$ is the set of all discrete actions: $k_1$ is the on-demand instance, $k_2$ the reserved instance, and $k_3$ the spot instance.
• $r_t = -C^{pub}_t$ is the reward in each time slot.
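A parameterized action can be represented, for instance, as a small data structure combining the discrete pricing mode with its continuous parameter; the names below are assumptions for illustration:

```python
# Sketch of a PAMDP action a_t = (x_e, (k, x_k)) for the public-cloud setting.
from dataclasses import dataclass

PRICING_MODES = ("on_demand", "reserved", "spot")   # k1, k2, k3

@dataclass
class ParamAction:
    k: str        # selected discrete pricing mode
    x_k: float    # fraction of requested VMs served by the public cloud under mode k

    @property
    def x_e(self) -> float:
        return 1.0 - self.x_k                       # fraction kept on the edge

a_t = ParamAction(k="spot", x_k=0.4)
print(a_t, "x_e =", a_t.x_e)    # ParamAction(k='spot', x_k=0.4) x_e = 0.6
```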
3. Resource Allocation Based on Deep Deterministic Policy Gradient
• DDPG is a classical Actor-Critic algorithm.
• The Actor generates actions based on the policy and interacts with the environment.
• The Critic evaluates the Actor's actions through a value function, which guides the Actor's next action.
• This improves convergence and performance.
DDPG introduces ideas from DQN and contains four networks. The main Actor network selects an appropriate action a according to the current state s and interacts with the environment:
$$a = \pi_\theta(s) + \mathcal{N}$$
where $\mathcal{N}$ is the added exploration noise.
For the main Critic network, the loss function is
$$J(\omega) = \frac{1}{m} \sum_{j=1}^{m} \big( y_j - Q(\phi(s_j), a_j, \omega) \big)^2 \quad (1)$$
where $y_j$ is the target Q value, calculated as
$$y_j = r_j + \gamma\, Q'\big(\phi(s'_j), \pi_{\theta'}(\phi(s'_j)), \omega'\big) \quad (2)$$
For the main Actor network, the policy gradient is
$$\nabla_\theta J(\theta) = \frac{1}{m} \sum_{j=1}^{m} \nabla_a Q(s_j, a_j, \omega)\big|_{s=s_j,\,a=\pi_\theta(s_j)}\, \nabla_\theta \pi_\theta(s)\big|_{s=s_j} \quad (3)$$
The parameters $\omega'$ of the Critic target network and the parameters $\theta'$ of the Actor target network are updated using a soft update:
$$\omega' \leftarrow \tau\omega + (1-\tau)\,\omega', \qquad \theta' \leftarrow \tau\theta + (1-\tau)\,\theta' \quad (4)$$
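Below is a hedged PyTorch sketch of one DDPG update implementing (1)-(4); the network architectures, dimensions and hyper-parameters are illustrative assumptions, not the lecture's exact configuration:

```python
# Hedged sketch of one DDPG update step (equations (1)-(4)); sizes are assumed.
import copy
import torch
import torch.nn as nn

state_dim, action_dim = 5, 2          # e.g. s_t and a_t = (x_e, x_k); assumed sizes
gamma, tau = 0.99, 0.005

actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                      nn.Linear(64, action_dim), nn.Sigmoid())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))
actor_target, critic_target = copy.deepcopy(actor), copy.deepcopy(critic)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_update(s, a, r, s_next):
    # (2) target Q value from the target networks
    with torch.no_grad():
        a_next = actor_target(s_next)
        y = r + gamma * critic_target(torch.cat([s_next, a_next], dim=1))
    # (1) Critic loss: mean squared error to the target
    q = critic(torch.cat([s, a], dim=1))
    critic_loss = nn.functional.mse_loss(q, y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    # (3) Actor loss: ascend Q along the actor's own action
    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    # (4) soft update of both target networks
    for net, net_t in ((critic, critic_target), (actor, actor_target)):
        for p, p_t in zip(net.parameters(), net_t.parameters()):
            p_t.data.mul_(1 - tau).add_(tau * p.data)

m = 8   # batch size
ddpg_update(torch.randn(m, state_dim), torch.rand(m, action_dim),
            torch.randn(m, 1), torch.randn(m, state_dim))
print("one DDPG update completed")
```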
3. Resource Allocation Based on Deep Deterministic Policy Gradient
• The DDPG structure is shown in the figure.
• The input of the algorithm contains the user's demand information $D_t$ and the unit cost of VMs in the private cloud.
• At the beginning of each iteration, the edge node first obtains the state $s_t$ of the collaborative cloud-edge environment.
• It then passes the state as input to the main Actor network to obtain the action $a_t$.
• After the edge node gets the action, the number of demands to be processed by the edge node and the number of demands to be processed by the private cloud, i.e., $d^e_t$ and $d^c_t$, are calculated from the action value.
• The edge node then interacts with the environment based on $d^e_t$ and $d^c_t$ to get the next state, the reward, and the termination flag.
• This round of experience is stored in the experience replay pool.
• CERAI samples from the experience replay pool and calculates the loss functions of the Actor and the Critic to update the parameters of the main and target networks.
• After one round of iteration, training continues up to the maximum number of training rounds to ensure convergence of the resource allocation policy.
CERAI (Cost-Efficient Resource Allocation with Private Cloud) Algorithm
1. Initialize the Actor main network and target network parameters $\theta$, $\theta'$; the Critic main network and target network parameters $\omega$, $\omega'$; the soft update coefficient $\tau$; the number of samples for batch gradient descent m; the maximum number of iterations M; the random noise $\mathcal{N}$; and the experience replay pool K.
2. For i = 1 to M do
3. Receive user task information and obtain the status s of collaborative cloud-edge
computing environment;
4. The Actor main network selects an action according to s: $a = \pi_\theta(s) + \mathcal{N}$;
5. The edge node performs action a and obtains the next status s', reward r and termination flag isend;
6. The edge node generates an allocation record $h_t$ according to the allocation operation and adds it to the allocation record list H;
7. Add the state transition tuple (s, a, r, s', isend) to the experience replay pool K;
8. Update status: s = s';
9. Sample m samples from the experience replay pool K and calculate the target Q value y according to (2);
10. Calculate the loss function according to (1) and update the parameters of the Critic main
network;
11. Calculate the loss function according to (3) and update the parameters of the Actor main network;
12. Update the parameters of the Critic and Actor target networks according to (4);
13. Update allocation record H and release computing resources for completed tasks;
14. If s' is terminal, complete the current round of iteration; otherwise go to step 3;
15. end.
4. Resource Allocation Based on P-DQN
The basic idea of P-DQN is as follows.
• For each action $a \in A$ in the parameterized action space, because $x_e + x_k = 1$, we only need to consider k and $x_k$ in the action-value function, i.e., $Q(s, a) = Q(s, k, x_k)$, where $s \in S$, $k \in K$ is the discrete action selected in time slot t, and $x_k \in X_k$ is the parameter value corresponding to k.
• Similar to DQN, a deep neural network $Q(s, k, x_k; \omega)$ is used in P-DQN to estimate $Q(s, k, x_k)$, where $\omega$ is the neural network parameter.
• For $Q(s, k, x_k; \omega)$, P-DQN uses a deterministic policy network $x_k(\cdot; \theta): S \to X_k$ to estimate the parameter value $x^Q_k(s)$, where $\theta$ represents the policy network parameters. That is, the goal of P-DQN is to find parameters $\theta$ such that, with $\omega$ fixed,
$$Q\big(s, k, x_k(s; \theta); \omega\big) \approx Q(s, k, x_k; \omega) \quad (5)$$
• Similar to DQN, the value of $\omega$ can be obtained by minimizing the mean squared error via gradient descent.
• In particular, at step t, $\omega_t$ and $\theta_t$ are the parameters of the value network and the deterministic policy network, respectively.
• The target $y_t$ can be written as
$$y_t = r_t + \gamma \max_{k \in K} Q\big(s', k, x_k(s'; \theta_t); \omega_t\big) \quad (6)$$
where s' is the next state after taking the mixed action $a = (k, x_k)$.
The loss function of the value network can be written as
$$l_Q(\omega) = \frac{1}{2}\big(Q(s, k, x_k; \omega) - y\big)^2 \quad (7)$$
and the loss function of the policy network can be written as
$$l_\theta(\theta) = -\sum_{k \in K} Q\big(s, k, x_k(s; \theta); \omega\big) \quad (8)$$
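The following is a hedged PyTorch sketch of the P-DQN networks and the losses (6)-(8); the dimensions, architectures and names are illustrative assumptions rather than the lecture's exact setup:

```python
# Hedged sketch of P-DQN: a policy network proposes x_k for every discrete mode k,
# and a value network scores Q(s, k, x_k). Dimensions and names are assumed.
import torch
import torch.nn as nn

state_dim, num_k, gamma = 6, 3, 0.99     # k1: on-demand, k2: reserved, k3: spot

policy = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                       nn.Linear(64, num_k), nn.Sigmoid())       # x_k(s; theta)
value = nn.Sequential(nn.Linear(state_dim + num_k, 64), nn.ReLU(),
                      nn.Linear(64, num_k))                      # Q(s, k, x_k; omega)

def q_values(s):
    """Q(s, k, x_k(s; theta); omega) for every discrete action k (cf. eq. (5))."""
    x = policy(s)
    return value(torch.cat([s, x], dim=1)), x

def pdqn_losses(s, k, x_taken, r, s_next):
    # (6) target: y = r + gamma * max_k Q(s', k, x_k(s'; theta); omega)
    with torch.no_grad():
        q_next, _ = q_values(s_next)
        y = r + gamma * q_next.max(dim=1, keepdim=True).values
    # (7) value-network loss, using the x_k that was actually executed
    q_exec = value(torch.cat([s, x_taken], dim=1)).gather(1, k)
    l_q = 0.5 * (q_exec - y).pow(2).mean()
    # (8) policy-network loss: -sum_k Q(s, k, x_k(s; theta); omega)
    q_pi, _ = q_values(s)
    l_theta = -q_pi.sum(dim=1).mean()
    return l_q, l_theta

m = 4
print(pdqn_losses(torch.randn(m, state_dim),
                  torch.randint(0, num_k, (m, 1)),
                  torch.rand(m, num_k),
                  torch.randn(m, 1),
                  torch.randn(m, state_dim)))
```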
• The P-DQN structure is shown in the figure.
• Cost-Efficient Resource Allocation with Public Cloud (CERAU) is a resource allocation algorithm based on P-DQN. The input of the algorithm contains the user's demand information $D_t$ and the unit cost $p_t$ of a spot instance in the public cloud in time slot t.
• At the beginning of each iteration of the algorithm, the edge node first obtains the state $s_t$ of the collaborative cloud-edge environment.
• It then passes the state as input to the policy network to obtain the parameter value of each discrete action.
• After the edge node gets the action, it selects the appropriate public cloud instance type based on the discrete part of the action and determines the number of public cloud instances to be used based on the parameter value.
• Then it interacts with the environment to get the next state, the reward, and the termination flag.
• This round of experience is stored in the experience replay pool; CERAU samples from the experience replay pool and calculates the gradients of the value network and the policy network.
• It then updates the parameters of the corresponding networks.
• After one round of iteration, training continues up to the maximum number of training rounds to ensure convergence of the resource allocation policy.
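The ε-greedy selection of a mixed action used by CERAU (step 5 of the algorithm that follows) can be sketched as below; the tiny stand-in networks and names are assumptions for illustration:

```python
# Sketch of epsilon-greedy selection of a mixed action (k, x_k) as in CERAU step 5.
import random
import torch
import torch.nn as nn

state_dim, num_k = 6, 3
policy = nn.Sequential(nn.Linear(state_dim, num_k), nn.Sigmoid())   # x_k(s; theta)
value = nn.Sequential(nn.Linear(state_dim + num_k, num_k))          # Q(s, k, x_k; omega)

def select_action(s: torch.Tensor, epsilon: float = 0.1):
    """Return (k, x_k): a discrete pricing-mode index and its continuous parameter."""
    with torch.no_grad():
        x = policy(s)                                # parameter value for every mode
        q = value(torch.cat([s, x], dim=1))          # Q(s, k, x_k) for every k
    if random.random() < epsilon:
        k = random.randrange(num_k)                  # explore: random discrete action
    else:
        k = int(q.argmax(dim=1))                     # exploit: argmax_k Q(s, k, x_k)
    return k, float(x[0, k])

print(select_action(torch.randn(1, state_dim)))
```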
Algorithm: Cost-Efficient Resource Allocation with Public Cloud (CERAU)
1. Initialize the exploration parameter $\epsilon$, the soft update coefficients $\tau_1$ and $\tau_2$, the number of samples for batch gradient descent m, the maximum number of iterations M, the random noise $\mathcal{N}$ and the experience replay pool P;
2. for i = 1 to M do
3. Receive user task information and obtain the status s of the collaborative cloud-edge computing environment;
4. Calculate the parameter value of each instance type in the cloud service: $x_k \leftarrow x_k(s_t, \theta_t) + \mathcal{N}$;
5. Select the discrete action according to an $\epsilon$-greedy strategy:
$$a = \begin{cases} \text{random discrete action}, & rnd < \epsilon \\ (k, x_k),\ k = \arg\max_{k \in K} Q(s, k, x_k; \omega), & rnd \geq \epsilon \end{cases}$$
6. The edge node performs action and obtains the next status s’, reward r and termination flag isend;
7. The edge node generates an allocation record ℎ$ according to the allocation operation. Add it to
the allocation record list H;
8. Add the state transition tuple (s, a, r, s', isend) to the experience replay pool P;
9. Sample m samples from experience replay pool P, calculate the target Q value y according to (6);
10. Update status: s = s';
11. Calculate the gradients $\nabla_\omega l_Q(\omega)$ and $\nabla_\theta l_\theta(\theta)$ according to (7) and (8);
12. Update the network parameters: $\omega' \leftarrow \omega - \tau_1 \nabla_\omega l_Q(\omega)$, $\theta' \leftarrow \theta - \tau_2 \nabla_\theta l_\theta(\theta)$;
13. Update the allocation record H and release computing resources for completed tasks;
14. If s' is terminal, complete the current round of iteration; otherwise go to step 3;
15. end
Thank You