Consider a single-cell multi-user downlink system. The base station (BS) is equipped with N_t antennas and serves L single-antenna users. All users in the cell are divided into N clusters, each of which includes K users. The users' data within one cluster are transmitted in the form of a power-domain NOMA superposed signal and are preprocessed by the same beamforming vector. Assume that the BS deploys its antennas on the Y-Z plane as a uniform planar array (UPA). Then, the channel vector from the BS to user m can be modeled similarly to [14], which is given by

h_m = (d_m/d_0)^{\mu} \sqrt{\frac{1}{L_u}} \sum_{l=1}^{L_u} g_{m,l}\, b(v_{m,l}) \otimes a(u_{m,l})    (1)

Here d_0 is the cell radius and d_m is the distance between the user and the BS. \mu is the large-scale fading factor. b(v_{m,l}) and a(u_{m,l}) are the vertical and horizontal array response vectors of the UPA antenna, respectively. The symbol "\otimes" in formula (1) represents the Kronecker product of two matrices. L_u is the number of scattering paths and g_{m,l} denotes the small-scale fading coefficient.
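To make the channel model (1) concrete, the following numpy sketch generates one UPA channel vector. The array size, path-loss exponent, distances, and angle distributions are assumed example values for illustration only, not parameters taken from this letter.

```python
import numpy as np

def steering_vector(n_antennas, spatial_freq):
    """Uniform-linear-array response with half-wavelength spacing (assumed normalization)."""
    idx = np.arange(n_antennas)
    return np.exp(1j * np.pi * idx * spatial_freq) / np.sqrt(n_antennas)

def upa_channel(n_vert=8, n_horiz=8, L_u=6, d_m=100.0, d_0=500.0, mu=3.0,
                rng=np.random.default_rng(0)):
    """Channel vector h_m of Eq. (1): path loss times a sum of Kronecker-structured paths."""
    h = np.zeros(n_vert * n_horiz, dtype=complex)
    for _ in range(L_u):
        g = (rng.standard_normal() + 1j * rng.standard_normal()) / np.sqrt(2)  # small-scale fading g_{m,l}
        v = rng.uniform(-1, 1)   # vertical spatial frequency (assumed distribution)
        u = rng.uniform(-1, 1)   # horizontal spatial frequency (assumed distribution)
        h += g * np.kron(steering_vector(n_vert, v), steering_vector(n_horiz, u))
    return (d_m / d_0) ** mu * h / np.sqrt(L_u)

h_m = upa_channel()
print(h_m.shape, np.linalg.norm(h_m))
```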
Assume the data sent by the BS is X = [x_1, ..., x_N]^T ∈ C^{N×1}, where x_n = \sum_{k=1}^{K} \sqrt{\alpha_{n,k} P_n}\, s_{n,k} is the superposed NOMA signal of the K users in the n-th cluster. Here, P_n is the total power of the n-th cluster. \alpha_{n,k} and s_{n,k} are the power allocation factor and the transmitted symbol of the k-th user in the n-th cluster, denoted by U_{n,k}, respectively. It is assumed that E[|s_{n,k}|^2] = 1. The received signal of user U_{n,k} is

y_{n,k} = h_{n,k} W X + z_{n,k} = h_{n,k} w_n \sqrt{\alpha_{n,k} P_n}\, s_{n,k} + \sum_{j=1, j \neq k}^{K} h_{n,k} w_n \sqrt{\alpha_{n,j} P_n}\, s_{n,j} + \sum_{i=1, i \neq n}^{N} h_{n,k} w_i x_i + z_{n,k}    (2)
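As a toy illustration of the signal model above, the sketch below forms the superposed NOMA symbols x_n and one received sample y_{n,k}. The cluster sizes, powers, power allocation factors, and the random beamformer are placeholders, not the scheme's actual design.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K, Nt = 2, 2, 8                    # clusters, users per cluster, BS antennas (example values)
P = np.array([1.0, 1.0])              # per-cluster powers P_n
alpha = np.array([[0.8, 0.2],         # power allocation factors alpha_{n,k}, summing to 1 per cluster
                  [0.7, 0.3]])
s = (rng.standard_normal((N, K)) + 1j * rng.standard_normal((N, K))) / np.sqrt(2)  # unit-power symbols

# Superposed NOMA symbol per cluster: x_n = sum_k sqrt(alpha_{n,k} P_n) s_{n,k}
x = (np.sqrt(alpha * P[:, None]) * s).sum(axis=1)

W = rng.standard_normal((Nt, N)) + 1j * rng.standard_normal((Nt, N))
W /= np.linalg.norm(W, axis=0)        # unit-norm beamforming columns w_n (placeholder beamformer)

h = rng.standard_normal(Nt) + 1j * rng.standard_normal(Nt)   # channel of one user (placeholder)
noise = (rng.standard_normal() + 1j * rng.standard_normal()) * 0.1 / np.sqrt(2)

y = h @ W @ x + noise                 # received sample y_{n,k} = h_{n,k} W X + z_{n,k}
print(y)
```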
where B is the bandwidth of one user channel. C1 is the power allocation factor constraint for each user cluster and C2 denotes the total power constraint of the BS over one user channel. C3 ensures the minimum data rate for each user and C4 is the norm constraint of the beamforming matrix.

Because the joint optimization problem (4) is non-convex [5], the traditional solutions are usually heuristic or alternating iterative methods, which suffer from high complexity or limited performance. In contrast, DRL can not only fully explore the hidden information in large amounts of data to improve its own learning performance, but can also realize dynamic real-time interaction. Its strong generalization ability highlights its advantages in wireless RA. Therefore, this letter proposes a DRL-based joint optimization method to solve problem (4) in Section III.

III. DQN-BASED RESOURCE ALLOCATION SCHEME

The proposed RA network based on DQN is shown in Fig. 1 and includes three parts: user clustering, power allocation, and beamforming. The first two parts are the main concern of this section.

A. User Clustering based on DQN

The user clustering problem is modeled as an RL task, which consists of an agent and an environment. Specifically, the user clustering module is taken as the agent and the performance of the massive MIMO-NOMA system is the environment. The actions {a_t} taken by the agent are based on the expected rewards from the environment. According to the considered system, each part of the RL framework is described as follows:
State space S: The CSI of all users in each cluster forms the current state of the t-th iteration, which is given by s_t = {[h_{1,1}, ..., h_{1,K}], ..., [h_{N,1}, ..., h_{N,K}]}.

Action space A: It should contain actions that can cover all possible clustering results. When there are L users, the number of possible actions reaches C_L^K C_{L-K}^K \cdots C_{L-(N-1)K}^K / N!, so the size of the action space increases dramatically with the number of users. The purpose of taking an action is to select a suitable group for each user. Taking the action a_t = {[U_{1,1}^t, ..., U_{1,K}^t], ..., [U_{N,1}^t, ..., U_{N,K}^t]} in the state s_t results in the new state s_{t+1}. This impact is denoted as s_t \xrightarrow{a_t} s_{t+1}. Here s_{t+1} is {[h_{1,1}^t, ..., h_{1,K}^t], ..., [h_{N,1}^t, ..., h_{N,K}^t]}, which fully corresponds to the new user grouping result a_t, and h_{n,k}^t is the CSI of the user U_{n,k}^t, i.e., the k-th user in the n-th group under the current action a_t.
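To illustrate how quickly this action space grows, the short snippet below evaluates C_L^K C_{L-K}^K \cdots C_{L-(N-1)K}^K / N! for a few example (L, K) pairs; the pairs themselves are chosen only for illustration.

```python
from math import comb, factorial

def num_clustering_actions(L, K):
    """Number of ways to partition L users into N = L//K unordered clusters of size K."""
    N = L // K
    count, remaining = 1, L
    for _ in range(N):
        count *= comb(remaining, K)   # C_{remaining}^{K}
        remaining -= K
    return count // factorial(N)      # clusters are unordered, hence the division by N!

for L, K in [(4, 2), (8, 2), (8, 4), (12, 3)]:
    print(L, K, num_clustering_actions(L, K))
# e.g. L=8, K=4 gives 35 actions, while L=12, K=3 already gives 15400
```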
Reward function: Here, the system sum rate r_t = R_sum is taken as the reward function, which also depends on {\alpha_{n,k}} and {w_n}. The ideal goal of RL is to maximize the cumulative discounted reward R_t = \sum_{i=0}^{\infty} \gamma^i r_{t+i}, where \gamma ∈ [0, 1] is the discount factor. Obviously, with such a target the action of each iteration could only be finalized after all iterations have been completed. Therefore, Q-learning uses the state-action value (Q-value) function defined in (5) [15], which can determine the current action based only on the next iteration's value function:

Q(s_t, a_t) = E[ r_t + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) \,|\, s_t, a_t ]    (5)

In order to find the optimal a_t for a given state s_t that maximizes the Q-value, (5) has to be evaluated for all actions. If L is very large, the complexity will be extremely high and the algorithm converges slowly.
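As a small numeric illustration of the discounted return and the recursion that (5) exploits, consider the following sketch; the reward sequence and \gamma are arbitrary example values.

```python
gamma = 0.9
rewards = [4.0, 3.5, 5.0, 4.2]          # example sum-rate rewards r_t, r_{t+1}, ...

# Discounted return R_t = sum_i gamma^i * r_{t+i}
R_t = sum(gamma ** i * r for i, r in enumerate(rewards))

# Equivalent backward recursion R_t = r_t + gamma * R_{t+1}, the same structure
# that (5) uses with Q-values in place of returns.
R_rec = 0.0
for r in reversed(rewards):
    R_rec = r + gamma * R_rec

print(R_t, R_rec)   # both are approximately 14.26
```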
Deep Q-Network: To speed up the convergence of Q-learning, a deep Q-network is adopted to estimate the Q-values. A fully connected neural network with three hidden layers is involved. Its input and output are the current state and the estimated Q-values corresponding to each action, respectively. The function of the Q-network is represented as Q(s_t, a_t, \omega), where \omega is the weight set to be trained. In order to ensure the convergence of the Q-network's parameter updates, a target Q-network, which has the same structure and initial weights as the Q-network but keeps the old weights \omega^{C-} of C iterations ago, is included to provide relatively stable labels for Q-network training. Hence, the loss function used in training is given by

L(\omega) = E[ ( r + \gamma \max_{a'} Q(s', a', \omega^{C-}) - Q(s, a, \omega) )^2 ]    (6)

where the label y = r + \gamma \max_{a'} Q(s', a', \omega^{C-}) is calculated based on the old weights \omega^{C-} and the new state s' reached after taking action a in state s. In addition, the experience replay strategy used in [15] is adopted in training. First, the observation samples (s_t, a_t, r_t, s_{t+1}) from previous iterations are stored. Then a mini-batch of samples is randomly selected from the replay memory and fed into the target Q-network to obtain the training labels. This process breaks the correlations among data samples and makes training convergent. Following the Q-network, an ε-greedy strategy is utilized to choose the current action: with probability ε an action is selected randomly from the whole action space, and with probability (1 − ε) the action with the maximum Q-value according to the Q-network output is chosen.
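A minimal PyTorch sketch of such a DQN training step, with a target network refreshed every C iterations, experience replay, and ε-greedy selection, is given below. The layer widths, optimizer, and hyper-parameter values are assumptions for illustration; the letter does not fix them here.

```python
import random
from collections import deque

import torch
import torch.nn as nn

STATE_DIM, NUM_ACTIONS = 32, 105       # flattened CSI features and clustering actions (example sizes)
GAMMA, EPSILON, BATCH, C = 0.9, 0.1, 64, 100

def build_qnet():
    # Fully connected Q-network with three hidden layers, as described in the letter.
    return nn.Sequential(
        nn.Linear(STATE_DIM, 128), nn.ReLU(),
        nn.Linear(128, 128), nn.ReLU(),
        nn.Linear(128, 128), nn.ReLU(),
        nn.Linear(128, NUM_ACTIONS),
    )

q_net = build_qnet()
target_net = build_qnet()
target_net.load_state_dict(q_net.state_dict())   # same structure and initial weights
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)   # optimizer choice is an assumption
replay = deque(maxlen=10_000)                    # stores (s_t, a_t, r_t, s_{t+1}) tuples

def select_action(state):
    """epsilon-greedy: random action with prob. EPSILON, else argmax of the Q-network output."""
    if random.random() < EPSILON:
        return random.randrange(NUM_ACTIONS)
    with torch.no_grad():
        return int(q_net(state).argmax())

def train_step(step):
    """One DQN update: sample a mini-batch, build labels from the target net, minimize loss (6)."""
    if len(replay) < BATCH:
        return
    batch = random.sample(replay, BATCH)
    s = torch.stack([b[0] for b in batch])
    a = torch.tensor([b[1] for b in batch])
    r = torch.tensor([b[2] for b in batch])
    s_next = torch.stack([b[3] for b in batch])

    with torch.no_grad():                      # label y = r + gamma * max_a' Q(s', a', w^{C-})
        y = r + GAMMA * target_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)

    loss = nn.functional.mse_loss(q_sa, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if step % C == 0:                          # refresh the target weights every C iterations
        target_net.load_state_dict(q_net.state_dict())

# After every environment interaction: replay.append((state, action, reward, next_state)),
# then call train_step(step).
```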
B. Power Allocation based on BPNN

In order to ensure the effectiveness of the SIC receiver in power-domain NOMA, the power of users in the same cluster needs to be assigned appropriately. Power allocation is the key to the trade-off between the system sum rate and user fairness. Different from traditional optimization algorithms, a BPNN-based power allocator is designed to reduce the computational complexity while still obtaining good performance.

The task of power allocation is to calculate the users' power allocation factors \alpha_n = [\alpha_{n,1}, ..., \alpha_{n,K}], \alpha_{n,k} ∈ [0, 1], for each group under a given user clustering result. In order to exploit the nonlinear mapping ability of the BPNN to extract the relationship between the power allocation and the users' CSI within a cluster, the BPNN needs to be trained on large amounts of labeled data. Here, the result \hat{\alpha}_n of an exhaustive search based power allocation (ESPA) algorithm, executed over a finite power allocation factor set, is used as the training label for the BPNN. The finite set for ESPA is obtained by discretizing the continuous factor range with a small step size. In order to ensure user fairness, the constraint R_{n,k} ≥ R_min is enforced by the ESPA. Because the beamforming matrix is unknown when the training data for the BPNN are generated, the calculation of R_{n,k} does not involve W and its expression is

R_{n,k} = B \log\big( 1 + \|h_{n,k}\|^2 \alpha_{n,k} P_n / ( \|h_{n,k}\|^2 \sum_{j=k+1}^{K} \alpha_{n,j} P_n + \sigma^2 ) \big).
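The following sketch shows how ESPA training labels of this kind could be generated for one cluster: the factor range is discretized and all combinations that sum to one and satisfy R_{n,k} ≥ R_min are enumerated. The grid step, bandwidth, noise power, and R_min are assumed example values, not those used in the letter.

```python
import itertools
import numpy as np

B, SIGMA2, R_MIN = 1.0, 0.1, 0.5      # bandwidth, noise power, minimum rate (example values)

def rates(h_norms, alpha, P_n):
    """Per-user rates R_{n,k} without beamforming; user k sees interference from users j > k."""
    K = len(alpha)
    r = np.empty(K)
    for k in range(K):
        interf = h_norms[k] ** 2 * np.sum(alpha[k + 1:]) * P_n
        r[k] = B * np.log2(1 + h_norms[k] ** 2 * alpha[k] * P_n / (interf + SIGMA2))
    return r

def espa_label(h_norms, P_n, step=0.05):
    """Exhaustive search over discretized factors summing to 1, subject to R_{n,k} >= R_MIN."""
    grid = np.arange(step, 1.0, step)
    best_alpha, best_sum = None, -np.inf
    K = len(h_norms)
    for combo in itertools.product(grid, repeat=K - 1):
        alpha = np.array(list(combo) + [1.0 - sum(combo)])
        if alpha[-1] <= 0:
            continue
        r = rates(h_norms, alpha, P_n)
        if np.all(r >= R_MIN) and r.sum() > best_sum:
            best_alpha, best_sum = alpha, r.sum()
    return best_alpha

print(espa_label(np.array([1.5, 0.6]), P_n=2.0))   # label alpha_hat_n for one example cluster
```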
The BPNN consists of an input layer, an output layer, and hidden layers. The hidden layers adopt the rectified linear unit (ReLU) activation function [16] and the other layers use the linear activation function. The input of the BPNN is the channel information {\|h_{n,1}\|, ..., \|h_{n,K}\|} of one cluster's users, and the output is the corresponding power allocation factors \alpha_n. The loss function is defined as Loss = \min_{\omega_B, b_B} \|\alpha_n − \hat{\alpha}_n\|^2, and the stochastic gradient descent (SGD) method is used to update the network parameters {\omega_B, b_B} during training. The trained network can directly calculate the power allocation result from the channel information, so the online computational complexity is greatly reduced.
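A possible PyTorch realization of this BPNN is sketched below. The number and width of the hidden layers and the learning rate are assumptions; the letter only fixes the input/output, the ReLU hidden activations, the squared-error loss, and SGD training.

```python
import torch
import torch.nn as nn

K = 2   # users per cluster (example value)

# BPNN: channel norms of one cluster -> power allocation factors alpha_n.
bpnn = nn.Sequential(
    nn.Linear(K, 64), nn.ReLU(),      # hidden layers with ReLU activation
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, K),                 # linear output layer
)
optimizer = torch.optim.SGD(bpnn.parameters(), lr=1e-2)   # SGD updates {w_B, b_B}

def train(h_norms, alpha_labels, epochs=200):
    """h_norms: [num_samples, K] channel norms; alpha_labels: ESPA labels alpha_hat_n."""
    for _ in range(epochs):
        pred = bpnn(h_norms)
        # squared error between predicted alpha_n and the ESPA label, averaged over the batch
        loss = nn.functional.mse_loss(pred, alpha_labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return bpnn

# After training, online power allocation is a single forward pass, e.g.:
# alpha_n = bpnn(torch.tensor([[1.5, 0.6]]))
```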
Obviously, different BPNNs need to be trained for different configurations of K. In our work, the number of users K in every cluster is assumed identical, so the same pre-trained BPNN can be used by all the clusters. In a practical system, K might differ between user clusters according to the users' access requirements and their CSI. In this case, multiple BPNNs need to be pre-trained, one for each possible value of K. Fortunately, the possible values of K are limited to a very small set. From the various references surveyed, the number of superimposed users in a NOMA signal structure is usually selected as 2 or 3 [4][12][17][18]. Actually, when K is larger than 3, the error propagation in the SIC decoder becomes more serious and the total system performance decreases instead. This result can be seen by comparing the simulation curves in
TABLE I. SIMULATION PARAMETERS
Fig. 4. The system total spectrum efficiency (L = 8, K = 4). [Plot: spectrum efficiency (bit/s/Hz) versus power (W); curves: NLUPA[6]-FTPA[7], ES-BPNN, DQN-BPNN (proposed), DQN-FTPA[7], ES-FTPA[7], ES-ESPA.]

[Figure: CDF plot, F(R).]

V. CONCLUSIONS

This work mainly studies the downlink RA problem in the massive MIMO-NOMA system. In order to maximize the system spectrum efficiency while guaranteeing the worst-user performance constraint, a deep Q-learning network and a BP neural network are designed to realize the joint user clustering and the intra-cluster power allocation, respectively. The simulation results demonstrate the advantage of our scheme in improving the system spectrum efficiency.

REFERENCES

[1] F. Tang, Y. Kawamoto, N. Kato, and J. Liu, "Future Intelligent and Secure Vehicular Network Toward 6G: Machine-Learning Approaches," Proceedings of the IEEE, vol. 108, no. 2, pp. 292-307, Feb. 2020.
[2] Z. Shi, W. Gao, S. Zhang, J. Liu, and N. Kato, "AI-Enhanced Cooperative Spectrum Sensing for Non-Orthogonal Multiple Access," IEEE Wireless Communications, vol. 27, no. 2, pp. 173-179, Apr. 2020.
[3] G. Zhang, B. Wang, G. Li, et al., "Interference Management by Vertical Beam Control Combined with Coordinated Pilot Assignment and Power Allocation in 3D Massive MIMO Systems," KSII Trans. Internet and Inf. Syst., vol. 9, no. 8, pp. 2797-2820, Aug. 2015.
[4] Z. Ding, F. Adachi, and H. V. Poor, "The Application of MIMO to Non-Orthogonal Multiple Access," IEEE Transactions on Wireless Communications, vol. 15, no. 1, pp. 537-552, Jan. 2016.
[5] Y. Sun, D. W. K. Ng, Z. Ding, and R. Schober, "Optimal Joint Power and Subcarrier Allocation for Full-Duplex Multicarrier Non-Orthogonal Multiple Access Systems," IEEE Transactions on Communications, vol. 65, no. 3, pp. 1077-1091, Mar. 2017.