

A Deep Q-Network Based Resource Allocation Scheme for Massive MIMO-NOMA
Yanmei Cao, Guomei Zhang, Guobing Li, Jia Zhang
Abstract—In this paper, a deep Q-learning network (DQN) based resource allocation (RA) scheme is proposed for massive multiple-input multiple-output (MIMO) non-orthogonal multiple access (NOMA) systems. A reinforcement learning (RL) framework is developed to build an iterative optimization structure for user clustering, power allocation and beamforming. Specifically, a DQN is designed to group the users based on the reward item calculated after power allocation and beamforming, the objective being to maximize the reward item, i.e., the system throughput. A back propagation neural network (BPNN) is then used to realize the power allocation. During the training of the BPNN, the exhaustive search results over the quantized power set are taken as the output labels. Simulation experiments show that the proposed scheme achieves a system spectrum efficiency approximating that of the exhaustive search over user clustering and power allocation.

Index Terms—Non-Orthogonal Multiple Access, Massive Multiple-Input Multiple-Output, Resource Allocation, Deep Q-learning Network, Back Propagation Neural Network.

Manuscript received December 26, 2020; revised January 19, 2021. The research work reported in this paper is supported by the National Key R&D Program of China under Grant No. 2020YFB1807702 and the National Natural Science Foundation of China under Grant No. 61941119. All the authors are with the School of Information and Communications Engineering, Xi'an Jiaotong University, Xi'an, Shaanxi, 710049, China.

I. INTRODUCTION

Due to the extreme shortage of wireless spectrum resources, how to further improve spectrum efficiency and system capacity is still a key problem for the next generation of mobile communications[1]. Non-orthogonality and large dimensions are considered effective means of tackling this problem[2]. NOMA can support large-scale access by superposing the signals of multiple users on the same orthogonal resource block, while massive MIMO can realize high capacity by equipping the base station with a large-scale antenna array[3]. The combination of NOMA and massive MIMO exploits the degrees of freedom in both the power domain and the space domain, thereby further improving the system spectrum efficiency. However, as the user number and density in massive MIMO-NOMA systems grow, several contradictions become significant, including the conflict between transmission efficiency and the cumulative error propagation of the successive interference cancellation (SIC) decoder, and the trade-off between intra-cluster coverage enhancement and inter-cluster interference suppression. Specifically, to obtain the advantages of power-domain multiplexing, a large difference of channel gain among the superimposed users is required and different power levels are allocated to the users[4]. The SIC decoder then detects the signals according to the power order of the users, so error accumulation is an inherent problem of SIC: the more users are superimposed, the more serious the error propagation, which greatly limits the transmission efficiency. In addition, each beam has to cover all the users in one cluster in a MIMO-NOMA system, rather than a single user as in a MIMO-OMA system, so the trade-off between enhancing the intra-cluster coverage and eliminating the inter-cluster interference becomes more difficult for beamforming. Based on the above analysis, the joint optimization of user clustering, power allocation (PA) and beamforming becomes more urgent for massive MIMO-NOMA.

Unfortunately, such a joint RA problem with multiple variables has been proven to be NP-hard[5], so alternate optimization over the three parts is usually adopted. For the user clustering sub-problem, the optimal solution is obtained by searching all clustering combinations exhaustively. For scenarios with a large number of users, heuristic clustering algorithms, such as random pairing and the next-largest-difference-based user pairing algorithm (NLUPA)[6], have been proposed to reduce the complexity. As for PA, simple methods such as fixed power allocation (FPA) and fractional transmit power allocation (FTPA) have been utilized[7]. In addition, considering the tight coupling between user clustering and power allocation, joint optimization schemes have also been studied. In [8], a joint user pairing and dynamic power allocation scheme was proposed to maximize the energy efficiency. To improve the service quality of cell-edge users, dynamic user allocation and power optimization were studied under a user fairness criterion[9]. These works show that joint optimal transmission is more efficient. However, as the number of antennas increases, improving the system capacity while avoiding excessive computational complexity and energy consumption becomes the primary difficulty of the joint optimization. With the development of machine learning (ML) in wireless communications, ML solutions for RA and wireless transmission continue to emerge[10], such as the RL-based channel allocation in [11], the power allocation and subchannel assignment schemes based on a deep recurrent neural network (RNN) architecture for a NOMA-based heterogeneous Internet of Things system in [12], and the long short-term memory network based power allocation for NOMA systems in [13]. These works show the potential advantage of applying ML algorithms to traditional communication systems.

Motivated by the previous research, this paper develops an RL method to solve the complex joint RA problem for massive MIMO-NOMA systems. Traditional RL schemes such as Q-learning are not suitable here, because the state space is too large and the learning efficiency would be extremely low. Therefore, a deep Q-learning network (DQN) is adopted to realize the joint RA scheme in scenarios with a large number of users. A deeply coupled iterative structure involving three functional modules, namely user clustering, power allocation and beamforming, is then established based on the deep reinforcement learning (DRL) framework.


In the user clustering stage, the DQN is used to gradually adjust the clustering results to maximize the system throughput, which is calculated by the environment evaluator based on the previous clustering result, the power factors and the beamforming vectors. In the power assignment stage, a BPNN is designed to learn the relationship between the power allocation factors and the users' channel state information (CSI) in each user cluster. As for the beamforming, traditional methods such as the zero-forcing (ZF) algorithm or other optimized beamforming schemes can be used directly across the user clusters. Simulation experiments show that the iterative process converges after dozens of iterations and that the system performance approximates that of the exhaustive search scheme.

[Fig. 1: Joint Optimization Network based on DQN]

II. SYSTEM MODEL AND PROBLEM FORMULATION

Consider a single-cell multi-user downlink system. The base station (BS) is equipped with $N_t$ antennas and serves $L$ single-antenna users. All users in the cell are divided into $N$ clusters, each of which includes $K$ users. The users' data in one cluster are transmitted in the form of a power-domain NOMA signal structure and preprocessed by the same beamforming vector. Assume that the BS deploys its antennas on the Y-Z plane as a uniform planar array (UPA). Then, the channel vector from the BS to user $m$ can be modeled similarly as in [14]:

$$\mathbf{h}_m = \sqrt{(d_m/d_0)^{\mu}}\;\frac{1}{\sqrt{L_u}}\sum_{l=1}^{L_u} g_{m,l}\,\mathbf{b}(v_{m,l})\otimes\mathbf{a}(u_{m,l}) \tag{1}$$

Here $d_0$ is the cell radius and $d_m$ is the distance between the user and the BS. $\mu$ is the large-scale fading factor. $\mathbf{b}(v_{m,l})$ and $\mathbf{a}(u_{m,l})$ are the vertical and horizontal array response vectors of the UPA, respectively. The symbol "$\otimes$" in (1) denotes the Kronecker product of two matrices. $L_u$ is the number of scattering paths and $g_{m,l}$ denotes the small-scale fading coefficient.
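For illustration, the following NumPy sketch generates one channel vector according to (1). The half-wavelength steering-vector form, the uniformly drawn directional cosines, the CN(0,1) path gains and the value of $L_u$ are our assumptions; the letter only fixes the Kronecker structure and the large-scale term.

```python
import numpy as np

def upa_channel(Ny, Nz, d_m, d0, mu=3.6, Lu=8, rng=None):
    """Sketch of the UPA channel in (1). Ny*Nz = Nt antennas on the Y-Z plane.
    Steering-vector details and path-gain statistics are assumed, not from the letter."""
    rng = rng or np.random.default_rng()
    h = np.zeros(Ny * Nz, dtype=complex)
    for _ in range(Lu):
        u, v = rng.uniform(-1, 1, size=2)            # assumed directional cosines
        a = np.exp(1j * np.pi * np.arange(Ny) * u)   # horizontal response a(u), half-wavelength spacing
        b = np.exp(1j * np.pi * np.arange(Nz) * v)   # vertical response b(v)
        g = (rng.standard_normal() + 1j * rng.standard_normal()) / np.sqrt(2)
        h += g * np.kron(b, a)                       # Kronecker structure of (1)
    return np.sqrt((d_m / d0) ** mu) * h / np.sqrt(Lu)
```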
Assume the data sent by the BS is $\mathbf{X} = [x_1 \cdots x_N]^T \in \mathbb{C}^{N\times 1}$, where $x_n = \sum_{k=1}^{K}\sqrt{\alpha_{n,k}P_n}\,s_{n,k}$ is the superposed NOMA signal of the $K$ users in the $n$-th cluster. Here, $P_n$ is the total power of the $n$-th cluster, and $\alpha_{n,k}$ and $s_{n,k}$ are the power allocation factor and the transmitted symbol of the $k$-th user in the $n$-th cluster, denoted by $U_{n,k}$, respectively. It is assumed that $E[|s_{n,k}|^2]=1$. The received signal of user $U_{n,k}$ is

$$y_{n,k} = \mathbf{h}_{n,k}\mathbf{W}\mathbf{X} + z_{n,k} = \mathbf{h}_{n,k}\mathbf{w}_n\sqrt{\alpha_{n,k}P_n}\,s_{n,k} + \sum_{j=1,j\neq k}^{K}\mathbf{h}_{n,k}\mathbf{w}_n\sqrt{\alpha_{n,j}P_n}\,s_{n,j} + \sum_{i=1,i\neq n}^{N}\mathbf{h}_{n,k}\mathbf{w}_i x_i + z_{n,k} \tag{2}$$

where $\mathbf{W} = [\mathbf{w}_1 \cdots \mathbf{w}_N] \in \mathbb{C}^{N_t\times N}$ is the beamforming matrix and $\mathbf{h}_{n,k}\in\mathbb{C}^{1\times N_t}$ is the channel vector of user $U_{n,k}$. $z_{n,k}$ is complex white Gaussian noise with zero mean and variance $\sigma^2$. The second and third terms after the second equal sign in (2) represent the intra-cluster and inter-cluster interference, respectively. The beamforming vectors are designed to eliminate the inter-cluster interference, i.e., they should satisfy $\mathbf{h}_n\mathbf{w}_i = 0,\ i\neq n$. However, such an ideal result is difficult to achieve in practice, so the inter-cluster interference cannot be ignored. Moreover, suppose that the SIC detector at the receiver ideally cancels the interference from the previous users, and that the user channel qualities in the $n$-th cluster are ranked as $\|\mathbf{h}_{n,1}\|^2 \le \|\mathbf{h}_{n,2}\|^2 \le \cdots \le \|\mathbf{h}_{n,K}\|^2$. Then the signal to interference plus noise ratio of user $U_{n,k}$ is

$$\Phi_{n,k} = \frac{|\mathbf{h}_{n,k}\mathbf{w}_n|^2\,\alpha_{n,k}P_n}{\sum_{i=1,i\neq n}^{N}|\mathbf{h}_{n,k}\mathbf{w}_i|^2 P_i + |\mathbf{h}_{n,k}\mathbf{w}_n|^2\sum_{j=k+1}^{K}\alpha_{n,j}P_n + \sigma^2} \tag{3}$$

Furthermore, with the objective of maximizing the sum rate, the joint optimization problem can be written as:

$$\max_{\{\alpha_{n,k}\},\{\mathbf{w}_n\},\{U_{n,k}\}} R_{sum} = \sum_{n=1}^{N}\sum_{k=1}^{K} B\log_2(1+\Phi_{n,k}) \tag{4}$$
$$\text{s.t.}\ \ C1: \sum_{k=1}^{K}\alpha_{n,k}\le 1,\ \alpha_{n,k}\in[0,1]$$
$$C2: \sum_{n=1}^{N} P_n \le P$$
$$C3: R_{n,k} \ge R_{min}$$
$$C4: \|\mathbf{W}\|^2 = 1$$

where $B$ is the bandwidth of one user channel. C1 is the power allocation factor constraint within each user cluster, C2 denotes the total power constraint of the BS over one user channel, C3 ensures the minimum data rate of each user, and C4 is the norm constraint on the beamforming matrix.
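For concreteness, a minimal NumPy sketch evaluating the SINR of (3) and the objective of (4) is given below. The array shapes and the assumption that beams and powers are already fixed are ours; the loop directly mirrors the two interference sums in (3).

```python
import numpy as np

def sum_rate(H, W, alpha, P, sigma2=1e-3, B=1.0):
    """R_sum of (4) for channels H (N x K x Nt), beams W (Nt x N),
    power factors alpha (N x K) and per-cluster powers P (length N),
    under the SIC ordering ||h_{n,1}|| <= ... <= ||h_{n,K}|| of Sec. II."""
    N, K, _ = H.shape
    rate = 0.0
    for n in range(N):
        for k in range(K):
            gains = np.abs(H[n, k] @ W) ** 2                  # |h_{n,k} w_i|^2 for i = 1..N
            inter = sum(gains[i] * P[i] for i in range(N) if i != n)
            intra = gains[n] * P[n] * alpha[n, k + 1:].sum()  # users j > k not yet cancelled
            phi = gains[n] * alpha[n, k] * P[n] / (inter + intra + sigma2)
            rate += B * np.log2(1.0 + phi)
    return rate
```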
Because the joint optimization problem (4) is non-convex[5], the traditional solutions are heuristic or alternating iterative methods, which suffer from either high complexity or limited performance. In contrast, DRL can not only fully explore the hidden information in large amounts of data to improve its own learning performance, but also realize dynamic real-time interaction. Its strong generalization ability highlights its advantages in wireless RA. Therefore, this letter proposes a DRL-based joint optimization method for problem (4) in Section III.

III. DQN BASED RESOURCE ALLOCATION SCHEME

The proposed RA network based on the DQN is shown in Fig. 1. It includes three parts: user clustering, power allocation and beamforming; this section mainly concerns the first two.

A. User Clustering based on DQN

The user clustering problem is modeled as an RL task, which consists of an agent and an environment. Specifically, the user clustering module is taken as the agent and the performance of the massive MIMO-NOMA system is the environment. The actions $\{a_t\}$ taken by the agent are based on the expected rewards from the environment. For the considered system, each part of the RL framework is described as follows:


State space S: The CSI of all users in each cluster forms the current state of the $t$-th iteration, which is given by $s_t = \{[\mathbf{h}_{1,1},\cdots,\mathbf{h}_{1,K}],\cdots,[\mathbf{h}_{N,1},\cdots,\mathbf{h}_{N,K}]\}$.

Action space A: It should contain actions covering all possible clustering results. With $L$ users, the number of possible actions reaches $C_L^K C_{L-K}^K\cdots C_{L-(N-1)K}^K/N!$, so the size of the action space increases dramatically with the number of users; a short sketch of this count follows. The purpose of taking an action is to select a suitable group for each user. Taking the action $a_t = \{[U_{1,1}^t,\cdots,U_{1,K}^t],\cdots,[U_{N,1}^t,\cdots,U_{N,K}^t]\}$ in state $s_t$ results in the new state $s_{t+1}$; this impact is defined as $s_t \xrightarrow{a_t} s_{t+1}$. Here $s_{t+1}$ is $\{[\mathbf{h}_{1,1}^t,\cdots,\mathbf{h}_{1,K}^t],\cdots,[\mathbf{h}_{N,1}^t,\cdots,\mathbf{h}_{N,K}^t]\}$, which fully corresponds to the new user grouping result $a_t$, and $\mathbf{h}_{n,k}^t$ is the CSI of user $U_{n,k}^t$, i.e., the $k$-th user in the $n$-th group under the current action $a_t$.
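For concreteness, the following sketch evaluates this count; the two printed cases use cluster sizes consistent with the simulations of Section IV.

```python
from math import comb, factorial

def num_clusterings(L, N, K):
    """Ways to split L users into N unordered clusters of K users each:
    C(L,K) * C(L-K,K) * ... * C(L-(N-1)K,K) / N!  (assumes L == N * K)."""
    count, remaining = 1, L
    for _ in range(N):
        count *= comb(remaining, K)
        remaining -= K
    return count // factorial(N)

print(num_clusterings(8, 4, 2))    # 105 clustering actions for L=8, K=2
print(num_clusterings(12, 6, 2))   # 10395 actions for L=12, K=2 (six clusters)
```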
Reward function: The system sum rate $r_t = R_{sum}$ is taken as the reward, which also depends on $\{\alpha_{n,k}\}$ and $\{\mathbf{w}_n\}$. The ideal goal of RL is to maximize the cumulative discounted reward $R_t = \sum_{i=0}^{\infty}\gamma^i r_{t+i}$, where the discount factor is $\gamma\in[0,1]$. With such a target, however, the action of each iteration could only be finalized after all iterations are completed. Therefore, Q-learning uses the state-action value (Q-value) function defined in (5)[15], which determines the current action based only on the next iteration's value function. To find the optimal $a_t$ that maximizes the Q-value for a given state $s_t$, (5) has to be evaluated for all actions; if $L$ is very large, the complexity becomes extremely high and the algorithm converges slowly.

$$Q(s_t, a_t) = E\big[r_t + \gamma\max_{a_{t+1}} Q(s_{t+1}, a_{t+1})\,\big|\,s_t, a_t\big] \tag{5}$$
Deep Q-Network: To speed up the convergence of Q-learning, a deep Q-network is adopted to estimate the Q-values. A fully connected neural network with three hidden layers is used; its input is the current state and its output is the estimated Q-value of each action. The Q-network is represented as $Q(s_t, a_t, \omega)$, where $\omega$ is the weight set to be trained. To ensure the convergence of the Q-network's parameter updates, a target Q-network is included to provide relatively stable labels for training; it has the same structure and initial weights as the Q-network but keeps the old weights $\omega^{C-}$ from $C$ iterations earlier. Hence, the loss function used in training is

$$L(\omega) = E\big[\big(r + \gamma\max_{a'} Q(s', a', \omega^{C-}) - Q(s, a, \omega)\big)^2\big] \tag{6}$$

where the label $y = r + \gamma\max_{a'} Q(s', a', \omega^{C-})$ is calculated from the old weights $\omega^{C-}$ and the new state $s'$ reached by taking action $a$ in state $s$. In addition, the experience replay strategy of [15] is adopted in training. First, the observation samples $(s_t, a_t, r_t, s_{t+1})$ from previous iterations are stored. Then a mini-batch of samples is randomly drawn from the replay memory and fed into the target Q-network to produce the training labels. This breaks the correlations among data samples and makes the training convergent. Following the Q-network, an $\varepsilon$-greedy strategy is used to choose the current action: with exploration probability $\varepsilon$ an action is selected randomly from the whole action space, and with probability $(1-\varepsilon)$ the action with the maximum Q-value at the Q-network output is chosen.
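A minimal PyTorch-style sketch of this update is given below. The state dimension, hidden width and action count are illustrative; the batch size, learning rate, replay capacity, $C$, greedy factor and discount factor follow Table I. The replay memory is assumed to hold tensors.

```python
import random
from collections import deque
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Fully connected Q-network with three hidden layers (Sec. III-A).
    state_dim, width and n_actions are illustrative choices."""
    def __init__(self, state_dim=64, n_actions=105, width=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, n_actions))
    def forward(self, s):
        return self.net(s)

q_net, target_net = QNet(), QNet()
target_net.load_state_dict(q_net.state_dict())       # same initial weights
opt = torch.optim.SGD(q_net.parameters(), lr=0.01)   # learning rate from Table I
replay = deque(maxlen=200)                           # database capacity from Table I
gamma, batch, C = 0.8, 32, 10                        # discount, batch size, C from Table I
eps = 0.1                                            # greedy factor 0.9 -> explore with prob. 0.1 (assumed reading)

def select_action(state, n_actions=105):
    # epsilon-greedy: random exploration with prob. eps, otherwise argmax of Q
    if random.random() < eps:
        return random.randrange(n_actions)
    with torch.no_grad():
        return int(q_net(state).argmax())

def train_step(step):
    if len(replay) < batch:
        return
    s, a, r, s2 = map(torch.stack, zip(*random.sample(replay, batch)))
    with torch.no_grad():                            # labels from the target network, Eq. (6)
        y = r + gamma * target_net(s2).max(dim=1).values
    q = q_net(s).gather(1, a.long().view(-1, 1)).squeeze(1)
    loss = ((y - q) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    if step % C == 0:                                # refresh the old weights every C iterations
        target_net.load_state_dict(q_net.state_dict())
```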
B. Power Allocation based on BPNN

To ensure the effectiveness of the SIC receiver in power-domain NOMA, the powers of the users in the same cluster need to be assigned appropriately. Power allocation is the key to the compromise between the system sum rate and user fairness. Different from traditional optimization algorithms, a BPNN based power allocator is designed to reduce the computational complexity while maintaining good performance.

The task of power allocation is to calculate the users' power allocation factors $\alpha_n = [\alpha_{n,1},\ldots,\alpha_{n,K}]$, $\alpha_{n,k}\in[0,1]$, for each group under a given user clustering result. To exploit the nonlinear mapping ability of the BPNN in extracting the relationship between the power allocation and the users' CSI within a cluster, the BPNN needs to be trained on a large amount of labeled data. Here, the result $\hat{\alpha}_n$ of an exhaustive search based power allocation (ESPA) algorithm, executed over a finite power allocation factor set, is used as the training label for the BPNN. The finite set for ESPA is obtained by discretizing the continuous factor range with a small step size. To ensure user fairness, the constraint $R_{n,k}\ge R_{min}$ is enforced by the ESPA. Because the beamforming matrix is unknown when the training data are generated, the calculation of $R_{n,k}$ does not involve $\mathbf{W}$; its expression is

$$R_{n,k} = B\log_2\Big(1 + \|\mathbf{h}_{n,k}\|^2\alpha_{n,k}P_n\Big/\Big(\|\mathbf{h}_{n,k}\|^2\textstyle\sum_{j=k+1}^{K}\alpha_{n,j}P_n + \sigma^2\Big)\Big)$$

The BPNN consists of an input layer, an output layer and hidden layers. The hidden layers adopt the rectified linear unit (ReLU) activation function [16] and the other layers use linear activations. The input of the BPNN is the channel information $\{\|\mathbf{h}_{n,1}\|,\cdots,\|\mathbf{h}_{n,K}\|\}$ of one cluster's users, and the output is the corresponding power allocation factors $\alpha_n$. The loss function is defined as $Loss = \min_{\omega_B, b_B}\|\alpha_n - \hat{\alpha}_n\|^2$ and the stochastic gradient descent (SGD) method is used to update the network parameters $\{\omega_B, b_B\}$ during training. The trained network can then compute the power allocation directly from the channel information, greatly reducing the online computational complexity.
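A sketch of this allocator follows, in the same PyTorch setting as the previous sketch; the 32-64-32 hidden widths are those reported for K=2 in Section IV.

```python
import torch
import torch.nn as nn

class PowerBPNN(nn.Module):
    """BPNN power allocator of Sec. III-B: inputs are the K channel norms
    ||h_{n,k}||, output is the K power factors alpha_n. Hidden widths follow
    the 32-64-32 setting reported for K=2 in Sec. IV."""
    def __init__(self, K=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(K, 32), nn.ReLU(),
            nn.Linear(32, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, K))            # linear output layer, as in the letter
    def forward(self, h_norms):
        return self.net(h_norms)

bpnn = PowerBPNN(K=2)
opt = torch.optim.SGD(bpnn.parameters(), lr=0.01)
loss_fn = nn.MSELoss()                   # squared error ||alpha - alpha_hat||^2

def train_batch(h_norms, alpha_labels):
    """One SGD step on a batch of (channel norms, ESPA labels)."""
    opt.zero_grad()
    loss = loss_fn(bpnn(h_norms), alpha_labels)
    loss.backward()
    opt.step()
    return loss.item()
```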
get the training labels. The objective of this process is to break or 3[4][12][17][18]. Actually, with K being larger than 3, the
the correlations among data samples and make training conver- error propagation in the SIC decoder becomes more serious
gent. Following the Q-network, ε-greedy strategy is utilized to and the total system performance will decrease instead. This
choose the current action. That is to select an action randomly result can be seen from comparing the simulation curves in


Since $K$ is suitably 2 or 3, we only need to train two BPNNs in advance, and the corresponding network structures and parameters can be saved with an acceptable memory requirement. At run time, the BS calls the needed BPNN according to the actual number of users in each cluster. Through this process, the proposed BPNN PA scheme remains feasible even if the number of users per cluster is not fixed.

IV. SIMULATION RESULTS

In the simulation, one BS is located at the center of the cell and the users are randomly and uniformly distributed within the cell. The specific simulation parameters are shown in Table I. For the power allocation network, 100,000 groups of data are used to train the BPNN, with the power factor discretized in steps of 0.01. To verify the generalization performance of the BPNN, an additional 20,000 groups of data are used to test the network. In all schemes, the ZF beamforming algorithm is adopted, with the beamforming vector calculated from the best-quality CSI in each cluster. The five benchmark schemes correspond to different combinations of user clustering methods, namely DQN, NLUPA and exhaustive search (ES), and power allocation schemes, namely FTPA, BPNN and ESPA.

TABLE I: SIMULATION PARAMETERS

System parameter            Value         Network parameter    Value
Cell radius                 1 km          Batch size           32
Number of antennas          32            Loop                 500
Number of users             8, 12         Episode              10
Number of clusters          4, 2          Learning rate        0.01
Transmission power          0.08-4 W      Database capacity    200
Noise power                 1 mW          C                    10
Large-scale fading factor   3.6           Greedy factor        0.9
User rate constraint        2 bit/s/Hz    Discount factor      0.8
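To illustrate how the ESPA training labels described in Section III-B can be generated, the sketch below searches the 0.01 grid for one K=2 cluster. The assumption that the cluster uses its full power budget ($\alpha_1+\alpha_2=1$) is ours; the noise power, step size and rate constraint follow the stated settings, and the decoding order follows the SINR model of (3).

```python
import numpy as np

def espa_label(h_norms, Pn, sigma2=1e-3, B=1.0, step=0.01, R_min=2.0):
    """Exhaustive-search power allocation (ESPA) label for one K=2 cluster,
    using the W-free rate of Sec. III-B. h_norms = (||h_1||, ||h_2||) with
    ||h_1|| <= ||h_2||. Returns the best (alpha_1, alpha_2) on the grid."""
    best, best_rate = None, -np.inf
    for a1 in np.arange(step, 1.0, step):
        a = np.array([a1, 1.0 - a1])
        # user 1 (weaker channel) decodes first, with user 2's power as interference
        r1 = B * np.log2(1 + h_norms[0]**2 * a[0] * Pn /
                         (h_norms[0]**2 * a[1] * Pn + sigma2))
        # user 2 (stronger channel) cancels user 1 via SIC, sees only noise
        r2 = B * np.log2(1 + h_norms[1]**2 * a[1] * Pn / sigma2)
        if r1 >= R_min and r2 >= R_min and r1 + r2 > best_rate:
            best, best_rate = a, r1 + r2
    return best  # None if no grid point satisfies the rate constraints
```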

[Fig. 2: The system total spectrum efficiency (L=8, K=2)]

In Fig. 2, the system spectrum efficiencies of the different schemes are compared for L=8 and K=2, with the total transmission power ranging over 0.08-4 W. Through extensive experiments, the BPNN was adjusted to 5 layers with the three hidden layers' node numbers set to 32, 64 and 32. It can be observed that DQN-FTPA outperforms NLUPA-FTPA thanks to the DQN based user clustering, which accounts for the real-time interaction with the current environment. The DQN result almost reaches that of ES clustering, which can be taken as the upper bound of the user clustering performance. Besides, adopting the BPNN power allocation on top of the DQN user clustering further improves the system spectrum efficiency by about 3 bit/s/Hz for powers above 1 W. Furthermore, Fig. 2 also shows the curves of ESPA, BPNN based PA and FTPA combined with ES user clustering. From these curves, the proposed BPNN PA outperforms FTPA by about 3 bit/s/Hz at a power of 4 W. Its performance is almost the same as ESPA's, while the complexity is reduced markedly by using the pre-trained BPNN.

[Fig. 3: The system total spectrum efficiency (L=12, K=2)]

In addition, Fig. 3 shows the simulation results as the number of users increases to 12. The number of users per cluster is still 2, so the same pre-trained BPNN as in Fig. 2 is adopted. From Fig. 3, the proposed DQN-BPNN RA scheme still maintains performance comparable to the ES-ESPA case and shows a significant gain over the NLUPA-FTPA method in scenarios with more users. Furthermore, the total system spectrum efficiency increases by about 7 bit/s/Hz at a power of 4 W when the number of users grows from 8 to 12.

[Fig. 4: The system total spectrum efficiency (L=8, K=4)]

Moreover, we change the number of intra-cluster users and observe its effect on the system performance, as shown in Fig. 4. Here, K increases from 2 to 4 while L is kept at 8. In this case, the BPNN needs to be retrained; the network is adjusted to 5 hidden layers with 32, 64, 128, 64 and 32 nodes, respectively. Our scheme shows an even larger performance gain over the NLUPA-FTPA method in this case. However, the system spectrum efficiency of all schemes under K=4 is significantly lower than in Fig. 2 (K=2), which is caused by the serious intra-cluster error propagation in the SIC decoder. The simulation in Fig. 4 shows that


the given BPNN based PA is applicable to the other cases of K, and that K is suitably set below 4 due to the limitation of the intra-cluster error propagation.

[Fig. 5: CDF curves of the user's spectrum efficiency]

Fig. 5 shows the CDF curves of the user spectrum efficiency obtained by the proposed scheme and by ES-FTPA under a total transmission power of 4 W, L=8 and K=2. The dotted line is the performance of the ideal case in which the inter-cluster interference is ignored. The DQN-BPNN scheme obtains an obvious gain in system spectrum efficiency over the traditional method. However, the performance of the edge users improves only slightly and still has a large gap to the ideal situation. Although the minimum rate constraint has been considered in the BPNN, the inter-cluster interference, which cannot be suppressed by ZF beamforming, still worsens the final performance. Moreover, even in the ideal situation without inter-cluster interference, the lowest user's spectrum efficiency cannot be kept at Rmin. This is because FTPA rather than ESPA is used to generate the training labels in some extreme cases, in which Rmin cannot be achieved for a few users no matter how the powers are adjusted within the cluster, owing to their very poor channel conditions. Nevertheless, nearly 90% of the users still achieve performance better than Rmin under the proposed scheme.

V. CONCLUSIONS

This work studies the downlink RA problem in the massive MIMO-NOMA system. To maximize the system spectrum efficiency under the premise of ensuring the worst user performance constraint, a deep Q-learning network and a BP neural network are designed to realize the joint user clustering and the intra-cluster power allocation, respectively. The simulation results demonstrate the advantage of our scheme in improving the system spectrum efficiency.

REFERENCES

[1] F. Tang, Y. Kawamoto, N. Kato and J. Liu, "Future Intelligent and Secure Vehicular Network Toward 6G: Machine-Learning Approaches," Proceedings of the IEEE, vol. 108, no. 2, pp. 292-307, Feb. 2020.
[2] Z. Shi, W. Gao, S. Zhang, J. Liu and N. Kato, "AI-Enhanced Cooperative Spectrum Sensing for Non-Orthogonal Multiple Access," IEEE Wireless Communications, vol. 27, no. 2, pp. 173-179, April 2020.
[3] G. Zhang, B. Wang, G. Li, et al., "Interference Management by Vertical Beam Control Combined with Coordinated Pilot Assignment and Power Allocation in 3D Massive MIMO Systems," KSII Trans. Internet and Inf. Syst., vol. 9, no. 8, pp. 2797-2820, Aug. 2015.
[4] Z. Ding, F. Adachi and H. V. Poor, "The Application of MIMO to Non-Orthogonal Multiple Access," IEEE Transactions on Wireless Communications, vol. 15, no. 1, pp. 537-552, Jan. 2016.
[5] Y. Sun, D. W. K. Ng, Z. Ding and R. Schober, "Optimal Joint Power and Subcarrier Allocation for Full-Duplex Multicarrier Non-Orthogonal Multiple Access Systems," IEEE Transactions on Communications, vol. 65, no. 3, pp. 1077-1091, March 2017.
[6] S. M. R. Islam, M. Zeng, O. A. Dobre and K. Kwak, "Resource Allocation for Downlink NOMA Systems: Key Techniques and Open Issues," IEEE Wireless Commun., vol. 25, no. 2, pp. 40-47, April 2018.
[7] A. Benjebbovu, A. Li, Y. Saito, Y. Kishiyama, A. Harada and T. Nakamura, "System-level Performance of Downlink NOMA for Future LTE Enhancements," 2013 IEEE Globecom Workshops, Atlanta, GA, pp. 66-70, Dec. 2013.
[8] S. Chinnadurai, P. Selvaprabhu and M. H. Lee, "A Novel Joint User Pairing and Dynamic Power Allocation Scheme in MIMO-NOMA System," 2017 Int. Conf. Inf. Commun. Technol. Converg., Jeju, South Korea, pp. 951-953, Oct. 2017.
[9] Y. Liu, M. Elkashlan, Z. Ding and G. K. Karagiannidis, "Fairness of User Clustering in MIMO Non-orthogonal Multiple Access Systems," IEEE Communications Letters, vol. 20, no. 7, pp. 1465-1468, July 2016.
[10] R. Zhu and G. Zhang, "A Segment-Average based Channel Estimation Scheme for One-bit Massive MIMO Systems with Deep Neural Network," 2019 IEEE International Conference on Communication Technology, Xi'an, China, pp. 81-86, Oct. 2019.
[11] C. He, Y. Hu, Y. Chen and B. Zeng, "Joint Power Allocation and Channel Assignment for NOMA with Deep Reinforcement Learning," IEEE J. Sel. Areas Commun., vol. 37, no. 10, pp. 2200-2210, Oct. 2019.
[12] M. Liu, T. Song and G. Gui, "Deep Cognitive Perspective: Resource Allocation for NOMA-Based Heterogeneous IoT With Imperfect SIC," IEEE Internet Things J., vol. 6, no. 2, pp. 2885-2894, April 2019.
[13] G. Gui, H. Huang, Y. Song and H. Sari, "Deep Learning for An Effective Nonorthogonal Multiple Access Scheme," IEEE Transactions on Vehicular Technology, vol. 67, no. 9, pp. 8440-8450, Sep. 2018.
[14] D. Ying, F. W. Vook, T. A. Thomas, et al., "Kronecker Product Correlation Model and Limited Feedback Codebook Design in a 3D Channel Model," 2014 IEEE International Conference on Communications, Sydney, NSW, pp. 5865-5870, June 2014.
[15] S. Wang, H. Liu, P. H. Gomes, et al., "Deep Reinforcement Learning for Dynamic Multichannel Access in Wireless Networks," IEEE Transactions on Cognitive Communications and Networking, vol. 4, no. 2, pp. 257-265, June 2018.
[16] A. Krizhevsky, I. Sutskever and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," Proceedings of NIPS, pp. 1097-1105, Jan. 2012.
[17] "TP for Classification of MUST Schemes," document R1-154999, 3GPP, 2015.
[18] J. M. Kang, I. M. Kim and C. J. Chun, "Deep Learning-Based MIMO-NOMA With Imperfect SIC Decoding," IEEE Systems Journal, vol. 14, no. 3, pp. 3414-3417, Sept. 2020.
