CN116406004A

CN116406004A - Construction method and resource management method of wireless network resource allocation system

Info

Publication number: CN116406004A
Application number: CN202310354794.0A
Authority: CN
Inventors: 时宁哲; 刘玲; 周一青; 石晶林
Original assignee: Institute of Computing Technology of CAS
Current assignee: Institute of Computing Technology of CAS
Priority date: 2023-04-06
Filing date: 2023-04-06
Publication date: 2023-07-07
Also published as: WO2024207564A1

Abstract

The invention provides a construction method of a wireless network resource allocation system, which is used for obtaining a wireless network resource allocation strategy according to a wireless network state, and comprises the following steps: s1, acquiring a non-convex optimization target with interruption probability constraint corresponding to wireless communication requirements in a non-perfect global channel state information environment; s2, converting the obtained non-convex optimization target to obtain a non-convex optimization target without interruption probability constraint; s3, acquiring imperfect global channel state information of the wireless network; s4, training the initial resource allocation system to be converged by taking the obtained non-convex optimization target as a training target and taking the imperfect global channel state information in the step S3 as input and adopting a reinforcement learning mode. The invention adopts more practical CSI (imperfect global channel state information) to train the initial distribution system based on learning, improves the convergence rate of the wireless network resource distribution system and improves the performance of completing the optimization target.

Description

Construction method and resource management method of wireless network resource allocation system

技术领域Technical Field

本发明涉及无线通信领域，具体来说，涉及无线通信网络资源分配领域，更具体地说，涉及一种无线网络资源分配系统的构建方法、基于此的无线网络资源管理方法及无线通信系统。The present invention relates to the field of wireless communications, in particular to the field of wireless communications network resource allocation, and more particularly to a method for constructing a wireless network resource allocation system, a wireless network resource management method based thereon, and a wireless communications system.

背景技术Background Art

现有技术中，通过增加无线通信网络中的空间频谱重用率以及在无线通信网络中大量部署无线接入点(access point，AP)来提高无线网络中用户的传输速率和网络容量。然而，在无线接入点密集且不规则的部署场景中，无线通信网络中会有特别严重的同信道干扰(co-channel interference，CCI)。此外，随着无线通信网络中部署的基站(basestation，BS)数量的增加，无线网络资源的分配不合理还会进一步的增加CCI并降低无线网络的频谱效率等通信性能。因此，需要通过优化无线网络资源分配(例如信道分配策略和功率分配策略)，来降低无线网络中的CCI并提高无线网络的频谱效率等通信性能。In the prior art, the transmission rate and network capacity of users in a wireless network are improved by increasing the spatial spectrum reuse rate in the wireless communication network and deploying a large number of wireless access points (AP) in the wireless communication network. However, in scenarios where wireless access points are densely and irregularly deployed, there will be particularly serious co-channel interference (CCI) in the wireless communication network. In addition, with the increase in the number of base stations (BS) deployed in the wireless communication network, the unreasonable allocation of wireless network resources will further increase the CCI and reduce the communication performance such as the spectrum efficiency of the wireless network. Therefore, it is necessary to reduce the CCI in the wireless network and improve the communication performance such as the spectrum efficiency of the wireless network by optimizing the allocation of wireless network resources (such as channel allocation strategy and power allocation strategy).

现有技术中解决无线网络中资源分配问题的主要方法有两类，即基于模型驱动的优化算法和基于学习的优化算法。There are two main methods for solving the resource allocation problem in wireless networks in the prior art, namely, model-driven optimization algorithms and learning-based optimization algorithms.

其中，基于模型驱动的优化算法，通常是假设完美的全局信道状态信息(channelstate information，CSI)来优化资源分配问题，当将其应用到实际无线通信环境中时，会具有过高的计算复杂度从而导致较大的时延以及较高的能耗的缺点，在解决无线网络中资源分配问题中的性能次优，在实际中很难进行部署及应用。Among them, model-driven optimization algorithms usually assume perfect global channel state information (CSI) to optimize resource allocation problems. When applied to actual wireless communication environments, they have the disadvantages of too high computational complexity, resulting in large delays and high energy consumption. Their performance in solving resource allocation problems in wireless networks is suboptimal, and they are difficult to deploy and apply in practice.

而基于学习的优化算法通常是基于深度强化学习来实现优化。其中，深度强化学习基于深度学习强大的感知能力来处理复杂的、高维的环境特征，并结合强化学习的思想与环境进行交互，完成决策过程，因此，深度强化学习(deep reinforcement learning，DRL)被成功应用于各种领域(无人驾驶决策、工业机器人控制以及推荐系统)。对于无线通信领域而言，由于无线通信环境的动态性，无线通信环境中的资源分配也可以被建模为动态决策过程。因此，基于深度强化学习的无线资源管理的方法应用于无线资源分配任务可以解决传统无线资源分配方法存在的问题。相比于基于模型驱动的资源优化算法，基于学习的优化算法可以有效降低资源分配的计算复杂度，更可能在未来的无线网络架构中进行部署及应用。现有技术中，无线通信技术领域中，基于学习的优化算法普遍使用完美CSI进行无线网络中的资源分配。但是由于信道估计的误差以及信道反馈的时延的客观存在，使得真正的完美CSI很难获取到，因此，在无线资源管理任务中，考虑无线环境中更切合实际的非完美CSI是非常必要的，这从参考文献[1]-[8]中的研究就可以看到，基于非完美CSI的优化更加切合实际。但是，正如前面所述的，现有技术下的基于学习的优化方法普遍是基于完美CSI的，例如参考文献[3]-[7]、[9]均基于完美CSI来设计优化目标，算法收敛速度较慢，且频谱效率等性能较低。并且，从参考文献[10]-[12]的研究中可以看到，完美CSI在实际环境中是很难获得的。The learning-based optimization algorithm is usually based on deep reinforcement learning to achieve optimization. Among them, deep reinforcement learning uses the powerful perception ability of deep learning to process complex and high-dimensional environmental features, and combines the idea of reinforcement learning to interact with the environment to complete the decision-making process. Therefore, deep reinforcement learning (DRL) has been successfully applied to various fields (unmanned driving decision-making, industrial robot control, and recommendation systems). For the field of wireless communications, due to the dynamic nature of the wireless communication environment, resource allocation in the wireless communication environment can also be modeled as a dynamic decision-making process. Therefore, the application of the wireless resource management method based on deep reinforcement learning to the wireless resource allocation task can solve the problems existing in the traditional wireless resource allocation method. Compared with the model-driven resource optimization algorithm, the learning-based optimization algorithm can effectively reduce the computational complexity of resource allocation and is more likely to be deployed and applied in the future wireless network architecture. In the prior art, in the field of wireless communication technology, the learning-based optimization algorithm generally uses perfect CSI for resource allocation in wireless networks. However, due to the objective existence of channel estimation errors and channel feedback delays, it is difficult to obtain truly perfect CSI. Therefore, in wireless resource management tasks, it is very necessary to consider more realistic imperfect CSI in the wireless environment. As can be seen from the research in references [1]-[8], optimization based on imperfect CSI is more realistic. However, as mentioned above, the learning-based optimization methods in the existing technology are generally based on perfect CSI. For example, references [3]-[7] and [9] all design optimization targets based on perfect CSI, and the algorithm converges slowly, and the performance such as spectrum efficiency is low. In addition, it can be seen from the research in references [10]-[12] that perfect CSI is difficult to obtain in a practical environment.

综上所述，现有的基于学习的方法没有针对非完美CSI进行设计，并且实际的通信环境中由于信道估计误差是客观存在且无法完全消除，直接采用现有的基于学习的算法在非完美CSI环境下所能达到的优化目标效果(即通信性能)较差且算法的收敛速度较低。因此，亟需一种更加有效的DRL架构，能够基于非完美CSI来进行无线网络中的资源分配策略优化。In summary, the existing learning-based methods are not designed for imperfect CSI, and in actual communication environments, since channel estimation errors exist objectively and cannot be completely eliminated, the optimization target effect (i.e., communication performance) that can be achieved by directly using existing learning-based algorithms in imperfect CSI environments is poor and the convergence speed of the algorithm is low. Therefore, there is an urgent need for a more effective DRL architecture that can optimize resource allocation strategies in wireless networks based on imperfect CSI.

参考文献：References:

[1]Y.Teng,M.Liu,F.R.Yu,V.C.M.Leung,M.Song,and Y.Zhang,“Resourceallocation for ultra-dense networks:A survey,some research issues andchallenges,”IEEE Commun.Surv.Tut.,vol.21,no.3,pp.2134–2168,Jul.–Sep.2019.[1] Y. Teng, M. Liu, F. R. Yu, V. C. M. Leung, M. Song, and Y. Zhang, "Resource allocation for ultra-dense networks: A survey, some research issues and challenges," IEEE Commun. Surv. Tut. ,vol.21,no.3,pp.2134–2168,Jul.–Sep.2019.

[2]L.Liu,Y.Zhou,W.Zhuang,J.Yuan,and L.Tian,“Tractable coverageanalysis for hexagonal macrocell-based heterogeneous UDNs with adaptiveinterference-aware CoMP,”IEEE Trans.Wireless Commun.,vol.18,no.1,pp.503–517,Jan.2019.[2] L.Liu, Y.Zhou, W.Zhuang, J.Yuan, and L.Tian, “Tractable coverage analysis for hexagonal macrocell-based heterogeneous UDNs with adaptiveinterference-aware CoMP,” IEEE Trans.Wireless Commun., vol. 18, no.1, pp.503–517, Jan.2019.

[3]Y.Zhang,C.Kang,T.Ma,Y.Teng,and D.Guo,“Power allocation in multi-cell networks using deep reinforcement learning,”in Proc.IEEE 88thVeh.Technol.Conf.(VTC-Fall),2018,pp.1–6.[3]Y.Zhang, C.Kang, T.Ma, Y.Teng, and D.Guo, "Power allocation in multi-cell networks using deep reinforcement learning," in Proc.IEEE 88thVeh.Technol.Conf.(VTC -Fall), 2018, pp.1–6.

[4]S.Lahoud,K.Khawam,S.Martin,G.Feng,Z.Liang,and J.Nasreddine,“Energy-efficient joint scheduling and power control in multicell wirelessnetworks,”IEEE J.Sel.Areas Commun.,vol.34,no.12,pp.3409–3426,Dec.2016.[4] S. Lahoud, K. Khawam, S. Martin, G. Feng, Z. Liang, and J. Nasreddine, "Energy-efficient joint scheduling and power control in multicell wireless networks," IEEE J. Sel. Areas Commun. ,vol.34,no.12,pp.3409–3426,Dec.2016.

[5]K.Shen and W.Yu,“Fractional programming for communicationsystems—Part I:Power control and beamforming,”IEEE Trans.Signal Process.,vol.66,no.10,pp.2616—2630,May 2018.[5] K.Shen and W.Yu, "Fractional programming for communication systems—Part I: Power control and beamforming," IEEE Trans.Signal Process., vol.66, no.10, pp.2616—2630, May 2018.

[6]F.Meng,P.Chen,L.Wu,and J.Cheng,“Power allocation in multi-usercellular networks:Deep reinforcement learning approaches,”IEEE Trans.WirelessCommun.,vol.19,no.10,pp.6255–6267,Oct.2020.[6] F.Meng, P.Chen, L.Wu, and J.Cheng, “Power allocation in multi-user cellular networks: Deep reinforcement learning approaches,” IEEE Trans.WirelessCommun., vol.19, no.10, pp .6255–6267,Oct.2020.

[7]J.Tan,Y.-C.Liang,L.Zhang,and G.Feng,“Deep reinforcement learningfor joint channel selection and power control in D2D networks,”IEEETrans.Wireless Commun.,vol.20,no.2,pp.1363–1378,Feb.2021.[7] J.Tan, Y.-C.Liang, L.Zhang, and G.Feng, "Deep reinforcement learning for joint channel selection and power control in D2D networks," IEEETrans.Wireless Commun., vol.20, no. 2, pp.1363–1378, Feb. 2021.

[8]Y.Guo,F.Zheng,J.Luo,and X.Wang,“Optimal resource allocation viamachine learning in coordinated downlink multi-cell OFDM networks underimperfect CSI,”in Proc.Veh.Technol.Conf.(VTC-Spring),2020,pp.1–6[8] Y.Guo, F.Zheng, J.Luo, and X.Wang, “Optimal resource allocation via machine learning in coordinated downlink multi-cell OFDM networks underimperfect CSI,” in Proc.Veh.Technol.Conf.(VTC- Spring), 2020, pp.1–6

[9]Y.S.Nasir and D.Guo,“Deep Reinforcement Learning for JointSpectrum and Power Allocation in Cellular Networks,”2021 IEEE GlobecomWorkshops(GC Wkshps),2021,pp.1-6.[9] Y.S.Nasir and D.Guo, "Deep Reinforcement Learning for JointSpectrum and Power Allocation in Cellular Networks," 2021 IEEE GlobecomWorkshops (GC Wkshps), 2021, pp.1-6.

[10]T.Yoo and A.Goldsmith,“Capacity and power allocation for fadingMIMO channels with channel estimation error,”IEEE Trans.Inf.Theory,vol.52,no.5,pp.2203–2214,May 2006.[10] T.Yoo and A.Goldsmith, "Capacity and power allocation for fadingMIMO channels with channel estimation error," IEEE Trans.Inf.Theory, vol.52, no.5, pp.2203–2214, May 2006.

[11]F.Fang,H.Zhang,J.Cheng,S.Roy,and V.C.M.Leung,“Joint userscheduling and power allocation optimization for energy-efficient NOMAsystems with imperfect CSI,”IEEE J.Sel.Areas Commun.,vol.35,no.12,pp.2874–2885,Dec.2017.[11] F. Fang, H. Zhang, J. Cheng, S. Roy, and V. C. M. Leung, “Joint users scheduling and power allocation optimization for energy-efficient NOMAsystems with imperfect CSI,” IEEE J. Sel. Areas Commun., vol. .35, no.12, pp.2874–2885, Dec.2017.

[12]X.Wang,F.-C.Zheng,P.Zhu,and X.You,“Energy-efficient resourceallocation in coordinated downlink multicell OFDMA systems,”IEEETrans.Veh.Technol.,vol.65,no.3,pp.1395–1408,Mar.2016.[12]X.Wang, F.-C.Zheng, P.Zhu, and ,pp.1395–1408,Mar.2016.

发明内容Summary of the invention

因此，本发明的目的在于克服上述现有技术的缺陷，提供一种无线网络资源分配系统的构建方法、基于此的无线网络资源管理方法及无线通信系统。Therefore, the purpose of the present invention is to overcome the defects of the above-mentioned prior art and provide a method for constructing a wireless network resource allocation system, a wireless network resource management method based thereon and a wireless communication system.

本发明的目的是通过以下技术方案实现的：The objective of the present invention is achieved through the following technical solutions:

根据本发明的第一方面，提供一种无线网络资源分配系统的构建方法，所述无线网络资源分配系统用于根据无线网络状态获得无线网络资源分配策略，所述方法包括：S1、获取在非完美全局信道状态信息环境下无线通信需求对应的具有中断概率约束的非凸优化目标；S2、对步骤S1获得的非凸优化目标进行转换以获得不含中断概率约束的非凸优化目标；S3、获取无线网络的非完美全局信道状态信息；S4、以步骤S2中的非凸优化目标为训练目标，并以步骤S3中的非完美全局信道状态信息为输入，采用强化学习的方式训练初始资源分配系统至收敛，其中，所述初始资源分配系统为基于智能体构建的且用于基于无线网络状态生成动作集合的系统，所述动作集合包括信道分配策略和功率分配策略。According to a first aspect of the present invention, a method for constructing a wireless network resource allocation system is provided, wherein the wireless network resource allocation system is used to obtain a wireless network resource allocation strategy according to a wireless network state, and the method comprises: S1, obtaining a non-convex optimization target with an interruption probability constraint corresponding to a wireless communication demand in an environment of non-perfect global channel state information; S2, converting the non-convex optimization target obtained in step S1 to obtain a non-convex optimization target without an interruption probability constraint; S3, obtaining non-perfect global channel state information of the wireless network; S4, using the non-convex optimization target in step S2 as a training target, and using the non-perfect global channel state information in step S3 as input, and adopting a reinforcement learning method to train an initial resource allocation system until convergence, wherein the initial resource allocation system is a system constructed based on an intelligent agent and used to generate an action set based on a wireless network state, and the action set includes a channel allocation strategy and a power allocation strategy.

在本发明的一些实施例中，所述无线通信需求为最大化无线网络的频谱效率，所述具有中断概率约束的非凸优化目标：In some embodiments of the present invention, the wireless communication requirement is to maximize the spectrum efficiency of the wireless network, and the non-convex optimization objective with an interruption probability constraint is:

其中，

in,

其中，

表示无线网络在时隙t的平均频谱效率,K表示链路总数N表示子信道总数，

表示子信道索引的集合，

表示第k条链路在时隙t选择第n个子信道的调度频谱效率，

表示第k条链路在时隙t选择第n个子信道的最大频谱效率，

表示第k条链路在时隙t选择第n个子信道的估计小尺度衰落分量，

表示在估计小尺度衰落分量

为条件下

的概率，

表示第k条链路在时隙t选择第n个子信道的功率，p^t表示所有

功率的集合，

表示第k条链路在时隙t选择第n个子信道后的标识值，α^t表示所有

标识值组成的集合，ε_out表示期望的中断概率，P_max表示链路的功率阈值，约束M1表示在估计小尺度衰落分量

为条件下任意一条链路在时隙t选择任意一个子信道后被中断的概率要小于期望的中断概率，约束M2表示每条链路上的发射功率不能高于链路的功率阈值，约束M3和M4表示每条链路在每个时隙只能选择一个子信道。in,

represents the average spectrum efficiency of the wireless network in time slot t, K represents the total number of links, and N represents the total number of subchannels.

represents a set of subchannel indices,

represents the scheduling spectrum efficiency of the kth link selecting the nth subchannel in time slot t,

represents the maximum spectral efficiency of the kth link selecting the nth subchannel in time slot t,

represents the estimated small-scale fading component of the nth subchannel selected by the kth link at time slot t,

Indicates that in estimating the small-scale fading component

Under the condition

The probability of

represents the power of the nth subchannel selected by the kth link in time slot t, and p ^t represents all

A collection of power,

represents the identification value of the kth link after selecting the nth subchannel in time slot t, and α ^t represents all

The set of identification values, ε _out represents the expected outage probability, P _max represents the power threshold of the link, and constraint M1 represents the estimation of the small-scale fading component.

Under the condition, the probability of any link being interrupted after selecting any subchannel in time slot t is less than the expected interruption probability. Constraint M2 means that the transmission power on each link cannot be higher than the power threshold of the link. Constraints M3 and M4 mean that each link can only select one subchannel in each time slot.

在本发明的一些实施例中，在所述步骤S2中，采用参数变换的方式对非凸优化目标进行转换以获得不含中断概率约束的非凸优化目标：In some embodiments of the present invention, in step S2, the non-convex optimization objective is transformed by parameter transformation to obtain a non-convex optimization objective without interruption probability constraints:

其中，

in,

其中，Ω^t表示的无线网络经参数变换后在时隙t的平均频谱效率。Wherein, Ω ^t represents the average spectrum efficiency of the wireless network in time slot t after parameter transformation.

在本发明的一些实施例中，所述初始资源分配系统包括：信道分配模型和功率分配模型，所述信道分配模型用于基于一个时隙的非完美全局信道状态信息预测该时隙的信道分配策略，其配置为DQN网络、DDQN网络或Dueling DQN网络；所述功率分配模型用于基于一个时隙的非完美全局信道状态信息预测该时隙的功率分配策略，其配置为DDPG网络。In some embodiments of the present invention, the initial resource allocation system includes: a channel allocation model and a power allocation model, the channel allocation model is used to predict the channel allocation strategy of a time slot based on the imperfect global channel state information of the time slot, and is configured as a DQN network, a DDQN network or a Dueling DQN network; the power allocation model is used to predict the power allocation strategy of the time slot based on the imperfect global channel state information of the time slot, and is configured as a DDPG network.

在本发明的一些实施例中，所述步骤S4包括步骤S41、步骤S42和步骤S43。其中，步骤S41包括：获取输入时隙的非完美全局信道状态信息并执行如下步骤：S411、由信道分配模型根据输入时隙的非完美全局信道状态信息预测该输入时隙的信道分配策略，基于所述预测的信道分配策略更新输入时隙的非完美全局信道状态信息，并且由功率分配模型根据更新后的输入时隙的非完美全局信道状态信息预测该输入时隙的功率分配策略；由预测的输入时隙的信道分配策略和功率分配策略与无线网络进行交互得到输入时隙的下一时隙的非完美全局信道状态信息，以及由信道分配模型根据输入时隙的下一时隙的非完美全局信道状态信息预测输入时隙的下一时隙的信道分配策略，基于所述输入时隙的下一时隙的信道分配策略更新输入时隙的下一时隙的信道分配策略；S412、基于输入时隙的信道分配策略和功率分配策略计算该输入时隙的频谱效率奖励；S413、以输入时隙的非完美全局信道状态信息、输入时隙的信道分配策略、输入时隙的频谱效率奖励、输入时隙的下一时隙的非完美全局信道状态信息为一条信道分配经验存入信道选择回放池；以更新后的输入时隙的非完美全局信道状态信息、输入时隙的功率分配策略、输入时隙的频谱效率奖励、更新后的输入时隙的下一时隙的非完美全局信道状态信息为一条功率分配经验存入功率选择回放池。步骤S42包括：以上一次输入时隙的下一时隙的非完美全局信道状态信息为新的输入时隙的非完美全局信道状态信息。步骤S43包括：基于信道选择回放池中的信道分配经验和功率选择回放池中的功率分配更新初始资源分配系统参数直至收敛。In some embodiments of the present invention, step S4 includes step S41, step S42 and step S43. Step S41 includes: obtaining the non-perfect global channel state information of the input time slot and executing the following steps: S411, the channel allocation model predicts the channel allocation strategy of the input time slot according to the non-perfect global channel state information of the input time slot, updates the non-perfect global channel state information of the input time slot based on the predicted channel allocation strategy, and the power allocation model predicts the power allocation strategy of the input time slot according to the updated non-perfect global channel state information of the input time slot; the predicted channel allocation strategy and power allocation strategy of the input time slot interact with the wireless network to obtain the non-perfect global channel state information of the next time slot of the input time slot, and the channel allocation model predicts the channel allocation strategy of the next time slot of the input time slot according to the non-perfect global channel state information of the next time slot of the input time slot. allocation strategy, update the channel allocation strategy of the next time slot of the input time slot based on the channel allocation strategy of the next time slot of the input time slot; S412, calculate the spectrum efficiency reward of the input time slot based on the channel allocation strategy and power allocation strategy of the input time slot; S413, store the non-perfect global channel state information of the input time slot, the channel allocation strategy of the input time slot, the spectrum efficiency reward of the input time slot, and the non-perfect global channel state information of the next time slot of the input time slot as a channel allocation experience in the channel selection playback pool; store the updated non-perfect global channel state information of the input time slot, the power allocation strategy of the input time slot, the spectrum efficiency reward of the input time slot, and the non-perfect global channel state information of the next time slot of the input time slot as a power allocation experience in the power selection playback pool. Step S42 includes: using the non-perfect global channel state information of the next time slot of the previous input time slot as the new non-perfect global channel state information of the input time slot. Step S43 includes: updating the initial resource allocation system parameters based on the channel allocation experience in the channel selection playback pool and the power allocation in the power selection playback pool until convergence.

在本发明的一些实施例中，在所述步骤S43中，当信道选择回放池中有一条信道分配经验时即开始更新信道分配模型的参数；当功率选择回放池中有一条功率分配经验即开始更新功率分配模型的参数。In some embodiments of the present invention, in step S43, when there is a channel allocation experience in the channel selection playback pool, the parameters of the channel allocation model are updated; when there is a power allocation experience in the power selection playback pool, the parameters of the power allocation model are updated.

在本发明的一些实施例中，在所述步骤S43中；当信道选择回放池中的信道分配经验达到预设条数的经验后对信道分配模型的参数进行多次更新直至收敛，其中，每次更新时从信道选择回放池进行随机采样获得多条信道分配经验，并基于采样到的信道分配经验采用梯度下降的方式更新信道分配模型的参数；当功率选择回放池中的功率分配经验达到预设条数的经验后对功率分配模型的参数进行多次更新直至收敛，其中，每次更新时从功率选择回放池进行随机采样获得多条功率分配经验，并基于采样到的功率分配经验采用梯度下降的方式更新功率分配模型的参数。In some embodiments of the present invention, in step S43; when the channel allocation experience in the channel selection replay pool reaches a preset number of experiences, the parameters of the channel allocation model are updated multiple times until convergence, wherein each update is performed by randomly sampling from the channel selection replay pool to obtain multiple channel allocation experiences, and the parameters of the channel allocation model are updated by gradient descent based on the sampled channel allocation experiences; when the power allocation experience in the power selection replay pool reaches a preset number of experiences, the parameters of the power allocation model are updated multiple times until convergence, wherein each update is performed by randomly sampling from the power selection replay pool to obtain multiple power allocation experiences, and the parameters of the power allocation model are updated by gradient descent based on the sampled power allocation experiences.

在本发明的一些实施例中，在步骤S41中，所述输入时隙的非完美全局信道状态信息包括多条链路在输入时隙选择不同子信道的非完美全局信道状态信息：In some embodiments of the present invention, in step S41, the imperfect global channel state information of the input time slot includes imperfect global channel state information of multiple links selecting different sub-channels in the input time slot:

其中，

in,

其中，

表示第k条链路在时隙t选择第n个子信道的状态集，

表示在信道估计误差存在下第k条链路在时隙t选择第n个子信道的独立信道增益，

表示第k条链路在时隙t选择第n个子信道的信道功率，

表示第k条链路在时隙t-1选择第n个子信道后的标识值,

表示第k条链路在时隙t-1选择第n个子信道的功率，

表示第k条链路在时隙t-1对应的频谱效率，

表示第k条链路在时隙t选择第n个子信道对应的估计小尺度衰落分量

与总干扰功率的比率比值在所有信道上的排序值，

表示第k条链路在时隙t选择第n个子信道时采用上一时隙的子信道分配方案和功率分配方案下的同信道干扰，k′表示不同于k的其他链路，

代表信道估计误差的方差，

是考虑阴影衰落和几何衰减的大规模衰落分量，

表示均值为0，方差为

的复高斯分布。in,

represents the state set of the kth link selecting the nth subchannel in time slot t,

represents the independent channel gain of the kth link selecting the nth subchannel in time slot t in the presence of channel estimation error,

represents the channel power of the nth subchannel selected by the kth link in time slot t,

represents the identification value of the kth link after selecting the nth subchannel in time slot t-1,

represents the power of the nth subchannel selected by the kth link in time slot t-1,

represents the spectrum efficiency of the kth link at time slot t-1,

It represents the estimated small-scale fading component corresponding to the n-th subchannel selected by the k-th link in time slot t.

The ratio of the total interference power to the ranking value of all channels,

represents the co-channel interference under the subchannel allocation scheme and power allocation scheme of the previous time slot when the k-th link selects the n-th subchannel in time slot t, k′ represents other links different from k,

represents the variance of the channel estimation error,

is the large-scale fading component that takes into account shadow fading and geometric attenuation,

The mean is 0 and the variance is

The complex Gaussian distribution of .

在本发明的一些实施例中，采用如下方式计算频谱效率奖励：In some embodiments of the present invention, the spectrum efficiency bonus is calculated in the following manner:

其中，

in,

其中，

表示第k条链路在时隙t选择第n个子信道对应的频谱效率，ε_out表示期望的中断概率，

表示第k条链路在时隙t选择第n个子信道的调度频谱效率，φ是干扰的权重系数，k′表示不同于k的其他链路，

表示第k条链路在时隙t选择第n个子信道的外部干扰性，

表示在时隙t第n个子信道没有第k条链路干扰的链路k′的频谱效率，

表示第k′条链路在时隙t选择第n个子信道对应的频谱效率。in,

represents the spectral efficiency of the kth link selecting the nth subchannel in time slot t, ε _out represents the expected outage probability,

represents the scheduling spectrum efficiency of the kth link selecting the nth subchannel in time slot t, φ is the weight coefficient of interference, k′ represents other links different from k,

represents the external interference of the kth link selecting the nth subchannel in time slot t,

represents the spectral efficiency of link k′ in the nth subchannel at time slot t without interference from the kth link,

It represents the spectral efficiency corresponding to the k′th link selecting the nth subchannel in time slot t.

在本发明的一些实施例中，所述DDPG网络包括Actor网络和Critic网络，所述最终的资源分配系统为：训练至收敛的DQN网络、DDQN网络或Dueling DQN网络与Actor网络。In some embodiments of the present invention, the DDPG network includes an Actor network and a Critic network, and the final resource allocation system is: a DQN network trained to convergence, a DDQN network, or a Dueling DQN network and an Actor network.

根据本发明的第二方面，提供一种无线网络资源管理方法，所述方法包括：T1、获取无线通信系统在上一时隙的无线网络状态；T2、基于步骤T1获得的上一时隙的无线网络状态，采用本发明的第一方面所述的方法得到的资源分配系统预测下一时刻的资源分配策略；T3、基于步骤T2得到的下一时刻的资源分配策略分配无线通信系统中的无线网络资源。According to a second aspect of the present invention, a wireless network resource management method is provided, the method comprising: T1, obtaining the wireless network status of the wireless communication system in the previous time slot; T2, based on the wireless network status of the previous time slot obtained in step T1, using the resource allocation system obtained by the method described in the first aspect of the present invention to predict the resource allocation strategy at the next moment; T3, allocating wireless network resources in the wireless communication system based on the resource allocation strategy at the next moment obtained in step T2.

根据本发明的第三方面，提供一种无线通信系统，所述系统包括多个基站，所述每个基站包括无线资源管理单元，所述无线资源管理单元被配置采用如本发明第二方面所述的方法分配基站中的无线网络资源。According to a third aspect of the present invention, there is provided a wireless communication system, the system comprising a plurality of base stations, each of the base stations comprising a wireless resource management unit, the wireless resource management unit being configured to allocate wireless network resources in the base station using the method described in the second aspect of the present invention.

与现有技术相比，本发明的优点在于：采用在非完美全局信道状态信息环境下无线通信需求对应的具有中断概率约束的非凸优化目标作为训练目标，可以充分考虑实际的通信环境中的信道估计误差，即采用更切合实际的CSI(非完美全局信道状态信息)对基于学习的初始资源分配系统进行训练，提高了无线网络资源分配系统的收敛速率并提高了完成优化目标的性能。Compared with the prior art, the advantages of the present invention are: by using a non-convex optimization objective with an interruption probability constraint corresponding to the wireless communication demand in a non-perfect global channel state information environment as a training objective, the channel estimation error in the actual communication environment can be fully considered, that is, a more practical CSI (non-perfect global channel state information) is used to train the learning-based initial resource allocation system, thereby improving the convergence rate of the wireless network resource allocation system and improving the performance of completing the optimization objective.

附图说明BRIEF DESCRIPTION OF THE DRAWINGS

以下参照附图对本发明实施例作进一步说明，其中：The embodiments of the present invention are further described below with reference to the accompanying drawings, in which:

图1为根据本发明实施例的一种无线网络资源分配系统的构建方法的流程示意图；FIG1 is a schematic flow chart of a method for constructing a wireless network resource allocation system according to an embodiment of the present invention;

图2为根据本发明实施例的一种由Dueling DQN网络和DDPG网络组成的初始分配系统进行模型训练和参数更新架构示意图；FIG2 is a schematic diagram of an architecture for model training and parameter updating of an initial allocation system composed of a Dueling DQN network and a DDPG network according to an embodiment of the present invention;

图3为根据本发明实施例的一种无线网络资源管理方法流程示意图；FIG3 is a schematic flow chart of a wireless network resource management method according to an embodiment of the present invention;

图4为根据本发明实施例的本专利所提算法和上述四种基线算法的收敛性能对比示意图；FIG4 is a schematic diagram showing a comparison of convergence performance of the algorithm proposed in this patent and the above four baseline algorithms according to an embodiment of the present invention;

图5为根据本发明实施例的专利所提算法和上述四种基线算法可实现的频谱效率与信道估计误差方差的关系对比示意图；FIG5 is a schematic diagram showing a comparison of the relationship between the spectrum efficiency and the channel estimation error variance that can be achieved by the algorithm proposed in the patent according to an embodiment of the present invention and the above four baseline algorithms;

图6为根据本发明实施例的本专利所提算法和上述四种基线算法的频谱效率在不同子信道数量下的性能对比示意图。FIG6 is a schematic diagram showing a performance comparison of the spectrum efficiency of the algorithm proposed in this patent according to an embodiment of the present invention and the above four baseline algorithms under different numbers of sub-channels.

具体实施方式DETAILED DESCRIPTION

为了使本发明的目的，技术方案及优点更加清楚明白，以下通过具体实施例对本发明进一步详细说明。应当理解，此处所描述的具体实施例仅用以解释本发明，并不用于限定本发明。In order to make the purpose, technical solutions and advantages of the present invention more clearly understood, the present invention is further described in detail by specific embodiments below. It should be understood that the specific embodiments described herein are only used to explain the present invention and are not used to limit the present invention.

正如背景技术中提到的，现有的基于学习的方法没有针对非完美CSI进行设计，并且实际的通信环境中由于信道估计误差是客观存在且无法完全消除，直接采用现有的基于学习的算法在非完美CSI环境下所能达到的优化目标效果较差且算法的收敛速度较低。为解决上述问题，本发明从非完美CSI的特性出发，提出一种基于非完美CSI的无线网络资源分配方案。通过将非完美CSI下的无线网络资源分配问题进行建模为对无线通信需求对应的具有中断概率约束的非凸优化目标进行求解的问题，并且考虑到具有概率约束的优化目标难以求解，本发明进一步将具有中断概率约束的非凸优化目标转换为不含中断概率约束的非凸优化目标并采用基于学习的方法来进行求解，以此来提高在非完美CSI环境下所能达到的优化目标效果和算法的收敛速度。As mentioned in the background technology, the existing learning-based methods are not designed for non-perfect CSI, and in the actual communication environment, since the channel estimation error is objectively present and cannot be completely eliminated, the optimization target effect that can be achieved by directly using the existing learning-based algorithm in the non-perfect CSI environment is poor and the convergence speed of the algorithm is low. To solve the above problems, the present invention proposes a wireless network resource allocation scheme based on non-perfect CSI based on the characteristics of non-perfect CSI. By modeling the wireless network resource allocation problem under non-perfect CSI as a problem of solving a non-convex optimization target with an interruption probability constraint corresponding to the wireless communication demand, and considering that the optimization target with a probability constraint is difficult to solve, the present invention further converts the non-convex optimization target with an interruption probability constraint into a non-convex optimization target without an interruption probability constraint and adopts a learning-based method to solve it, so as to improve the optimization target effect and the convergence speed of the algorithm that can be achieved in the non-perfect CSI environment.

为了更好的理解本发明，下面结合附图和实施例详细说明本发明的方案。In order to better understand the present invention, the scheme of the present invention is described in detail below with reference to the accompanying drawings and embodiments.

根据本发明的一个实施例，提供一种无线网络资源分配系统的构建方法，所述无线网络资源分配系统用于根据无线网络状态获得无线网络资源分配策略,如图1所示，所述方法包括：S1、获取在非完美全局信道状态信息环境下无线通信需求对应的具有中断概率约束的非凸优化目标；S2、对步骤S1获得的非凸优化目标进行转换以获得不含中断概率约束的非凸优化目标；S3、获取无线网络的非完美全局信道状态信息；S4、以步骤S2中的非凸优化目标为训练目标，并以步骤S3中的非完美全局信道状态信息为输入，采用强化学习的方式训练初始资源分配系统至收敛，其中，所述初始资源分配系统为基于智能体构建的且用于基于无线网络状态生成动作集合的系统，所述动作集合包括信道分配策略和功率分配策略。为了更好地介绍本发明的具体方案，以下从不含中断概率约束的非凸优化目标的建立、模型训练、实验验证几个方面进行详细讲解。According to an embodiment of the present invention, a method for constructing a wireless network resource allocation system is provided, wherein the wireless network resource allocation system is used to obtain a wireless network resource allocation strategy according to a wireless network state, as shown in FIG1 , wherein the method comprises: S1, obtaining a non-convex optimization target with an interruption probability constraint corresponding to a wireless communication demand under a non-perfect global channel state information environment; S2, converting the non-convex optimization target obtained in step S1 to obtain a non-convex optimization target without an interruption probability constraint; S3, obtaining non-perfect global channel state information of the wireless network; S4, using the non-convex optimization target in step S2 as a training target, and using the non-perfect global channel state information in step S3 as input, and using a reinforcement learning method to train an initial resource allocation system to convergence, wherein the initial resource allocation system is a system constructed based on an agent and used to generate an action set based on a wireless network state, and the action set includes a channel allocation strategy and a power allocation strategy. In order to better introduce the specific scheme of the present invention, the following is a detailed explanation from the aspects of the establishment of a non-convex optimization target without an interruption probability constraint, model training, and experimental verification.

一、不含中断概率约束的非凸优化目标的建立1. Establishment of non-convex optimization objective without interruption probability constraint

由于现有的基于学习的网络资源分配方法没有针对非完美CSI进行设计，为了更好理解，以下将从无线通信网络的优化目标的建立和转换进行详细讲解。其中，为了便于理解，本发明实施例中采用公式推导的方式介绍这个过程。Since the existing learning-based network resource allocation method is not designed for imperfect CSI, for better understanding, the establishment and conversion of the optimization target of the wireless communication network will be explained in detail below. In order to facilitate understanding, the embodiment of the present invention adopts the method of formula derivation to introduce this process.

本发明首先针对非完美CSI下的无线网络环境进行数学化描述，进而基于环境的数学化描述对无线网络环境进行建模。无线通信网络包括多个通信区域，每个通信区域具有基站和多个用户，多个通信区域中的所有用户共享多条子信道，每个基站都位于区域的中心，授权用户随机分布在通信区域内；所有用户和基站的收发器都配备一个天线，并且每条形成的链路在一个时隙中只能选择一个子信道。例如，无线网络中一个下行链路具有多小区多用户的网络场景，其中K个链路分布在M个小区中且共享N个正交子信道，其中，

和

分别表示链路索引集合、小区索引集合和子信道索引集合。The present invention first mathematically describes the wireless network environment under non-perfect CSI, and then models the wireless network environment based on the mathematical description of the environment. The wireless communication network includes multiple communication areas, each communication area has a base station and multiple users, all users in the multiple communication areas share multiple sub-channels, each base station is located at the center of the area, and authorized users are randomly distributed in the communication area; the transceivers of all users and base stations are equipped with an antenna, and each formed link can only select one sub-channel in a time slot. For example, a downlink in a wireless network has a network scenario with multiple cells and multiple users, where K links are distributed in M cells and share N orthogonal sub-channels, where,

and

They represent the link index set, cell index set and subchannel index set respectively.

在无线通信环境中，考虑一个具有时间间隔的完全同步系统，第k条链路在时隙t选择第n个子信道的独立信道增益可以表示为：In a wireless communication environment, considering a fully synchronized system with a time interval, the independent channel gain of the kth link selecting the nth subchannel in time slot t can be expressed as:

其中，

表示考虑阴影衰落和几何衰减的大规模衰落分量，其中，假定

在多个时隙中是不变的；

表示第k条链路在时隙t选择第n个子信道的估计小尺度衰落分量。in,

represents the large-scale fading component considering shadow fading and geometric attenuation, where it is assumed that

is constant across multiple time slots;

It represents the estimated small-scale fading component of the n-th subchannel selected by the k-th link at time slot t.

在无线通信环境中，考虑归一化带宽，在完美CSI的情况下，第k条链路在时隙t选择第n个子信道的最大频谱效率：In a wireless communication environment, considering the normalized bandwidth, under perfect CSI, the maximum spectral efficiency of the kth link selecting the nth subchannel in time slot t is:

其中，

表示第k条链路在时隙t选择第n个子信道后的标识值，例如

表示第k条链路在时隙t选择了第n个子信道，否则

表示第k条链路在时隙t选择第n个子信道的功率，σ²表示加性高斯白噪声的功率，

表示第k条链路在时隙t选择了第n个子信道时所受到的同信道干扰。in,

It represents the identification value of the kth link after selecting the nth subchannel in time slot t, for example

Indicates that the kth link selects the nth subchannel in time slot t, otherwise

represents the power of the nth subchannel selected by the kth link in time slot t, σ ² represents the power of additive white Gaussian noise,

It represents the co-channel interference suffered by the k-th link when the n-th sub-channel is selected in time slot t.

在无线通信环境中，由于不可避免的信道估计误差，完美CSI中，

假设为真实值，这样的假设忽略了实际的通信环境中的信道估计误差。由此，需要对小尺度衰落分量进行客观的估计。假设基站可以完美估计大尺度衰落系数由于其变化缓慢，而小尺度衰落变化迅速因此不能完美估计小尺度的衰落系数，于是，在本发明的一个实施例中，基于非完美CSI，将第k条链路在时隙t选择第n个子信道的估计小尺度衰落分量表示为：In wireless communication environments, due to the inevitable channel estimation error, in perfect CSI,

Assuming that it is a true value, such an assumption ignores the channel estimation error in the actual communication environment. Therefore, it is necessary to objectively estimate the small-scale fading component. Assuming that the base station can perfectly estimate the large-scale fading coefficient because it changes slowly, and the small-scale fading changes rapidly, it is impossible to perfectly estimate the small-scale fading coefficient. Therefore, in one embodiment of the present invention, based on the imperfect CSI, the estimated small-scale fading component of the n-th subchannel selected by the k-th link in time slot t is expressed as:

其中，

in,

其中，

表示第k条链路在时隙t选择第n个子信道的估计小尺度衰落分量的误差，并且每个

相互独立。

表示均值为0，方差为

的复高斯分布，

表示均值为0，方差为

的复高斯分布，

代表信道估计误差的方差。需要说明的是，现有完美的CSI信息存在缺陷主要指小规模的衰落系数通常不能被完美估计，见公式(3)。由于信道估计误差的存在以及其他因素，小规模的衰落系数的信道估计值

通常不等于真实值。如果是基于现有完美CSI的算法来直接应用到非完美CSI的环境中，相当于是直接把估计值

当作真实值来进行资源分配，由于估计值和真实值之间存在误差(即信道估计误差)，因此如果直接采用基于完美CSI的算法进行网络资源分配，提升的传输性能和网络容量的效果一般。而实际情况中一定是存在信道估计误差以及其他因素导致CSI并不能被完美估计，因此需要考虑这个非完美CSI因素，且直接使用基于现有的完美CSI的资源分配算法相当于直接用估计值代替了实际值，会降低算法的性能。in,

represents the error of the estimated small-scale fading component of the nth subchannel selected by the kth link in time slot t, and each

Independent of each other.

The mean is 0 and the variance is

The complex Gaussian distribution of

The mean is 0 and the variance is

The complex Gaussian distribution of

Represents the variance of the channel estimation error. It should be noted that the existing perfect CSI information has defects mainly in that small-scale fading coefficients cannot usually be perfectly estimated, see formula (3). Due to the existence of channel estimation errors and other factors, the channel estimation value of small-scale fading coefficients

Usually it is not equal to the true value. If the algorithm based on the existing perfect CSI is directly applied to the environment of imperfect CSI, it is equivalent to directly applying the estimated value

Assuming that the estimated value is the real value for resource allocation, there is an error between the estimated value and the real value (i.e., the channel estimation error). Therefore, if the algorithm based on perfect CSI is directly used for network resource allocation, the effect of improving transmission performance and network capacity is average. In reality, there must be channel estimation errors and other factors that cause CSI to not be perfectly estimated. Therefore, this non-perfect CSI factor needs to be considered. Directly using the resource allocation algorithm based on the existing perfect CSI is equivalent to directly replacing the actual value with the estimated value, which will reduce the performance of the algorithm.

在上述无线网络环境的数学化描述后，以下将需要优化问题(即无线通信需求)进行建模。需要优化问题至少包括最大化吞吐量和最大化频谱效率。由于最大化吞吐量和最大化频谱效率在定义上可以通过公式进行相互转化，本发明实施例中将以最大化频谱效率为例进行建模讲解，最大化吞吐量的建模此处不在赘述。After the mathematical description of the wireless network environment, the optimization problem (i.e., wireless communication demand) will be modeled below. The optimization problem needs to include at least maximizing throughput and maximizing spectrum efficiency. Since maximizing throughput and maximizing spectrum efficiency can be converted to each other through formulas in definition, the embodiment of the present invention will take maximizing spectrum efficiency as an example for modeling explanation, and the modeling of maximizing throughput will not be repeated here.

由于受到非完美CSI的影响，调度的频谱效率可能会超过香农容量公式定义的最大可实现频谱效率。因此，当调度的频谱效率超过非完美CSI时可实现的频谱效率时，中断概率被用作衡量性能的指标。当第k条链路在时隙t选择第n个子信道的调度频谱效率表示为

则无线网络在时隙t的平均频谱效率由下式给出：Due to the influence of imperfect CSI, the spectral efficiency of scheduling may exceed the maximum achievable spectral efficiency defined by Shannon's capacity formula. Therefore, when the spectral efficiency of scheduling exceeds the achievable spectral efficiency with imperfect CSI, the outage probability is used as a performance metric. The scheduling spectral efficiency when the kth link selects the nth subchannel in time slot t is expressed as

Then the average spectral efficiency of the wireless network in time slot t is given by:

进一步的，在时隙t，非完美CSI下，最大化无线网络的频谱效率对应的具有中断概率约束的非凸优化目标：Furthermore, at time slot t, under non-perfect CSI, the non-convex optimization objective with outage probability constraint corresponding to maximizing the spectrum efficiency of the wireless network is:

其中，

in,

其中，

表示子信道索引的集合，

表示第k条链路在时隙t选择第n个子信道的调度频谱效率，

表示第k条链路在时隙t选择第n个子信道的最大频谱效率，

表示在估计小尺度衰落分量

为条件下

的概率，

表示第k条链路在时隙t选择第n个子信道的功率，p^t表示所有

功率的集合，

represents a set of subchannel indices,

Indicates that in estimating the small-scale fading component

Under the condition

The probability of

A collection of power,

由于具有中断概率约束的非凸优化目标即使在无线网络资源中的子信道策略固定时，仅仅考虑功率分配问题，也已被证明为NP-Hard问题(所有非确定性多项式问题都能在多项式时间复杂度内归约到的问题)，因此具有中断概率约束的非凸优化目标的最优解难以直接通过数学推导进行求解。本发明针对该问题的求解，采用参数变换的方式将原始优化目标(即具有中断概率约束的非凸优化目标)转换(通过约束条件的替换以及相应的求解转换)为不含中断概率约束的非凸优化目标，使得最大化无线网络的频谱效率对应的具有中断概率约束的非凸优化目标能够求解。以下将从约束条件替换和优化问题转换两部分详细介绍采用参数变换的方式将原始优化目标转换过程。Since the non-convex optimization objective with interruption probability constraint has been proved to be an NP-Hard problem (a problem to which all non-deterministic polynomial problems can be reduced within polynomial time complexity) even when the sub-channel strategy in the wireless network resources is fixed and only the power allocation problem is considered, the optimal solution of the non-convex optimization objective with interruption probability constraint is difficult to be solved directly by mathematical deduction. In order to solve this problem, the present invention adopts parameter transformation to convert the original optimization objective (i.e., the non-convex optimization objective with interruption probability constraint) into a non-convex optimization objective without interruption probability constraint (through the replacement of constraint conditions and corresponding solution conversion), so that the non-convex optimization objective with interruption probability constraint corresponding to maximizing the spectrum efficiency of the wireless network can be solved. The following will introduce in detail the process of converting the original optimization objective by parameter transformation from two parts: constraint replacement and optimization problem conversion.

在约束条件替换过程中，发明人考虑了一个更严格的约束R1来替换中断概率约束M1，使得约束R1总是满足上述非凸优化目标中的中断概率约束M1，其中，约束R1为：In the process of replacing the constraint condition, the inventors considered a stricter constraint R1 to replace the interruption probability constraint M1, so that the constraint R1 always satisfies the interruption probability constraint M1 in the above non-convex optimization objective, where the constraint R1 is:

其中，

表示在香农公式定义下的第k条链路在时隙t选择第n个子信道的噪声以及干扰信号强度,

表示实际可调度下的第k条链路在时隙t选择第n个子信道的噪声以及干扰信号强度,

表示实际可调度下的第k条链路在时隙t选择第n个子信道的有用信号强度，

表示在香农公式定义下的第k条链路在时隙t选择第n个子信道的有用信号强度。约束R1-1表示在

条件下对于所有的k和n，

小于

的概率不能大于

约束R1-2表示在

条件下对于所有的k和n，

小于

的概率均等于

in,

represents the noise and interference signal strength of the nth subchannel selected by the kth link in time slot t under the Shannon formula,

represents the noise and interference signal strength of the nth subchannel selected by the kth link in time slot t under actual scheduling,

It represents the useful signal strength of the nth subchannel selected by the kth link in time slot t under actual schedulability.

represents the useful signal strength of the nth subchannel selected by the kth link in time slot t under the Shannon formula. Constraint R1-1 represents

For all k and n under the condition,

Less than

The probability cannot be greater than

Constraint R1-2 means that

For all k and n under the condition,

Less than

The probability is equal to

以下将讲解约束R1比中断约束M1更严格的证明过程。证明过程包括参数定义和证明推理两部分。The following will explain the proof process that constraint R1 is stricter than interrupt constraint M1. The proof process includes two parts: parameter definition and proof reasoning.

参数定义部分为：首先根据香农公式定义The parameter definition part is as follows: First, according to the Shannon formula,

同理定义在非完美CSI的情况下，第k条链路在时隙t选择第n个子信道的调度频谱效率为：Similarly, in the case of imperfect CSI, the scheduling spectrum efficiency of the kth link selecting the nth subchannel in time slot t is defined as:

其中，

表示实际可调度下的第k条链路在时隙t选择第n个子信道的信道干扰比。in,

It represents the channel interference ratio of the kth link selecting the nth subchannel in time slot t under actual schedulable conditions.

由公式(2)和公式(7)可得From formula (2) and formula (7), we can get

则

由公式(8)可得

则原始中断概率约束M1可以写为：but

From formula (8), we can get

Then the original outage probability constraint M1 can be written as:

由公式(7)、(8)带入公式(9)可得：Substituting formula (7) and (8) into formula (9) yields:

根据全概率公式可得：According to the total probability formula, we can get:

其中，

in,

其中，Pr(E₁)表示在

和

的条件下

小于

的概率,Pr(E₂)表示在

和

的条件下

小于

的概率。Where Pr(E ₁ ) represents

and

Under the conditions

Less than

The probability of Pr(E ₂ ) is expressed in

and

Under the conditions

Less than

probability.

证明推理部分为：The reasoning part of the proof is:

其中，约束R1中的约束R1-1证明如下：Among them, constraint R1-1 in constraint R1 is proved as follows:

将约束R1-1的

替换为

则

进而可得

对于Pr(E₂)来说，由于

则有

一定小于

Constrain R1-1

Replace with

but

Then we can get

For Pr(E ₂ ), due to

Then there is

Must be less than

约束R1中的约束R1-2证明如下：The constraint R1-2 in constraint R1 is proved as follows:

根据全概率公式可得

由于Pr(E₁)≤1，则有

According to the total probability formula, we can get

Since Pr(E ₁ )≤1, we have

经过上述约束R1中的约束R1-1和约束R1-2的证明过程可以看出约束R1是比约束M1更严格的约束。Through the proof process of constraint R1-1 and constraint R1-2 in the above constraint R1, it can be seen that constraint R1 is a stricter constraint than constraint M1.

在优化问题转换过程中，根据更严格的约束R1，对原始优化问题进行转换。以下从约束R1中的约束R1-1的变换、约束R1-2的变换和具有中断概率约束的非凸优化目标的转换讲解具体推导过程。In the optimization problem conversion process, the original optimization problem is converted according to the stricter constraint R1. The following explains the specific derivation process from the transformation of constraint R1-1 in constraint R1, the transformation of constraint R1-2, and the transformation of the non-convex optimization objective with interruption probability constraint.

根据上述更严格的约束R1-1可得：According to the stricter constraint R1-1 above, we can get:

根据马尔可夫不等式，由公式(12)可以得到：According to Markov inequality, formula (12) can be obtained:

令公式(13)的右侧等于

则有：Let the right side of formula (13) equal to

Then we have:

根据上述更严格的约束R1-2可得：According to the stricter constraint R1-2 above, we can get:

其中，F代表卡方分布的累积分布函数，令公式(15)等于

可以得到：Where F represents the cumulative distribution function of the chi-square distribution, let formula (15) equal to

You can get:

其中，F^-1代表卡方分布的逆累积分布函数(CDF)。由于

且

将这两项以及公式(16)代入公式(14)中可以得到：Where F ^-1 represents the inverse cumulative distribution function (CDF) of the chi-square distribution.

and

Substituting these two terms and formula (16) into formula (14), we can obtain:

因此可以得到：So we can get:

公式(18)等价为：Formula (18) is equivalent to:

由此，无线网络经参数变换后在时隙t的平均频谱效率表示为：Therefore, the average spectrum efficiency of the wireless network in time slot t after parameter transformation is expressed as:

其中，F^-1代表卡方分布的逆累积分布函数(CDF)。Here, F ^-1 represents the inverse cumulative distribution function (CDF) of the chi-square distribution.

综上所述，具有中断概率约束的非凸优化目标转换为不含中断概率约束的非凸优化目标为：In summary, the non-convex optimization objective with interruption probability constraints is transformed into a non-convex optimization objective without interruption probability constraints:

其中，

in,

其中，Ω^t表示的无线网络经参数变换后在时隙t的平均频谱效率。需要说明的是，本发明由于针对更实际的场景下的资源分配中考虑了信道估计误差引起的非完美CSI，而非完美CSI会导致存在中断概率，因此优化模型中具有中断概率的约束不能直接使用原有的基于完美CSI的算法进行求解。因此对优化模型进行了参数转化后，针对转换后的优化模型设计了新的基于非完美CSI的学习算法，并将非完美CSI、信道估计误差等参数设计为状态集合中的一部分，使得深度强化学习网络能有效地学习到非完美CSI的影响来提高基于学习的算法在非完美CSI环境下所能达到的优化目标效果和算法的性能。Among them, Ω ^t represents the average spectral efficiency of the wireless network in time slot t after parameter transformation. It should be noted that the present invention takes into account the imperfect CSI caused by channel estimation error in resource allocation for more practical scenarios, and imperfect CSI will lead to the existence of interruption probability. Therefore, the constraint with interruption probability in the optimization model cannot be directly solved using the original algorithm based on perfect CSI. Therefore, after the optimization model is parameterized, a new learning algorithm based on imperfect CSI is designed for the converted optimization model, and parameters such as imperfect CSI and channel estimation error are designed as part of the state set, so that the deep reinforcement learning network can effectively learn the impact of imperfect CSI to improve the optimization target effect and algorithm performance that can be achieved by the learning-based algorithm in an imperfect CSI environment.

二、模型训练2. Model Training

经过上述步骤对具有中断概率约束的非凸优化目标进行转换，得到不含中断概率约束的非凸优化目标(公式21)，仍然属于NP-Hard问题。传统算法，例如背景技术部分提到的参考文献[5]记载的求解方案，需要多次迭代才能收敛，并且随着用户链路数的增加不能很好地扩展。此外，通信系统中的集中控制器获取瞬时全局CSI并将分配方案发送回BS是非常具有挑战性的。为了使不含中断概率约束的非凸优化目标可解，首先将联合无线通信需求(本实施例以无线通信需求为最大化无线网络的频谱效率为例进行展开说明，但并不意味着无线通信需求只有最大化频谱效率)解耦为两个子问题，即子信道选择子问题和功率分配子问题。然后用能同时处理这两个子问题的学习模型(初始资源分配系统)来处理最大化无线网络的频谱效率的问题以提高最终的资源分配系统的收敛性能和优化目标的效果。其中，最终的资源分配系统是以非凸优化目标为训练目标，并采用与该训练目标相关的非完美全局信道状态信息、资源分配策略组成训练集对初始资源分配系统训练获得的。After the above steps, the non-convex optimization objective with the interruption probability constraint is transformed to obtain the non-convex optimization objective without the interruption probability constraint (Formula 21), which is still an NP-Hard problem. Traditional algorithms, such as the solution described in reference [5] mentioned in the background technology section, require multiple iterations to converge and cannot be well expanded as the number of user links increases. In addition, it is very challenging for the centralized controller in the communication system to obtain the instantaneous global CSI and send the allocation scheme back to the BS. In order to make the non-convex optimization objective without the interruption probability constraint solvable, the joint wireless communication demand (this embodiment takes the wireless communication demand as maximizing the spectrum efficiency of the wireless network as an example for explanation, but it does not mean that the wireless communication demand only maximizes the spectrum efficiency) is first decoupled into two sub-problems, namely the sub-channel selection sub-problem and the power allocation sub-problem. Then, a learning model (initial resource allocation system) that can handle these two sub-problems at the same time is used to handle the problem of maximizing the spectrum efficiency of the wireless network to improve the convergence performance of the final resource allocation system and the effect of the optimization objective. The final resource allocation system is obtained by training the initial resource allocation system using a non-convex optimization objective as the training objective and using a training set composed of non-perfect global channel state information and resource allocation strategies related to the training objective.

根据本发明的一个实施例，初始资源分配系统包括：信道分配模型(本发明实施例中也称之为第一层网络)和功率分配模型(本发明实施例中也称之为第二层网络)；所述信道分配模型用于基于一个时隙的非完美全局信道状态信息预测该时隙的信道分配策略；优选的，信道分配模型配置为DQN网络、DDQN网络或Dueling DQN网络；所述功率分配模型用于基于一个时隙的非完美全局信道状态信息预测该时隙的功率分配策略，优选的，功率分配模型配置为DDPG网络。需要说明的是，信道分配子问题属于离散类型任务，功率分配子问题属于连续类型的任务，采用前面实施例中所述的信道分配模型和功率分配模型组成的双层学习网络架构中可以避免引入量化误差。其中，对于信道分配，采用DQN网络、DDQN网络或Dueling DQN网络处理离散变量类型的资源；对于功率分配，信道功率是由P_max限制的连续标量(对于一些算法，比如基于值的DQN算法，动作空间必须是有限的，发射功率可能会被离散化，连续变量的离散化必然会导致量化误差)，为了避免信道功率被离散化，本发明中的第二层网络采用DDPG网络，DDPG网络包括Actor网络和Critic网络，Actor网络的作用是输出分配的功率，Critic网络的作用是评价Actor网络的动作并更新Actor网络中的参数；第二层网络通过DDPG中的Actor网络可以基于一个时隙的非完美全局信道状态信息输出由确定性功率分配动作组成的功率分配策略。因此，采用DQN网络、DDQN网络或Dueling DQN可以更快的学习到最优的子信道动作，再结合DDPG处理连续变量类型的资源(信道功率分配)，可以获得相比现有算法处理不含中断概率约束的非凸优化目标，具有更快的收敛速率和更高的频谱效率。According to one embodiment of the present invention, the initial resource allocation system includes: a channel allocation model (also referred to as the first layer network in the embodiment of the present invention) and a power allocation model (also referred to as the second layer network in the embodiment of the present invention); the channel allocation model is used to predict the channel allocation strategy of a time slot based on the imperfect global channel state information of the time slot; preferably, the channel allocation model is configured as a DQN network, a DDQN network or a Dueling DQN network; the power allocation model is used to predict the power allocation strategy of the time slot based on the imperfect global channel state information of a time slot, and preferably, the power allocation model is configured as a DDPG network. It should be noted that the channel allocation subproblem belongs to a discrete type task, and the power allocation subproblem belongs to a continuous type task. The introduction of quantization errors can be avoided by using the two-layer learning network architecture composed of the channel allocation model and the power allocation model described in the previous embodiment. Among them, for channel allocation, a DQN network, a DDQN network or a Dueling DQN network is used to process discrete variable type resources; for power allocation, the channel power is a continuous scalar limited by P _max (for some algorithms, such as the value-based DQN algorithm, the action space must be limited, the transmission power may be discretized, and the discretization of continuous variables will inevitably lead to quantization errors). In order to avoid the discretization of channel power, the second layer network in the present invention adopts a DDPG network, which includes an Actor network and a Critic network. The role of the Actor network is to output the allocated power, and the role of the Critic network is to evaluate the action of the Actor network and update the parameters in the Actor network; the second layer network can output a power allocation strategy composed of deterministic power allocation actions based on the imperfect global channel state information of a time slot through the Actor network in DDPG. Therefore, the use of a DQN network, a DDQN network or a Dueling DQN can learn the optimal sub-channel action faster, and combined with DDPG to process continuous variable type resources (channel power allocation), it can obtain a non-convex optimization target that does not contain interruption probability constraints compared to the existing algorithm, with a faster convergence rate and higher spectrum efficiency.

根据本发明的一个实施例，训练初始资源分配系统的过程中，所述步骤S4包括步骤S41、步骤S42和步骤S43。其中，步骤S41包括：获取输入时隙的非完美全局信道状态信息并执行如下步骤：S411、由信道分配模型根据输入时隙的非完美全局信道状态信息预测该输入时隙的信道分配策略，基于所述预测的信道分配策略更新输入时隙的非完美全局信道状态信息，并且由功率分配模型根据更新后的输入时隙的非完美全局信道状态信息预测该输入时隙的功率分配策略；由预测的输入时隙的信道分配策略和功率分配策略与无线网络进行交互得到输入时隙的下一时隙的非完美全局信道状态信息，以及由信道分配模型根据输入时隙的下一时隙的非完美全局信道状态信息预测输入时隙的下一时隙的信道分配策略，基于所述输入时隙的下一时隙的信道分配策略更新输入时隙的下一时隙的信道分配策略；S412、基于输入时隙的信道分配策略和功率分配策略计算该输入时隙的频谱效率奖励；S413、以输入时隙的非完美全局信道状态信息、输入时隙的信道分配策略、输入时隙的频谱效率奖励、输入时隙的下一时隙的非完美全局信道状态信息为一条信道分配经验存入信道选择回放池；以更新后的输入时隙的非完美全局信道状态信息、输入时隙的功率分配策略、输入时隙的频谱效率奖励、更新后的输入时隙的下一时隙的非完美全局信道状态信息为一条功率分配经验存入功率选择回放池。步骤S42包括：以上一次输入时隙的下一时隙的非完美全局信道状态信息为新的输入时隙的非完美全局信道状态信息。步骤S43包括：基于信道选择回放池中的信道分配经验和功率选择回放池中的功率分配更新初始资源分配系统参数直至收敛。需要说明的是，基于输入时隙的信道分配策略和功率分配策略计算该输入时隙的频谱效率奖励，该频谱效率奖励代表信道分配和功率分配对优化目标的整体贡献，可以使得信道分配模型和功率分配模型共享相同的奖励函数，以及以最大化无线网络的频谱效率的目标协同工作。According to one embodiment of the present invention, in the process of training the initial resource allocation system, step S4 includes step S41, step S42 and step S43. Wherein, step S41 includes: obtaining the non-perfect global channel state information of the input time slot and executing the following steps: S411, the channel allocation model predicts the channel allocation strategy of the input time slot according to the non-perfect global channel state information of the input time slot, updates the non-perfect global channel state information of the input time slot based on the predicted channel allocation strategy, and the power allocation model predicts the power allocation strategy of the input time slot according to the updated non-perfect global channel state information of the input time slot; the predicted channel allocation strategy and power allocation strategy of the input time slot interact with the wireless network to obtain the non-perfect global channel state information of the next time slot of the input time slot, and the channel allocation model predicts the channel allocation strategy of the next time slot of the input time slot according to the non-perfect global channel state information of the next time slot of the input time slot. allocation strategy, update the channel allocation strategy of the next time slot of the input time slot based on the channel allocation strategy of the next time slot of the input time slot; S412, calculate the spectrum efficiency reward of the input time slot based on the channel allocation strategy and power allocation strategy of the input time slot; S413, store the non-perfect global channel state information of the input time slot, the channel allocation strategy of the input time slot, the spectrum efficiency reward of the input time slot, and the non-perfect global channel state information of the next time slot of the input time slot as a channel allocation experience in the channel selection playback pool; store the updated non-perfect global channel state information of the input time slot, the power allocation strategy of the input time slot, the spectrum efficiency reward of the input time slot, and the non-perfect global channel state information of the next time slot of the input time slot as a power allocation experience in the power selection playback pool. Step S42 includes: using the non-perfect global channel state information of the next time slot of the previous input time slot as the new non-perfect global channel state information of the input time slot. Step S43 includes: updating the initial resource allocation system parameters based on the channel allocation experience in the channel selection playback pool and the power allocation in the power selection playback pool until convergence. It should be noted that the spectral efficiency reward of the input time slot is calculated based on the channel allocation strategy and power allocation strategy of the input time slot. The spectral efficiency reward represents the overall contribution of channel allocation and power allocation to the optimization objective, which enables the channel allocation model and the power allocation model to share the same reward function and work together with the goal of maximizing the spectral efficiency of the wireless network.

根据本发明的一个实施例，在所述步骤S43中，当信道选择回放池中有一条信道分配经验时即开始更新信道分配模型的参数；当功率选择回放池中有一条功率分配经验即开始更新功率分配模型的参数。According to one embodiment of the present invention, in step S43, when there is a channel allocation experience in the channel selection playback pool, the parameters of the channel allocation model are updated; when there is a power allocation experience in the power selection playback pool, the parameters of the power allocation model are updated.

根据本发明的一个实施例，在所述步骤S43中，当信道选择回放池中的信道分配经验达到预设条数的经验后对信道分配模型的参数进行多次更新直至收敛，其中，每次更新时从信道选择回放池进行随机采样获得多条信道分配经验，并基于采样到的信道分配经验采用梯度下降的方式更新信道分配模型的参数；当功率选择回放池中的功率分配经验达到预设条数的经验后对功率分配模型的参数进行多次更新直至收敛，其中，每次更新时从功率选择回放池进行随机采样获得多条功率分配经验，并基于采样到的功率分配经验采用梯度下降的方式更新功率分配模型的参数。需要说明的是，当信道选择回放池中的经验条数达到阈值时，存入信道选择回放池中的信道分配经验以替换当前信道选择回放池中最先存入的信道分配经验的方式(即先入先出的方式)存入信道选择回放池中；当功率选择回放池中的经验条数达到阈值时，存入功率选择回放池中的功率分配经验以替换当前信道选择回放池中最先存入的功率分配经验的方式(即先入先出的方式)存入功率选择回放池中。其中，信道选择回放池或者功率选择回放池中的经验达到预设条数才开始用于随机采样，可以加快信道分配模型或者功率分配模型的收敛速度；经验条数的阈值设置，可以降低模型训练的硬件需求；信道选择回放池或者功率选择回放池中采用先入先出的方式进行经验的存储，可以使得新产生的较优的经验可以很好的替换相对较差的经验，以此保证信道选择回放池或者功率选择回放池中经验在采样时，处于一个最优经验的存储状态，从而进一步地加快信道分配模型或者功率分配模型的收敛速度。According to one embodiment of the present invention, in step S43, when the channel allocation experience in the channel selection replay pool reaches a preset number of experiences, the parameters of the channel allocation model are updated multiple times until convergence, wherein each update is performed by randomly sampling from the channel selection replay pool to obtain multiple channel allocation experiences, and the parameters of the channel allocation model are updated by gradient descent based on the sampled channel allocation experiences; when the power allocation experience in the power selection replay pool reaches a preset number of experiences, the parameters of the power allocation model are updated multiple times until convergence, wherein each update is performed by randomly sampling from the power selection replay pool to obtain multiple power allocation experiences, and the parameters of the power allocation model are updated by gradient descent based on the sampled power allocation experiences. It should be noted that when the number of experiences in the channel selection replay pool reaches a threshold, the channel allocation experience stored in the channel selection replay pool is stored in the channel selection replay pool in a manner that replaces the channel allocation experience first stored in the current channel selection replay pool (i.e., a first-in, first-out manner); when the number of experiences in the power selection replay pool reaches a threshold, the power allocation experience stored in the power selection replay pool is stored in the power selection replay pool in a manner that replaces the power allocation experience first stored in the current channel selection replay pool (i.e., a first-in, first-out manner). Among them, the experience in the channel selection replay pool or the power selection replay pool reaches a preset number before it is used for random sampling, which can speed up the convergence speed of the channel allocation model or the power allocation model; the threshold setting of the number of experiences can reduce the hardware requirements for model training; the channel selection replay pool or the power selection replay pool adopts a first-in-first-out method to store experiences, which can make the newly generated better experience can replace the relatively worse experience well, so as to ensure that the experience in the channel selection replay pool or the power selection replay pool is in an optimal experience storage state when sampling, thereby further accelerating the convergence speed of the channel allocation model or the power allocation model.

为了更好的训练初始资源分配系统，在训练初始资源分配系统的过程中，首先获取输入时隙的非完美全局信道状态信息。根据本发明的一个实施例，在训练初始资源分配系统的过程中，输入时隙的非完美全局信道状态信息包括多条链路在输入时隙选择不同子信道的非完美全局信道状态信息(本发明有时也称状态集合)，其中一条链路在输入时隙选择不同子信道后的非完美全局信道状态信息为一个状态集，所述多条链路在输入时隙选择不同子信道后的非完美全局信道状态信息为：In order to better train the initial resource allocation system, in the process of training the initial resource allocation system, the non-perfect global channel state information of the input time slot is first obtained. According to one embodiment of the present invention, in the process of training the initial resource allocation system, the non-perfect global channel state information of the input time slot includes non-perfect global channel state information of multiple links selecting different sub-channels in the input time slot (sometimes referred to as a state set in the present invention), wherein the non-perfect global channel state information of one link after selecting different sub-channels in the input time slot is a state set, and the non-perfect global channel state information of the multiple links after selecting different sub-channels in the input time slot is:

其中，

in,

其中，

表示第k条链路在时隙t选择第n个子信道的状态集，

表示第k条链路在时隙t选择第n个子信道的信道功率，

表示第k条链路在时隙t-1选择第n个子信道后的标识值,

表示第k条链路在时隙t-1选择第n个子信道的功率，

表示第k条链路在时隙t-1对应的频谱效率，

与总干扰功率的比率比值在所有信道上的排序值，

代表信道估计误差的方差，

是考虑阴影衰落和几何衰减的大规模衰落分量，

表示均值为0，方差为in,

represents the spectrum efficiency of the kth link at time slot t-1,

The ratio of the total interference power to the ranking value of all channels,

represents the variance of the channel estimation error,

The mean is 0 and the variance is

的复高斯分布。根据本发明的一个实施例，输入时隙的非完美全局信道状态信息为状态集合时，第一层网络被配置为与链路的条数相同个数的信道分配模型，第二层网络被配置为与链路的条数相同个数的功率分配模型，其中，由一个信道分配模型和一个功率分配模型处理一个状态集。根据本发明的一个实施例，输入时隙的非完美全局信道状态信息为状态集合时，第一层网络被配置为一个信道分配模型，第二层网络被配置为一个功率分配模型，其中由一个信道分配模型和一个功率分配模型依次处理状态集合中的每个状态集。第一层网络被配置为与链路的条数相同个数的信道分配模型以及第二层网络被配置为与链路的条数相同个数的功率分配模型，可以提高初始资源分配模型的处理速度。根据本发明的一个实施例，在步骤S411中，输入时隙的非完美全局信道状态信息为状态集合时，基于所述预测的信道分配策略通过如下方式更新输入时隙的非完美全局信道状态信息：基于信道分配模型预测的信道分配策略，从状态集合中选择执行所述预测的信道分配策略对应的状态集作为更新输入时隙的非完美全局信道状态信息。

Complex Gaussian distribution. According to one embodiment of the present invention, when the imperfect global channel state information of the input time slot is a state set, the first layer network is configured as a channel allocation model with the same number as the number of links, and the second layer network is configured as a power allocation model with the same number as the number of links, wherein one state set is processed by one channel allocation model and one power allocation model. According to one embodiment of the present invention, when the imperfect global channel state information of the input time slot is a state set, the first layer network is configured as a channel allocation model, and the second layer network is configured as a power allocation model, wherein each state set in the state set is processed in sequence by one channel allocation model and one power allocation model. The first layer network is configured as a channel allocation model with the same number as the number of links and the second layer network is configured as a power allocation model with the same number as the number of links, which can improve the processing speed of the initial resource allocation model. According to one embodiment of the present invention, in step S411, when the non-perfect global channel state information of the input time slot is a state set, the non-perfect global channel state information of the input time slot is updated in the following manner based on the predicted channel allocation strategy: based on the channel allocation strategy predicted by the channel allocation model, a state set corresponding to the predicted channel allocation strategy is selected from the state set as the non-perfect global channel state information for updating the input time slot.

需要说明的是，状态集的选择对初始资源分配系统的训练效果很重要，状态集的选择要体现非完美CSI的特征，也就意味着必须选择相关信道状态信息能够体现非完美CSI的信息作为状态集中的元素，以规避非完美全局信道状态信息中非必要的信道状态信息，来提高初始资源分配模型的训练效果。其中，信道估计误差的方差、估计的信道增益(独立信道增益)、每条链路在某个时隙选择子信道对应的估计小尺度衰落分量与总干扰功率的比值在所有信道上的排序值是最能体现非完美CSI对应的信道状态信息的关键特征，因此，本发明中在状态集中引入了这些信息，并针对性的设计了相应的奖励使得资源分配模型系统在引入信道估计误差下能获得更优的增益。It should be noted that the selection of the state set is very important for the training effect of the initial resource allocation system. The selection of the state set must reflect the characteristics of the imperfect CSI, which means that the relevant channel state information that can reflect the imperfect CSI must be selected as an element in the state set to avoid unnecessary channel state information in the imperfect global channel state information to improve the training effect of the initial resource allocation model. Among them, the variance of the channel estimation error, the estimated channel gain (independent channel gain), and the ranking value of the ratio of the estimated small-scale fading component corresponding to the sub-channel selected by each link in a certain time slot to the total interference power on all channels are the key features that best reflect the channel state information corresponding to the imperfect CSI. Therefore, the present invention introduces this information in the state set, and specifically designs the corresponding rewards so that the resource allocation model system can obtain better gains when the channel estimation error is introduced.

根据本发明的一个实施例，采用如下方式计算频谱效率奖励：According to one embodiment of the present invention, the spectrum efficiency bonus is calculated in the following manner:

其中，

in,

其中，

表示第k条链路在时隙t选择第n个子信道的外部干扰性，

表示第k′条链路在时隙t选择第n个子信道对应的频谱效率。需要说明的是，通过定义干扰的权重系数，可以降低了奖励函数的方差，优选的，φ＝1。in,

It represents the spectral efficiency corresponding to the k′th link selecting the nth subchannel in time slot t. It should be noted that by defining the weight coefficient of interference, the variance of the reward function can be reduced. Preferably, φ=1.

为了更好讲解本发明的初始资源分配系统的参数更新过程，以下以Dueling DQN网络和DDPG网络组成的初始资源分配系统并且采用从信道选择回放池和功率选择回放池随机采样的方式进行参数更新的过程为例进行展开说明。需要说明的是，选择Dueling DQN网络可以区分频谱效率增益是取决于所采取的子信道动作，还是仅仅因为输入的状态集合较好，可以很好的完成信道分配任务，以进一步提高初始资源分配系统收敛后的性能。In order to better explain the parameter update process of the initial resource allocation system of the present invention, the following is an example of an initial resource allocation system composed of a Dueling DQN network and a DDPG network, and the parameter update process is performed by random sampling from a channel selection playback pool and a power selection playback pool. It should be noted that the selection of the Dueling DQN network can distinguish whether the spectral efficiency gain depends on the sub-channel action taken or simply because the input state set is better, and the channel allocation task can be completed well, so as to further improve the performance of the initial resource allocation system after convergence.

如图2所示，由Dueling DQN网络、DDPG(包括Actor网络)和Critic Network(Critic网络)组成初始资源分配系统，其中，Actor Network(Actor 网络)和CriticNetwork(Critic网络)组成DDPG网络。需要说明的是，上述的信道分配策略对应的是在每个时隙每条链路对子信道的信道选择动作，功率分配策略对应的是在每个时隙每条链路对所选子信道的功率分配动作。以下将以初始资源分配系统产生一条信道分配经验和一条功率分配经验的生成过程来示意性的讲解经验产生的过程：第k条链路在时隙t的状态集合

输入Dueling DQN网络，Dueling DQN网络预测出子信道的信道选择动作

在预测的子信道的信道选择动作

后，基于

更新

得到更新后的状态集

(即，

中

对应的状态集

)并以

作为DDPG网络的输入，DDPG网络中的Actor网络基于

预测出子信道的功率分配动作

基站在时隙t开始时依次执行两种动作(

和

)以确定其关联的子信道和该子信道的发射功率，基站依次执行两种动作(

和

)后，并与无线网络环境(即非完美CSI下超密集网络环境)交互后会产生下一时隙t+1的状态集合

以及基于信道分配模型预测的子信道的信道选择动作

以及功率分配模型预测的子信道的信道分配动作

计算频谱效率奖励

输入Dueling DQN网络会预测出子信道的选择动作

在预测的子信道的信道选择动作

后，基于

更新

得到更新后的状态集

(即，

中

对应的状态集

)。其中，将

作为一条信道分配经验存入信道选择回放池(图2中所示)以及将

作为一条功率分配经验存入功率选择回放池(图2中所示)。以此在训练初始资源分配系统的过程中不断产生信道分配经验和功率分配经验。As shown in Figure 2, the initial resource allocation system is composed of Dueling DQN network, DDPG (including Actor network) and Critic Network, where Actor Network and CriticNetwork constitute DDPG network. It should be noted that the above-mentioned channel allocation strategy corresponds to the channel selection action of each link for the sub-channel in each time slot, and the power allocation strategy corresponds to the power allocation action of each link for the selected sub-channel in each time slot. The following will use the generation process of a channel allocation experience and a power allocation experience generated by the initial resource allocation system to schematically explain the process of experience generation: the state set of the kth link in time slot t

Input Dueling DQN network, Dueling DQN network predicts the channel selection action of the subchannel

Channel selection action on the predicted subchannel

Afterwards, based on

renew

Get the updated state set

(Right now,

middle

The corresponding state set

) and

As the input of the DDPG network, the Actor network in the DDPG network is based on

Predict the power allocation action of the subchannel

The base station performs two actions in sequence at the beginning of time slot t (

and

) to determine its associated subchannel and the transmission power of the subchannel, the base station performs two actions in sequence (

and

) and interacts with the wireless network environment (i.e., ultra-dense network environment under non-perfect CSI) to generate the state set for the next time slot t+1

and channel selection actions for subchannels predicted by the channel allocation model

And the channel allocation actions of the subchannels predicted by the power allocation model

Calculating Spectral Efficiency Bonus

Input Dueling DQN network will predict the selected action of the subchannel

Channel selection action on the predicted subchannel

Afterwards, based on

renew

Get the updated state set

(Right now,

middle

The corresponding state set

).

As a channel allocation experience, it is stored in the channel selection playback pool (shown in Figure 2) and

As a power allocation experience, it is stored in the power selection playback pool (as shown in FIG2 ). In this way, channel allocation experience and power allocation experience are continuously generated during the process of training the initial resource allocation system.

在参数更新过程中，基于信道选择回放池中的经验更新dueling DQN的参数至收敛以及基于功率选择回放池中的经验更新DDPG网络的参数至收敛。需要说明的是，DDPG中的Critic网络仅在训练期间使用，实际部署时只需要Actor网络进行功率分配，即以Dueling DQN网络和DDPG网络组成的初始资源分配系统为例，最终的资源分配系统为：训练至收敛Dueling DQN网络与Actor网络。During the parameter update process, the parameters of the Dueling DQN are updated to convergence based on the experience in the channel selection replay pool, and the parameters of the DDPG network are updated to convergence based on the experience in the power selection replay pool. It should be noted that the Critic network in DDPG is only used during training. In actual deployment, only the Actor network is needed for power allocation. That is, taking the initial resource allocation system composed of the Dueling DQN network and the DDPG network as an example, the final resource allocation system is: Dueling DQN network and Actor network trained to convergence.

由于训练dueling DQN和DDPG网络至收敛是本领域技术人员均知的过程，此处不在赘述收敛的具体条件，以下继续以Dueling DQN网络和DDPG网络组成的初始资源分配系统讲解初始资源分配系统的一次参数更新过程。优选的，采用对信道选择回放池进行随机采样获得多条信道分配经验并基于采样到的信道分配经验计算梯度更新Dueling DQN网络的参数的方式，以及从功率选择回放池进行随机采样获得多条功率分配经验并基于采样到的功率分配经验计算梯度更新DDPG网络的参数方式，来讲解初始资源分配系统的一次参数更新过程。Since training the dueling DQN and DDPG networks to convergence is a process known to those skilled in the art, the specific conditions for convergence are not described here. The following continues to explain the parameter update process of the initial resource allocation system using the initial resource allocation system composed of the Dueling DQN network and the DDPG network. Preferably, the parameter update process of the initial resource allocation system is explained by randomly sampling the channel selection playback pool to obtain multiple channel allocation experiences and calculating the gradient based on the sampled channel allocation experience to update the parameters of the Dueling DQN network, and by randomly sampling the power selection playback pool to obtain multiple power allocation experiences and calculating the gradient based on the sampled power allocation experience to update the parameters of the DDPG network.

优选的，从信道选择回放池随机采样信道分配经验得到信道分配经验集合B₁，并采用以下规则计算梯度来更新Dueling DQN网络的参数：Preferably, the channel allocation experience is randomly sampled from the channel selection playback pool to obtain a channel allocation experience set B ₁ , and the following rules are used to calculate the gradient to update the parameters of the Dueling DQN network:

其中，

in,

其中，θ_c表示Dueling DQN网络中隐藏层的可训练参数，β表示值函数V的全连接层的可训练参数，χ表示优势函数A的全连接层的可训练参数，B₁表示随机采样的信道分配经验集合，其中，

表示B₁中的一条信道分配经验，|B₁|表示信道分配经验集合的经验条数,

表示Dueling DQN的目标值，

表示时隙t+1的状态信息集合，Q′

表示时隙t的Q函数值，γ′表示Dueling DQN网络中的折扣系数，

分别表示Dueling DQN网络中目标网络的参数，

表示时隙t+1的子信道选择动作，

表示处在

时的价值函数,

表示在状态

时选择动作

优势函数值，

表示在状态

时选择动作

优势函数值，|A|表示

的求和计算的次数。Where _θc represents the trainable parameters of the hidden layer in the Dueling DQN network, β represents the trainable parameters of the fully connected layer of the value function V, χ represents the trainable parameters of the fully connected layer of the advantage function A, and _B1 represents the randomly sampled channel allocation experience set, where

represents a channel allocation experience in B ₁ , |B ₁ | represents the number of experiences in the channel allocation experience set,

represents the target value of Dueling DQN,

represents the state information set of time slot t+1, Q′

represents the Q function value at time slot t, γ′ represents the discount coefficient in the Dueling DQN network,

They represent the parameters of the target network in the Dueling DQN network,

represents the subchannel selection action for time slot t+1,

Indicates being in

The value function when

Indicates in status

Select action

Advantage function value,

Indicates in status

Select action

Advantage function value, |A| represents

The number of summations to calculate.

对于DDPG网络的更新，DDPG网络使用神经网络来近似行为值函数Q(s,a)(Critic网络)和动作函数u_θ(s)(Actor网络)。For the update of the DDPG network, the DDPG network uses a neural network to approximate the behavior value function Q(s,a) (Critic network) and the action function u _θ (s) (Actor network).

优选的，为了更新Critic网络的网络参数θ_Q，采用时间差分(TD，temporaldifference)误差法，从功率选择回放池随机采样功率分配经验得到功率分配经验集合B₂并采用以下规则下计算最小化均方误差来更新Critic网络的参数：Preferably, in order to update the network parameter θ _Q of the Critic network, a temporal difference (TD) error method is used to randomly sample power allocation experience from the power selection playback pool to obtain a power allocation experience set B ₂ and calculate the minimum mean square error under the following rules to update the parameters of the Critic network:

其中，

in,

其中，B₂表示随机采样的功率分配经验集合,其中，

表示功率分配经验集合中一条功率分配经验,

表示Critic网络的Q函数在输入

时的函数值，

(Actor网络激活函数确定的)表示Actor网络第k条链路在时隙t经过Dueling DQN输出固定子信道的信道选择动作

后输出的功率的确定的功率分配动作,|B₂|表示随机采样的功率分配经验集合的经验条数，y′表示DDPG网络的目标值，γ表示DDPG网络中的折扣系数，

表示Critic网络的Q函数在输入

时的函数值。Where B ₂ represents the randomly sampled power allocation experience set, where

represents a power allocation experience in the power allocation experience set,

Indicates that the Q function of the Critic network is input

The function value when

(determined by the Actor network activation function) represents the channel selection action of the kth link of the Actor network at time slot t after Dueling DQN outputs a fixed subchannel

The power allocation action of the output power is determined, |B ₂ | represents the number of experience items in the randomly sampled power allocation experience set, y′ represents the target value of the DDPG network, γ represents the discount coefficient in the DDPG network,

Indicates that the Q function of the Critic network is input

The function value at .

基于采集到功率分配经验集合B₂并采用如下规则计算梯度来更新Actor网络的参数：Based on the collected power allocation experience set _B2 , the following rules are used to calculate the gradient to update the parameters of the Actor network:

其中，

表示Critic网络的Q函数在输入为

时的函数值，

表示Critic网络的Q函数在输入为

时的函数值。in,

The Q function of the Critic network is

The function value when

The Q function of the Critic network is

The function value at .

需要说明的是，对于DQN网络或DDQN网络与DDPG网路组成的初始分配网络时，DQN网络或DDQN网络的更新主要是与Dueling DQN网络的Q函数的计算方式不同，DQN网络或DDQN网络的Q函数及更新过程是本领域技术人员均知的，此处不在赘述DQN网络或DDQN网络的参数更新过程。It should be noted that, for the initial allocation network composed of a DQN network or a DDQN network and a DDPG network, the update of the DQN network or the DDQN network is mainly different from the calculation method of the Q function of the Dueling DQN network. The Q function and update process of the DQN network or the DDQN network are well known to those skilled in the art, and the parameter update process of the DQN network or the DDQN network will not be repeated here.

此外，在上述无线网络资源分配系统的构建方法的基础上，本发明还提供一种无线网络资源管理方法，如图3所示，一种无线网络资源管理方法包括对采用构建无线网络资源分配系统的方法中具有中断概率约束的非凸优化目标的建立过程对整个无线网络环境进行数学化描述，并基于此形成具有中断概率约束的非凸优化目标(即图3中的模型建立)，然后将有中断概率约束的非凸优化目标转换为不含中断概率约束的非凸优化目标(即图3中的参数变换)，采用初始资源分配系统进行求解训练(即图3中的双层网络架构)，当初始资源分配系统收敛时，即可以使用资源分配系统得到资源分配方案来分配无线通信系统中的无线网络资源。根据本发明的一个实施例，一种无线网络资源管理方法包括：T1、获取无线通信系统在上一时隙的无线网络状态；T2、基于步骤T1获得的上一时隙的无线网络状态，采用上述的构建无线网络资源分配系统的方法得到的无线资源分配系统预测下一时刻的资源分配策略；T3、基于步骤T2得到的下一时刻的资源分配策略分配无线通信系统中的无线网络资源。In addition, based on the above-mentioned method for constructing a wireless network resource allocation system, the present invention also provides a wireless network resource management method, as shown in FIG3, a wireless network resource management method includes mathematically describing the entire wireless network environment by using the establishment process of a non-convex optimization target with an interruption probability constraint in the method for constructing a wireless network resource allocation system, and based on this, forming a non-convex optimization target with an interruption probability constraint (i.e., model establishment in FIG3), and then converting the non-convex optimization target with an interruption probability constraint into a non-convex optimization target without an interruption probability constraint (i.e., parameter transformation in FIG3), using an initial resource allocation system for solution training (i.e., a two-layer network architecture in FIG3), when the initial resource allocation system converges, the resource allocation system can be used to obtain a resource allocation scheme to allocate wireless network resources in the wireless communication system. According to an embodiment of the present invention, a wireless network resource management method includes: T1, obtaining a wireless network state of the wireless communication system in the previous time slot; T2, based on the wireless network state of the previous time slot obtained in step T1, using the wireless resource allocation system obtained by the above-mentioned method for constructing a wireless network resource allocation system to predict a resource allocation strategy at the next moment; T3, allocating wireless network resources in the wireless communication system based on the resource allocation strategy at the next moment obtained in step T2.

在一种无线网络资源管理方法的基础上，本发明还提供一种无线通信系统，所述系统包括多个基站，所述每个基站包括无线资源管理单元，所述无线资源管理单元被配置采用上述的无线网络资源管理方法分配基站中的无线网络资源。其中，无线资源管理单元中配置的无线网络资源分配系统在训练阶段采用集中训练的模式，即选用多小区多用户网络场景中的状态集合进行训练；在部署阶段则将集中训练得到的训练完成的无线网络资源分配系统分布式地部署到每个基站。以此来提高无线通信系统分配无线网络资源的效果。On the basis of a wireless network resource management method, the present invention also provides a wireless communication system, the system includes multiple base stations, each base station includes a wireless resource management unit, and the wireless resource management unit is configured to use the above wireless network resource management method to allocate wireless network resources in the base station. Among them, the wireless network resource allocation system configured in the wireless resource management unit adopts a centralized training mode in the training stage, that is, a state set in a multi-cell multi-user network scenario is selected for training; in the deployment stage, the trained wireless network resource allocation system obtained by centralized training is distributedly deployed to each base station. In this way, the effect of allocating wireless network resources by the wireless communication system is improved.

三、实验验证3. Experimental Verification

为了更好的说明本发明的技术效果，通过以下仿真实验进行验证。In order to better illustrate the technical effect of the present invention, verification is performed through the following simulation experiments.

首先介绍仿真参数设置：将无线网络场景设置为一个下行链路具有多小区多用户网络场景，其中K个链路分布在M个小区中且共享N个正交子信道，即每个小区有M/K个用户；对于小区i，基站BS位于小区i的中心并服务于随机分布在小区内的M/K个用户；大规模路径损耗由128.1+37.6log₁₀(d)计算，其中d是从发射器到接收器的距离，以公里为单位；用户接收到的信号与干扰加噪声比SINR的上限设置为30dB，噪声功率σ²设置为-114dBm，需要优化问题选择最大化频谱效率，本发明的初始分配网络采用决斗DQN网络(即Dueling DQN网络)和DDPG网络，Dueling DQN网络和DDPG网络均具有三个隐藏层且分别具有200、200和100个神经元。仿真参数除上述设置外，其余仿真参数详细设置参数如表1所示。First, the simulation parameter settings are introduced: the wireless network scenario is set as a downlink with a multi-cell multi-user network scenario, where K links are distributed in M cells and share N orthogonal subchannels, that is, each cell has M/K users; for cell i, the base station BS is located at the center of cell i and serves M/K users randomly distributed in the cell; the large-scale path loss is calculated by 128.1+37.6log ₁₀ (d), where d is the distance from the transmitter to the receiver in kilometers; the upper limit of the signal to interference plus noise ratio SINR received by the user is set to 30dB, and the noise power σ ² is set to -114dBm. The optimization problem needs to select the maximum spectrum efficiency. The initial allocation network of the present invention adopts the dueling DQN network (i.e., Dueling DQN network) and the DDPG network. The Dueling DQN network and the DDPG network both have three hidden layers and 200, 200 and 100 neurons respectively. Simulation parameters In addition to the above settings, the detailed setting parameters of the remaining simulation parameters are shown in Table 1.

表1Table 1

仿真参数Simulation parameters 值value 仿真参数Simulation parameters 值value 小区半径Cell radius 200m200m 链路的功率阈值Link power threshold 38dB38dB 中断概率Probability of interruption 0.10.1 信道估计误差方差Channel estimation error variance 0.10.1 时隙间隔Time slot interval 20ms20ms 干扰的权重系数Interference weight coefficient 11

在本仿真实验中，采用本发明提供构建无线网络资源分配系统的方法(以下简称本专利所提算法)与其他一些基线算法一起进行三组对照实验(测试训练阶段的收敛性、泛化能力及频谱效率性能、算法的频谱效率在不同子信道数量下的性能)，其中基线算法包括随机算法和背景技术中参考文献[5]记载的random、FP算法(分式规划算法)、参考文献[7]记载的joint learning算法(联合学习算法)、参考文献[9]记载的分布学习算法；具体而言，分式规划算法包括了随机分配的子信道和功率值和传统的基于模型驱动的算法，具有较高的计算复杂度；联合学习算法采用一个DQN网络优化两个变量即子信道和功率；分布学习算法在完美CSI下采用DQN优化子信道以及采用DDPG优化功率。In this simulation experiment, the method for constructing a wireless network resource allocation system provided by the present invention (hereinafter referred to as the algorithm proposed in this patent) is used together with some other baseline algorithms to conduct three groups of control experiments (testing the convergence, generalization ability and spectrum efficiency performance of the training phase, and the performance of the spectrum efficiency of the algorithm under different numbers of sub-channels). The baseline algorithms include random algorithms and the random, FP algorithms (fractional programming algorithms) recorded in reference [5] in the background technology, the joint learning algorithm recorded in reference [7], and the distributed learning algorithm recorded in reference [9]. Specifically, the fractional programming algorithm includes randomly assigned sub-channels and power values and traditional model-driven algorithms, and has a high computational complexity. The joint learning algorithm uses a DQN network to optimize two variables, namely, sub-channels and power. The distributed learning algorithm uses DQN to optimize sub-channels and DDPG to optimize power under perfect CSI.

在训练阶段，本专利所提算法和上述四种基线算法的训练过程包括20episode，每个episode包含2000个时隙，即算法在一个episode内经过固定的2000个时间步后停止算法训练及参数更新，并在每episode开始时设置新的用户分布并且重置学习率等参数，以此来使得本专利所提算法和上述四种基线算法收敛。以下将对本专利所提算法和上述四种基线算法对应的数据进行对比说明。In the training phase, the training process of the algorithm proposed in this patent and the above four baseline algorithms includes 20 episodes, each episode contains 2000 time slots, that is, the algorithm stops algorithm training and parameter update after a fixed 2000 time steps in an episode, and sets a new user distribution and resets parameters such as the learning rate at the beginning of each episode, so as to make the algorithm proposed in this patent and the above four baseline algorithms converge. The following will compare the data corresponding to the algorithm proposed in this patent and the above four baseline algorithms.

为了测试本专利所提算法和上述四种基线算法在训练阶段的收敛性，在每个episode中只重置了学习率等参数而没有更新用户的分布。当用户数量为25，基站数量和子信道数量为5时，本专利所提算法和上述四种基线算法的收敛性能如图4所示，可见本专利所提算法和上述四种基线算法在每个episode基本(随机算法和分式规划算法除外)都收敛到4.0bps/Hz附近。在收敛的算法中，本专利所提算法的收敛速度快于基于学习的基准算法(联合学习算法和分布学习算法)，此外，上述四种基线算法可实现的频谱效率低于本专利所提算法，且当迭代次数较少即5000-6000迭代时，上述四种基线算法的频谱效率均远低于所提的方案。由此，本专利所提算法在收敛速率方面具有显著优势。In order to test the convergence of the algorithm proposed in this patent and the above four baseline algorithms in the training phase, only the learning rate and other parameters were reset in each episode without updating the distribution of users. When the number of users is 25, the number of base stations and the number of subchannels are 5, the convergence performance of the algorithm proposed in this patent and the above four baseline algorithms is shown in Figure 4. It can be seen that the algorithm proposed in this patent and the above four baseline algorithms basically converge to around 4.0bps/Hz in each episode (except for the random algorithm and the fractional programming algorithm). Among the converged algorithms, the convergence speed of the algorithm proposed in this patent is faster than that of the learning-based benchmark algorithms (joint learning algorithm and distributed learning algorithm). In addition, the spectral efficiency that can be achieved by the above four baseline algorithms is lower than that of the algorithm proposed in this patent, and when the number of iterations is small, that is, 5000-6000 iterations, the spectral efficiency of the above four baseline algorithms is much lower than that of the proposed solution. Therefore, the algorithm proposed in this patent has a significant advantage in convergence rate.

在泛化能力及频谱效率性能测试中，由于真实的动态无线通信场景中的信道估计误差是随时间变化的，而通过频繁的在线训练来跟踪环境是不现实的。因此，算法的泛化能力在不断变化的环境中非常重要。将上述仿真参数设置中的信道估计误差方差设置为0.01时具有最佳结果的训练模型。在不同条件下进行测试，本专利所提算法和上述四种基线算法可实现的频谱效率与信道估计误差方差的关系如图5所示：随机算法和分式规划算法的性能随着信道估计误差的增大而退化；联合学习算法、分布学习算法算法和本专利所提算法的频谱效率随着信道估计误差的变化几乎保持不变。由此，本专利所提算法在不同的信道估计误差下具有更强的泛化能力且可以实现更高的频谱效率性能。In the generalization ability and spectrum efficiency performance test, since the channel estimation error in the real dynamic wireless communication scenario changes over time, it is unrealistic to track the environment through frequent online training. Therefore, the generalization ability of the algorithm is very important in a constantly changing environment. The training model with the best results when the channel estimation error variance in the above simulation parameter settings is set to 0.01. Tested under different conditions, the relationship between the spectrum efficiency and the channel estimation error variance achievable by the algorithm proposed in this patent and the above four baseline algorithms is shown in Figure 5: The performance of the random algorithm and the fractional programming algorithm degrades as the channel estimation error increases; the spectrum efficiency of the joint learning algorithm, the distributed learning algorithm and the algorithm proposed in this patent remains almost unchanged as the channel estimation error changes. Therefore, the algorithm proposed in this patent has stronger generalization ability under different channel estimation errors and can achieve higher spectrum efficiency performance.

在算法的频谱效率在不同子信道数量下的性能测试中，在信道估计误差方差为0.1时，本专利所提算法和上述四种基线算法的频谱效率在不同子信道数量下的性能，如图6所示：本专利所提算法和上述四种基线算法的可实现每条链路的平均频谱效率随着子信道数量的增加而逐渐增加，然而在定子信道的数量，本专利所提算法可以获得比上述四种基线算法都更高的频谱效率。即，本专利所提算法在多小区网络中更容易扩展，并且随着子信道数量的增加性能更好。In the performance test of the algorithm's spectral efficiency under different numbers of subchannels, when the channel estimation error variance is 0.1, the performance of the spectral efficiency of the algorithm proposed in this patent and the above four baseline algorithms under different numbers of subchannels is shown in Figure 6: The average spectral efficiency of each link that can be achieved by the algorithm proposed in this patent and the above four baseline algorithms gradually increases with the increase of the number of subchannels. However, at a fixed number of subchannels, the algorithm proposed in this patent can obtain higher spectral efficiency than the above four baseline algorithms. That is, the algorithm proposed in this patent is easier to expand in a multi-cell network, and performs better with the increase of the number of subchannels.

综上所述，由于现有的基于学习的方法没有针对非完美CSI进行设计，并且实际的通信环境中由于信道估计误差是客观存在且无法完全消除，直接采用现有的基于学习的算法在非完美CSI环境下所能达到的优化目标效果(即通信性能)较差且算法的收敛速度较低。其中，将估计的信道增益以及相应的误差(通过信道估计误差的方差体现)设计到初始资源分配系统的状态集，并针对性的设计了相应的奖励(例如频谱效率奖励)使得初始资源分配系统在信道估计误差下能获得更优的增益，使得初始资源分配系统收敛速率均好于基线算法，且可以显著提升频谱效率性能，即，提供的构建无线网络资源分配系统的方法获得最终的资源分配系统在实际的动态的无线通信环境(非完美CSI)中更加适用。In summary, since the existing learning-based methods are not designed for non-perfect CSI, and since the channel estimation error is objectively present and cannot be completely eliminated in the actual communication environment, the optimization target effect (i.e., communication performance) that can be achieved by directly using the existing learning-based algorithm in the non-perfect CSI environment is poor and the convergence speed of the algorithm is low. Among them, the estimated channel gain and the corresponding error (reflected by the variance of the channel estimation error) are designed into the state set of the initial resource allocation system, and the corresponding rewards (such as spectrum efficiency rewards) are designed in a targeted manner so that the initial resource allocation system can obtain better gain under the channel estimation error, so that the convergence rate of the initial resource allocation system is better than that of the baseline algorithm, and the spectrum efficiency performance can be significantly improved, that is, the method for constructing a wireless network resource allocation system provided to obtain the final resource allocation system is more applicable in the actual dynamic wireless communication environment (non-perfect CSI).

需要说明的是，虽然上文按照特定顺序描述了各个步骤，但是并不意味着必须按照上述特定顺序来执行各个步骤，实际上，这些步骤中的一些可以并发执行，甚至改变顺序，只要能够实现所需要的功能即可。It should be noted that although the above describes the various steps in a specific order, it does not mean that the various steps must be executed in the above specific order. In fact, some of these steps can be executed concurrently or even in a different order as long as the required functions can be achieved.

本发明可以是系统、方法和/或计算机程序产品。计算机程序产品可以包括计算机可读存储介质，其上载有用于使处理器实现本发明的各个方面的计算机可读程序指令。The present invention may be a system, a method and/or a computer program product. The computer program product may include a computer-readable storage medium carrying computer-readable program instructions for causing a processor to implement various aspects of the present invention.

计算机可读存储介质可以是保持和存储由指令执行设备使用的指令的有形设备。计算机可读存储介质例如可以包括但不限于电存储设备、磁存储设备、光存储设备、电磁存储设备、半导体存储设备或者上述的任意合适的组合。计算机可读存储介质的更具体的例子(非穷举的列表)包括：便携式计算机盘、硬盘、随机存取存储器(RAM)、只读存储器(ROM)、可擦式可编程只读存储器(EPROM或闪存)、静态随机存取存储器(SRAM)、便携式压缩盘只读存储器(CD-ROM)、数字多功能盘(DVD)、记忆棒、软盘、机械编码设备、例如其上存储有指令的打孔卡或凹槽内凸起结构、以及上述的任意合适的组合。Computer readable storage medium can be a tangible device that holds and stores instructions used by an instruction execution device. Computer readable storage medium can include, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination thereof. More specific examples (non-exhaustive list) of computer readable storage medium include: a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disk read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanical encoding device, for example, a punch card or a protruding structure in a groove on which instructions are stored, and any suitable combination thereof.

以上已经描述了本发明的各实施例，上述说明是示例性的，并非穷尽性的，并且也不限于所披露的各实施例。在不偏离所说明的各实施例的范围和精神的情况下，对于本技术领域的普通技术人员来说许多修改和变更都是显而易见的。本文中所用术语的选择，旨在最好地解释各实施例的原理、实际应用或对市场中的技术改进，或者使本技术领域的其它普通技术人员能理解本文披露的各实施例。The embodiments of the present invention have been described above, and the above description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The selection of terms used herein is intended to best explain the principles of the embodiments, practical applications, or technical improvements in the market, or to enable other persons of ordinary skill in the art to understand the embodiments disclosed herein.

Claims

1. A method for constructing a wireless network resource allocation system, wherein the wireless network resource allocation system is used to obtain a wireless network resource allocation strategy according to a wireless network state, wherein the method comprises:

S1. Obtaining a non-convex optimization objective with an interruption probability constraint corresponding to wireless communication requirements in an environment with imperfect global channel state information;

S2, converting the non-convex optimization target obtained in step S1 to obtain a non-convex optimization target without interruption probability constraints;

S3, obtaining imperfect global channel state information of the wireless network;

S4. Taking the non-convex optimization objective in step S2 as the training objective and the non-perfect global channel state information in step S3 as input, the initial resource allocation system is trained to convergence by reinforcement learning, wherein the initial resource allocation system is a system built based on an agent and used to generate an action set based on a wireless network state, and the action set includes a channel allocation strategy and a power allocation strategy.

2. The method according to claim 1, wherein the wireless communication requirement is to maximize the spectrum efficiency of the wireless network, and the non-convex optimization objective with an interruption probability constraint is:

in,

in,

represents a set of subchannel indices,

represents the maximum spectral efficiency of the kth link selecting the nth subchannel in time slot y,

Indicates that in estimating the small-scale fading component

Under the condition

The probability of

A collection of power,

3. The method according to claim 2, characterized in that, in step S2, the non-convex optimization objective is transformed by parameter transformation to obtain a non-convex optimization objective without interruption probability constraints:

in,

Wherein, Ω ^t represents the average spectrum efficiency of the wireless network in time slot t after parameter transformation.

4. The method according to claim 3, characterized in that the initial resource allocation system comprises:

A channel allocation model, for predicting a channel allocation strategy for a time slot based on imperfect global channel state information of the time slot, configured as a DQN network, a DDQN network, or a Dueling DQN network;

A power allocation model is used to predict a power allocation strategy for a time slot based on imperfect global channel state information of the time slot, which is configured as a DDPG network.

5. The method according to claim 4, characterized in that the step S4 comprises:

S41, obtaining the imperfect global channel state information of the input time slot and executing the following steps:

S411, predicting the channel allocation strategy of the input time slot according to the non-perfect global channel state information of the input time slot by the channel allocation model, updating the non-perfect global channel state information of the input time slot based on the predicted channel allocation strategy, and predicting the power allocation strategy of the input time slot according to the updated non-perfect global channel state information of the input time slot by the power allocation model; interacting the predicted channel allocation strategy and power allocation strategy of the input time slot with the wireless network to obtain the non-perfect global channel state information of the next time slot of the input time slot,

and predicting, by the channel allocation model, a channel allocation strategy for the next time slot of the input time slot according to the imperfect global channel state information for the next time slot of the input time slot, and updating the channel allocation strategy for the next time slot of the input time slot based on the channel allocation strategy for the next time slot of the input time slot;

S412, calculating the spectrum efficiency reward of the input time slot based on the channel allocation strategy and power allocation strategy of the input time slot;

S413, using the non-perfect global channel state information of the input time slot, the channel allocation strategy of the input time slot, the spectrum efficiency reward of the input time slot, and the non-perfect global channel state information of the next time slot of the input time slot as a channel allocation experience to be stored in the channel selection replay pool; using the updated non-perfect global channel state information of the input time slot, the power allocation strategy of the input time slot, the spectrum efficiency reward of the input time slot, and the updated non-perfect global channel state information of the next time slot of the input time slot as a power allocation experience to be stored in the power selection replay pool;

S42, using the imperfect global channel state information of the next time slot of the last input time slot as the imperfect global channel state information of the new input time slot;

S43: Update initial resource allocation system parameters based on channel allocation experience in the channel selection playback pool and power allocation in the power selection playback pool until convergence.

6. The method according to claim 5 is characterized in that, in the step S43, when there is a channel allocation experience in the channel selection playback pool, the parameters of the channel allocation model are updated; when there is a power allocation experience in the power selection playback pool, the parameters of the power allocation model are updated.

7. The method according to claim 5, characterized in that in the step S43;

When the channel allocation experience in the channel selection replay pool reaches a preset number of experiences, the parameters of the channel allocation model are updated multiple times until convergence, wherein each update randomly samples from the channel selection replay pool to obtain multiple channel allocation experiences, and based on the sampled channel allocation experiences, the parameters of the channel allocation model are updated by gradient descent;

When the power allocation experience in the power selection replay pool reaches a preset number of experiences, the parameters of the power allocation model are updated multiple times until convergence. During each update, random sampling is performed from the power selection replay pool to obtain multiple power allocation experiences, and the parameters of the power allocation model are updated by gradient descent based on the sampled power allocation experiences.

8. The method according to any one of claims 4 to 7, characterized in that in step S41, the imperfect global channel state information of the input time slot includes the imperfect global channel state information of multiple links selecting different sub-channels in the input time slot:

in,

in,

represents the spectrum efficiency of the kth link at time slot t-1,

The ratio of the total interference power to the ranking value of all channels,

represents the co-channel interference under the subchannel allocation scheme and power allocation scheme of the previous time slot when the k-th link selects the n-th subchannel in time slot t, k ^′ represents other links different from k,

represents the variance of the channel estimation error,

The mean is 0 and the variance is

The complex Gaussian distribution of .

9. The method according to claim 8, characterized in that the spectrum efficiency bonus is calculated in the following manner:

in,

in,

represents the scheduling spectrum efficiency of the kth link selecting the nth subchannel in time slot t, φ is the weight coefficient of interference, k ^′ represents other links different from k,

represents the spectral efficiency of link k ^′ in the nth subchannel at time slot t without interference from the kth link,

It represents the spectral efficiency corresponding to the ^k′th link selecting the nth subchannel in time slot t.

10. The method according to claim 9, characterized in that the DDPG network includes an Actor network and a Critic network,

The final resource allocation system is: a DQN network, a DDQN network or a Dueling DQN network and an Actor network trained to convergence.

11. A wireless network resource management method, characterized in that the method comprises:

T1, obtaining the wireless network status of the wireless communication system in the previous time slot;

T2, based on the wireless network state of the previous time slot obtained in step T1, the resource allocation system obtained by the method described in any one of claims 1 to 10 predicts the resource allocation strategy at the next moment;

T3. Allocate wireless network resources in the wireless communication system based on the resource allocation strategy at the next moment obtained in step T2.

12. A wireless communication system, comprising a plurality of base stations, wherein each base station comprises a wireless resource management unit, and the wireless resource management unit is configured to allocate wireless network resources in the base station using the method as claimed in claim 11.

13. A computer-readable storage medium, characterized in that a computer program is stored thereon, and the computer program can be executed by a processor to implement the steps of any one of the methods of claims 1 to 11.

14. An electronic device, comprising:

one or more processors;

A storage device for storing one or more programs, which, when executed by the one or more processors, enables the electronic device to implement the steps of the method as claimed in any one of claims 1 to 11.