
CN120639690A - UAV mission network message transmission routing planning system and method based on deep reinforcement learning - Google Patents

UAV mission network message transmission routing planning system and method based on deep reinforcement learning

Info

Publication number
CN120639690A
CN120639690A
Authority
CN
China
Prior art keywords
network
situation
module
mission
link
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202511026371.1A
Other languages
Chinese (zh)
Inventor
杨春刚
吴涵
李彤
李紫璇
李�杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN202511026371.1A priority Critical patent/CN120639690A/en
Publication of CN120639690A publication Critical patent/CN120639690A/en
Pending legal-status Critical Current

Links

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L45/00Routing or path finding of packets in data switching networks
    • H04L45/302Route determination based on requested QoS
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/092Reinforcement learning
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L45/00Routing or path finding of packets in data switching networks
    • H04L45/02Topology update or discovery
    • H04L45/08Learning-based routing, e.g. using neural networks or artificial intelligence

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Medical Informatics (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

A deep reinforcement learning-based network message transmission routing planning system and method for drone missions includes the following steps: S101: A user inputs a drone mission and sends it to a mission intent translation module; S102: The drone mission is extracted and standardized through knowledge extraction in the mission intent translation module, and the intent is output as relevant mission message requirements through demand mapping in the strategy mapping module; S103: The situation awareness module collects key data from network situation information and processes and manages it to form a resource situation database; S104: The routing planning module searches for the optimal transmission path based on the QoS requirements and link situation information of the message transmission required for the current mission; S105: Adjustments and optimizations are performed if the network situation changes or a link fails. This invention improves network message transmission efficiency, reduces latency, and reduces costs. Furthermore, adjustments and optimizations are performed when the network situation or node status changes, ensuring timely and reliable transmission of mission messages.

Description

Unmanned aerial vehicle task network message transmission route planning system and method based on deep reinforcement learning
Technical Field
The invention belongs to the technical field of unmanned aerial vehicle task network communication, and particularly relates to an unmanned aerial vehicle task network message transmission route planning system and method based on deep reinforcement learning.
Background
Rapid advances in aviation, electronics, automatic control and artificial intelligence have driven the rapid growth of unmanned aerial vehicle technology. Unmanned aerial vehicles are widely used in various civil and military fields, such as real-time surveillance, search and rescue, military reconnaissance, and inspection of hazardous locations. With the development of unmanned aerial vehicle technology and the advent of numerous miniaturized unmanned aerial vehicles, applications increasingly tend to perform work with multiple cooperating unmanned aerial vehicles or clusters. The cooperating unmanned aerial vehicles communicate using wireless communication devices and are dynamically networked in an ad hoc manner. As task requirements grow stricter and environments more complex, the number of nodes required by the unmanned aerial vehicle ad hoc network has greatly increased, yielding networks of large scale and extremely strong dynamics. The number of messages transmitted between nodes consequently rises sharply, which increases the communication burden of the network, significantly raises routing overhead, and undoubtedly affects the efficiency and accuracy with which unmanned aerial vehicles execute tasks; at the same time, it places higher requirements on the efficiency and dynamic adaptability of routing algorithms, which must adapt to a rapidly changing environment.
Through the above analysis, the problems and defects existing in the prior art are as follows:
(1) In the first prior art, route planning is performed by a traditional routing algorithm based on a static network topology. Such algorithms mainly depend on a pre-constructed network topology graph and perform route selection according to network situation parameters, so an optimal transmission path can be effectively selected for task messages, with high reliability and efficiency in a stable network environment. However, traditional routing algorithms have significant limitations and insufficient adaptability to complex environmental changes. In an unmanned aerial vehicle task network scenario, the topology structure and communication links change from moment to moment; traditional routing algorithms can hardly capture these changes effectively and cannot achieve dynamic self-adaptation, which degrades the reliability and timeliness of task message transmission.
(2) In the second prior art, network behavior and response modes are continuously learned through a reinforcement learning algorithm to plan message transmission routes. When the network changes dynamically or a link fails, the message transmission path can be adjusted and optimized in real time, so that task messages are transmitted reliably and in time and the influence of network state changes on message transmission is reduced. However, a limitation of this technique is that the high dynamics of the unmanned aerial vehicle task network produce a highly complex state space that is difficult to handle with conventional reinforcement learning methods. The strong dependence on large amounts of real-time data, the computational complexity, and the poor model training effect mean that the accuracy of route planning cannot be guaranteed, which affects overall task efficiency.
In summary, the problems and defects of the prior art are as follows:
1. The prior art relies on a pre-constructed network topology graph, shows poor adaptability when facing the dynamic changes of an unmanned aerial vehicle task network, and can hardly adjust and optimize paths in real time. Network dynamics and link failures may significantly reduce the effectiveness of the original message transmission path, thereby hindering the efficient and accurate transmission of task messages.
2. The prior art suffers from complex calculation and poor model training effect, being highly dependent on real-time data and complex route planning models. This dependency makes the system error-prone when data are insufficient or of poor quality, resulting in erroneous planning of the message transmission path. In a highly dynamic or complex environment, model training suffers from the curse of dimensionality, the training effect degrades, and the quality of route planning decisions declines, so the approach cannot meet the requirements of the unmanned aerial vehicle task network for high efficiency and accuracy.
Disclosure of Invention
In order to overcome the defects in the prior art, the invention aims to provide an unmanned aerial vehicle task network message transmission route planning system and method based on deep reinforcement learning, which use the DQN in deep reinforcement learning to carry out route planning and solve the "dimension disaster" (curse of dimensionality) caused by the high dynamics of an unmanned aerial vehicle task network. The method realizes dual intention-situation driving. Under the condition of limited network resources, in order to ensure the delivery of high-value task messages, the DQN analyzes the QoS requirements in the task intents of different unmanned aerial vehicle task networks and the situation information of links to obtain a transmission path suitable for network messages from the source node to the destination node, so as to improve the transmission efficiency of network messages and reduce delay and cost. Meanwhile, when the network situation or a node state changes, adjustment and optimization are carried out to ensure that task messages are transmitted timely and reliably.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
the unmanned aerial vehicle task network message transmission route planning system based on deep reinforcement learning comprises a task intention translation module, a situation awareness module and a route planning module;
The task intention translation module obtains task intention key information from unmanned aerial vehicle service or unmanned aerial vehicle task input by a user through a knowledge extraction technology, and performs network demand mapping by combining a strategy knowledge graph and a network situation knowledge graph to obtain QoS demands of related task messages;
the situation awareness module collects key data of network situation information, performs centralized processing and management on the collected link delay, packet loss rate, bandwidth and load related data, comprehensively evaluates the situation of each path in the network, forms a resource situation database and provides a basis for a subsequent route planning module;
The route planning module obtains a transmission path suitable for network messages from a source node to a destination node by analyzing the QoS requirements in different network task intents and the situation information of network links and using deep reinforcement learning, so as to improve the transmission efficiency of network messages and reduce delay and cost; meanwhile, when the network situation or a node state changes, the route planning module adjusts and optimizes the transmission path to ensure that task messages are transmitted reliably and in time.
The task intention translation module is divided into a knowledge extraction module and a demand mapping module, wherein the knowledge extraction module carries out key knowledge extraction on input intention;
the input intention can be identified through a named entity, key entity information in irregular Chinese intention is extracted, and translated intention tuples are output, so that normalized representation is realized;
And the demand mapping module performs network demand mapping on the extracted intention tuple in combination with a strategy knowledge graph and a network situation knowledge graph to obtain related task message demands, and sends the result to the route planning module to provide basis for the route planning module.
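As a rough illustration of the two-stage translation described above, the following Python sketch stands in for named entity recognition with simple keyword matching and for the strategy knowledge graph with a small lookup table; all entity names, keywords and QoS numbers are illustrative assumptions, not part of the invention.

```python
# Toy sketch of the task-intention translation pipeline: knowledge
# extraction -> normalized intent tuple -> QoS requirement mapping.
# Keywords, the POLICY table and all QoS values are assumptions.
import re

def extract_intent(text):
    """Stand-in for named entity recognition: simple keyword/regex matching."""
    task = "reconnaissance" if "reconnaissance" in text else "unknown"
    m = re.search(r"area\s+(\w+)", text)
    area = m.group(1) if m else None
    return (task, area)  # the normalized intent tuple

# Toy stand-in for the strategy knowledge graph: task type -> message QoS.
POLICY = {
    "reconnaissance": {"max_delay_ms": 50, "min_bandwidth_mbps": 2, "max_loss": 0.01},
}

def map_requirements(intent):
    """Demand mapping: intent tuple -> related task message requirements."""
    task, _area = intent
    return POLICY.get(task, {})

intent = extract_intent("perform reconnaissance over area A3")
print(intent, map_requirements(intent))
```

The emitted requirement dictionary corresponds to the "related task message requirements" that are forwarded to the route planning module.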
Further, the situation awareness module comprises six links: data acquisition, data processing, refinement characterization, information transmission, information fusion, and generation and maintenance of a situation database;
the situation awareness module firstly collects network situation related data (parameters such as network bandwidth, time delay, packet loss rate, channel load and the like) in the network operation process, and data collection is realized;
Then, the acquired data are processed and refined into characterizations, written into data packets, and subjected to information transmission, sharing and information fusion to form link situation information, so that the situation of each path can be comprehensively evaluated;
And finally, generating a situation database and performing active maintenance, wherein the link situation information in the database is the data base of the subsequent route planning module.
Further, in the route planning module, a node set in the network, an intention tuple obtained by the task intention translation module, task message transmission QoS requirement information and link situation information obtained by the situation awareness module are used as initial inputs of the route planning module;
The route planning module searches for an optimal transmission path among the multiple reachable paths P(s,d)={P1,P2,P3,...,Pn} from a source node s to a destination node d of the network topology graph through a deep reinforcement learning algorithm, according to the QoS requirements and link situation information of the message currently to be transmitted; the obtained path meets the QoS requirements of the task message, and when the link situation changes, a transmission path that no longer meets the QoS requirements is re-planned in time by the algorithm, ensuring the real-time and accurate transmission of task messages.
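The set of reachable paths P(s,d)={P1,...,Pn} that the route planning module ranks against the QoS requirements can be sketched on a toy topology; the graph below and the depth-first enumeration are illustrative assumptions, not the invention's algorithm.

```python
# Enumerate the simple paths P(s, d) from source s to destination d
# in a small assumed UAV topology (node -> list of neighbors).

def all_paths(graph, s, d, path=None):
    """Depth-first enumeration of loop-free paths from s to d."""
    path = (path or []) + [s]
    if s == d:
        return [path]
    paths = []
    for nxt in graph.get(s, []):
        if nxt not in path:          # avoid revisiting nodes (no loops)
            paths += all_paths(graph, nxt, d, path)
    return paths

# Toy topology with three reachable paths from "s" to "d".
G = {"s": ["a", "b"], "a": ["d"], "b": ["a", "d"], "d": []}
print(all_paths(G, "s", "d"))
```

Each enumerated path is one candidate Pi whose per-link situation values feed the QoS metric functions.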
Further, the deep reinforcement learning algorithm in the routing planning module performs routing planning by using the DQN, selects transmission efficiency and communication quality as planning targets, wherein the transmission efficiency is measured by time delay, packet loss rate and bandwidth of message transmission, and the communication quality is measured by time delay, bandwidth, packet loss rate, load and jitter of a link;
The state space of the deep reinforcement learning algorithm comprises link situation information and task message requirements, and the action space comprises the next-hop node for message transmission. The agent autonomously perceives the situation of the surrounding environment to realize situation awareness and adaptively generates an effective transmission path according to planning decisions and model training. To obtain an optimal transmission path and ensure that task intention requirements are met, the system needs to interact dynamically with the network environment; a group of state information is generated at each sampling interval when the state space is acquired. Because the high dynamic mobility of unmanned aerial vehicles makes the amount of state information huge, a nonlinear neural network approximation method is selected to approximate the Q function, thereby solving the dimension disaster. Each intelligent node has a convolutional neural network (Convolutional Neural Network, CNN) with the same structure, and the state-action values in the high-dimensional continuous state are fitted by training the CNN.
Different experiences in the experience pool of the algorithm influence the agent's strategy learning differently, and a priority experience playback mechanism is introduced to make full use of high-quality experience and improve model convergence efficiency. The priority of each experience sample is determined based on the state estimation error (TD error) of the sample at each time step in the DQN algorithm. In the early stage of the training process, transmission paths are selected randomly to fill the experience pool with data. The agent then selects actions using a greedy strategy based on the state-space information, and experience samples of agent-environment interaction are continuously stored in the experience pool so that the environment can be fully explored. The online decision result of the CNN is the optimal transmission path available for the task message at the current moment; in the learning stage, priority weights are segmented according to the number of experience samples, a priority weight is selected randomly, and the corresponding experience is chosen.
An unmanned aerial vehicle task network message transmission route planning method based on deep reinforcement learning comprises the following steps:
S101, a user inputs an unmanned aerial vehicle task and sends the unmanned aerial vehicle task to a task intention translation module;
S102, the unmanned aerial vehicle task is given a standardized representation through knowledge extraction by the knowledge extraction module of the task intention translation module; the intention is output as related task message requirements through the demand mapping of the strategy mapping module, and the result is sent to the route planning module as its basis;
S103, the situation awareness module collects key data of the network situation information, processes and manages the key data to form a resource situation database, and provides a data basis for the routing planning module;
S104, the route planning module searches for an optimal transmission path among the multiple reachable paths P(s,d)={P1,P2,P3,...,Pn} from a source node s to a destination node d of the network topology graph through a deep reinforcement learning algorithm, according to the QoS requirements and link situation information of the message to be transmitted by the current task, and the obtained path meets the QoS requirements of the task message;
and S105, if the network situation changes or a link fails, adjustment and optimization are carried out to reduce the influence of dynamic network changes or link failures on network message transmission.
In S101, the user inputs the unmanned aerial vehicle task or service through the front-end interface, and the task or service is then sent to the task intention translation module for translation.
The step S102 specifically includes:
The user inputs task intention in a natural language form;
In the first stage, the knowledge extraction module extracts key knowledge from the input intention; if the given intention is complete and effective, the key entity information in irregular Chinese intents can be extracted through named entity recognition, and translated intention tuples are output to realize a normalized representation;
In the second stage, the strategy mapping module performs network demand mapping on the extracted intention tuples in combination with the strategy knowledge graph and the network situation knowledge graph to obtain the related intention requirements, and sends the result to the route planning module for further use in S104.
The step S103 specifically includes:
the situation awareness module comprises six links, namely data acquisition, data processing, refinement characterization, information transmission, information fusion, and generation and maintenance of a situation database;
The situation awareness module first collects network-situation-related data (network bandwidth, delay, packet loss rate and channel load parameters) during network operation to realize data acquisition; the acquired data are then processed and refined into characterizations, written into data packets, and subjected to information transmission, sharing and fusion to form link situation information;
The information fusion process is as follows:
Let G=(V,E) denote the network topology model, where V is the set of all network nodes in the network model and E is the set of links between network nodes; each link represents a direct path between two adjacent network nodes. Let the source node be s (s∈V) and the destination node be d (d∈V), with multiple transmission paths P(s,d)={P1,P2,P3,...,Pn}, where Pi={ei1,ei2,...,eim}. For any transmission path in the network model, four QoS metric functions are given, namely the delay function delay(P), the bandwidth function bandwidth(P), the packet loss rate function loss(P) and the load function load(P);
Definition 1 (delay): for any transmission path Pi from source node s to destination node d, the delay calculation formula is as follows:
delay(Pi)=∑{delay(eik),eik∈Pi},
where delay(eik) is the delay of each link eik on the transmission path Pi;
Definition 2 (bandwidth): for any transmission path Pi from source node s to destination node d, the bandwidth calculation formula is as follows:
bandwidth(Pi)=min{bandwidth(eik),eik∈Pi},
where bandwidth(eik) is the bandwidth of each link eik on the transmission path Pi;
Definition 3 (packet loss rate): for any transmission path Pi from source node s to destination node d, the packet loss rate calculation formula is as follows:
loss(Pi)=1−∏{(1−loss(eik)),eik∈Pi},
where loss(eik) is the packet loss rate of each link eik on the transmission path Pi;
Definition 4 (load): for any transmission path Pi from source node s to destination node d, the load calculation formula is as follows:
load(Pi)=(1/Hop(Pi))·∑{load(eik),eik∈Pi},
where Hop(Pi) is the hop count of path Pi and load(eik) is the load of each link eik on the transmission path Pi;
And finally, generating a situation database and performing active maintenance, wherein the link situation information in the database is the data base of the subsequent route planning module.
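The information-fusion step that turns per-link data into path-level situation information can be sketched directly from the four definitions above: delay accumulates over links, bandwidth is limited by the bottleneck link, the end-to-end packet loss rate compounds multiplicatively, and load is averaged over the hop count. The per-link values below are illustrative.

```python
# Compute the four path-level QoS metrics of Definitions 1-4 from
# per-link situation data. Link values are illustrative assumptions.
import math

def path_metrics(links):
    """links: list of dicts with per-link delay, bandwidth, loss, load."""
    delay = sum(l["delay"] for l in links)                        # Def. 1: sum
    bandwidth = min(l["bandwidth"] for l in links)                # Def. 2: bottleneck
    loss = 1.0 - math.prod(1.0 - l["loss"] for l in links)        # Def. 3: compounded
    load = sum(l["load"] for l in links) / len(links)             # Def. 4: per-hop average
    return {"delay": delay, "bandwidth": bandwidth, "loss": loss, "load": load}

# A two-hop path Pi = {e_i1, e_i2} with assumed link situation values.
P1 = [
    {"delay": 10, "bandwidth": 5, "loss": 0.01, "load": 0.3},
    {"delay": 15, "bandwidth": 3, "loss": 0.02, "load": 0.5},
]
m = path_metrics(P1)
print(m)
```

These per-path tuples are exactly the entries stored in the resource situation database for the route planning module.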
The step S104 specifically includes:
Route planning is performed using a Deep Q-Network (DQN) in deep reinforcement learning. Considering the difficulty of route planning caused by the diverse QoS requirements of unmanned aerial vehicle task network messages, the transmission efficiency Lr and the communication quality Lq are selected as the route planning targets, where Lr is measured by the delay, packet loss rate and bandwidth of the message transmission path, and Lq is measured by the delay, bandwidth, packet loss rate and load of the links. The objective functions Lr and Lq are calculated as:
Lr=v1*delay(Pi)+v2*loss(Pi)+v3*bandwidth(Pi),
v1+v2+v3=1,
Lq=w1*delay(Pi)+w2*bandwidth(Pi)+w3*loss(Pi)+w4*load(Pi),
w1+w2+w3+w4=1,
where loss(Pi) denotes the packet loss rate of the transmission path, bandwidth(Pi) its bandwidth, delay(Pi) its delay and load(Pi) its load; v1, v2, v3 are the weights of the transmission-efficiency metrics and w1, w2, w3, w4 are the weights of the link quality.
The state space contains link situation information and message transmission requirements and is defined as O=[o1,o2,...,ok]T, where on=[delay(eik),bandwidth(eik),loss(eik),load(eik),D,B,Loss,Load]T is the set of situation information of each link eik on a transmission path Pi together with the message transmission requirements of the current task intention; the action space contains the next-hop node information for message transmission and is defined as an=(next_noden); and the reward function R(st,at) is defined as:
R(st,at)=α*Lr+β*Lq,
α+β=1,
where α and β are the weights of the transmission efficiency Lr and the communication quality Lq;
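A minimal sketch of the reward R(st,at)=α·Lr+β·Lq follows. Lq uses the weighted link-quality sum given in the text; since the original names the metrics of Lr (delay, packet loss rate, bandwidth) without printing its formula, Lr is taken here as an assumed analogous weighted sum. All weights and metric values are illustrative.

```python
# Sketch of the reward R(st, at) = alpha*Lr + beta*Lq.
# Lq follows the text's weighted sum; the form of Lr is an assumption.

def Lq(m, w=(0.25, 0.25, 0.25, 0.25)):
    # link quality: weighted sum over delay, bandwidth, loss, load
    return w[0]*m["delay"] + w[1]*m["bandwidth"] + w[2]*m["loss"] + w[3]*m["load"]

def Lr(m, v=(0.4, 0.3, 0.3)):
    # transmission efficiency: assumed weighted sum over delay, loss, bandwidth
    return v[0]*m["delay"] + v[1]*m["loss"] + v[2]*m["bandwidth"]

def reward(m, alpha=0.5, beta=0.5):
    # R(st, at) = alpha*Lr + beta*Lq with alpha + beta = 1
    return alpha * Lr(m) + beta * Lq(m)

# Illustrative path metrics (delay, bandwidth, loss, load).
metrics = {"delay": 25, "bandwidth": 3, "loss": 0.0298, "load": 0.4}
print(reward(metrics))
```

In practice the weights would be tuned so that the reward favors paths satisfying the task's QoS requirements; the raw metrics would typically also be normalized before weighting.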
The agent autonomously perceives the situation of the surrounding environment to realize situation awareness and adaptively generates an effective transmission path according to planning decisions and model training. To obtain an optimal transmission path and ensure that task intention requirements are met, the system needs to interact dynamically with the network environment, generating a group of state information at each sampling interval when acquiring the state space; a nonlinear neural network approximation method is selected to approximate the Q function so as to solve the dimension disaster, and the state-action values in the high-dimensional continuous state are fitted by training the CNN;
the Q function of node n at time t is expressed as Qnt(st,a), indicating the maximum long-term accumulated reward value that node n can obtain by executing action a in state st; the Q value update equation is expressed as
Q(st,at)←Q(st,at)+α[R(st,at)+γ·max_a′ Q(st+1,a′)−Q(st,at)],
where α is the learning rate and γ is the discount factor;
To minimize the gap between the estimated Q value and the target Q value, the loss function of the DQN is defined as
L(θn)=E[(yt−Q(st,at;θn))2],
where θn is the weight coefficient of the prediction CNN of node n;
the target Q value is
yt=R(st,at)+γ·max_a′ Q(st+1,a′;θn−),
where θn− is the weight coefficient of the target CNN model;
The Q value at each moment is updated using a CNN consisting of an input layer, a convolutional layer (Conv), a flattening layer, a fully connected layer (FC) and an output layer, where Conv has 16 filters of size 10×4 with stride 2 and tanh as the activation function, FC has 512 units with sigmoid as the activation function, and the output layer uses a linear activation function. The CNN is trained by gradient descent, and the gradient of the loss function is expressed as
∇θnL(θn)=E[(yt−Q(st,at;θn))·∇θnQ(st,at;θn)].
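The target, TD error, loss and gradient step above can be sketched with NumPy, substituting a small Q-table for the CNN (the update rule is independent of the function approximator); the state, action and reward values are illustrative.

```python
# Sketch of one DQN update: target y_t, TD error, squared loss, and a
# gradient step. A Q-table stands in for the prediction/target CNNs;
# all transition values are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, lr = 8, 4, 0.9, 0.1

theta = rng.normal(size=(n_states, n_actions))   # prediction "network" weights
theta_target = theta.copy()                      # target network weights (theta minus)

# One illustrative transition (s, a, reward, s').
s, a, r, s2 = 0, 1, 1.0, 3

y = r + gamma * theta_target[s2].max()           # target Q value y_t
td_error = y - theta[s, a]                       # TD error delta (also the PER priority signal)
loss = td_error ** 2                             # squared loss L(theta)
theta[s, a] += lr * td_error                     # gradient-descent step on L(theta)

print(float(loss))
```

After the step, the residual TD error shrinks by the factor (1 − lr), illustrating why repeated replay of high-error samples speeds convergence.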
Different experiences in the DQN experience pool influence the agent's strategy learning differently, and a priority experience playback mechanism is introduced to make full use of high-quality experience and improve model convergence efficiency. The priority pj of each experience sample j is determined from the state estimation error (TD error) δj at each time step; experiences with a large TD error are played back more frequently. To avoid sample j never being sampled when δj is 0, a small positive number ζ is added:
pj=|δj|+ζ
The playback probability P(j) of experience j is:
P(j)=pj^α/∑i pi^α,
where α is the parameter controlling the degree of prioritization; when α is 0, priority experience playback degenerates into random sampling. Stochastic priority sampling mitigates the loss of sampling diversity but introduces a bias. To ensure that the learning strategy under biased sampling is the same as under uniform sampling, the bias is corrected with the importance sampling weight
wj=(1/(n·P(j)))^β,
normalized by the maximum weight value for stability; β is an annealing factor that is annealed from its initial value to 1 at the end of learning and acts together with α to correct the bias, and n is the number of samples.
In the training phase, transmission paths are selected randomly at the first K moments to fill the experience pool with data. The agent then selects actions using a greedy strategy based on the state-space information, and experience samples of agent-environment interaction are continuously stored in the experience pool so that the agent can fully explore the surrounding environment. The online decision result of the CNN is the optimal transmission path available for the task message at the current moment; in the learning stage, priority weights are segmented according to the number of experience samples, a priority weight is selected randomly, and the corresponding experience is chosen.
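The prioritized sampling step can be sketched as follows, using proportional priorities pj=|δj|+ζ, sampling probabilities P(j)∝pj^α, and importance weights normalized by their maximum; the TD-error values and hyperparameters are illustrative.

```python
# Sketch of proportional prioritized experience replay sampling.
# TD errors and the alpha/beta/zeta hyperparameters are illustrative.
import numpy as np

rng = np.random.default_rng(1)

def sample_prioritized(td_errors, k, alpha=0.6, beta=0.4, zeta=1e-6):
    p = np.abs(td_errors) + zeta                 # priority p_j = |delta_j| + zeta
    probs = p**alpha / (p**alpha).sum()          # playback probability P(j)
    idx = rng.choice(len(p), size=k, p=probs)    # biased (prioritized) sampling
    w = (len(p) * probs[idx]) ** (-beta)         # importance sampling weights
    return idx, w / w.max()                      # normalize by the max weight

td = np.array([0.1, 2.0, 0.5, 3.0])              # TD errors of four stored samples
idx, w = sample_prioritized(td, k=2)
print(idx, w)
```

Samples with larger TD error are drawn more often, while the importance weights scale down their gradient contribution so the learned strategy matches uniform sampling in expectation.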
The specific steps of S105 are as follows:
When the network situation changes or a link fails so that the transmission path no longer meets the QoS requirements of task message transmission, the DQN performs route planning again in combination with the current network situation information and adjusts and optimizes the transmission path, so as to reduce the influence caused by dynamic network changes or link failures and give the network real-time decision-making and optimization capability.
A computer device comprises a memory and a processor, the memory storing a computer program; when executing the computer program, the processor implements the steps of the deep reinforcement learning-based unmanned aerial vehicle task network message transmission route planning method.
An information data processing system provides a machine learning environment, has self-learning and self-generating capabilities, and provides an intelligent foundation for the method.
The invention has the beneficial effects that:
The invention can realize deep understanding of unmanned aerial vehicle task intentions, automatically plan message transmission paths on demand, realize accurate and efficient distribution of unmanned aerial vehicle task messages, and adaptively adjust the distribution paths according to a network environment that changes in real time, thereby improving the efficiency of the unmanned aerial vehicle task network and meeting the high-performance requirements of tasks in a dynamic environment. Meanwhile, it can help managers or users adjust network resource deployment in time, improving the adaptability and reliability of the unmanned aerial vehicle task network.
The present invention introduces a priority experience playback mechanism. The priority of each experience sample is determined based on the state estimation error (TD error) at each time step. In the early stage of the training process, transmission paths are selected randomly and the experience pool is filled with data. The agent then selects actions using a greedy strategy based on the state-space information, and experience samples of agent-environment interaction are continuously stored in the experience pool so that the environment can be fully explored. The online decision result of the CNN is the optimal transmission path available for the task message at the current moment; in the learning stage, priority weights are segmented according to the number of experience samples, a priority weight is selected randomly, and the corresponding experience is chosen.
Drawings
FIG. 1 is a flow chart of the system of the present invention.
Fig. 2 is a specific implementation of the method of the present invention.
FIG. 3 is a flow chart showing the task intent translation module according to the present invention.
Fig. 4 is a flow chart of an embodiment of the situation awareness module of the present invention.
FIG. 5 is a flow chart of an embodiment of the deep reinforcement learning algorithm of the present invention.
Fig. 6 is a schematic diagram comparing the present invention with AODV and OLSR routing protocols.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings.
Aiming at the problems in the prior art, the invention provides an unmanned aerial vehicle task network message transmission route planning system and method based on deep reinforcement learning, and the invention is described in detail below with reference to the accompanying drawings.
As shown in fig. 1, the system includes a task intent translation module, a situation awareness module and a route planning module. The task intent translation module obtains key task-intent information through knowledge extraction, and performs network demand mapping in combination with a strategy knowledge graph and a network situation knowledge graph to obtain the related task message requirements. The situation awareness module collects key network-situation data, centrally processes and manages the collected data, comprehensively evaluates the situation of each path in the network, and forms a resource situation database that serves as the basis for the subsequent route planning module. The route planning module analyzes the QoS requirements of different network task intents and the situation information of the links, and uses a deep reinforcement learning algorithm to obtain a suitable transmission path for network messages from the source node to the destination node, so as to improve transmission efficiency and reduce delay and cost. When the network situation or a node state changes, the path is adjusted and optimized to ensure that task messages are transmitted in a timely and reliable manner.
As shown in fig. 2, the method for planning a task network message transmission route of an unmanned aerial vehicle based on deep reinforcement learning according to the embodiment of the invention specifically includes the following implementation steps:
s101, a user inputs an unmanned aerial vehicle task and sends the unmanned aerial vehicle task to a task intention translation module.
S102, the UAV task is standardized by the task intent translation module through knowledge extraction; through demand mapping, the intent is output as the related task message requirement, and the result is sent to the route planning module as the basis for the task intent.
S103, the situation awareness module collects key data of the network situation information, processes and manages the key data to form a resource situation database, and provides a data basis for the routing planning module.
S104, according to the QoS requirements of the message to be transmitted for the current task and the link situation information, the route planning module searches for the optimal transmission path among the multiple reachable paths P(s,d) = {P1, P2, P3, ..., Pn} from the source node s to the destination node d of the network topology graph through a deep reinforcement learning algorithm, and the obtained path meets the QoS requirements of the task message.
S105, if the network situation changes or a link fails, adjustment and optimization are performed to reduce the impact of network dynamics or link failures on network message transmission.
As shown in fig. 3, the specific implementation flow of the task intent translation module in S101 provided in the embodiment of the present invention includes:
The user inputs the task intent in natural language. The knowledge extraction module extracts the key knowledge of the input intent: if the given intent is complete and valid, named entity recognition extracts the key entity information from the irregular Chinese intent, and the translated intent tuple is output to achieve a standardized representation. The strategy mapping module then maps the extracted intent tuple to network demands in combination with the strategy knowledge graph and the network state knowledge graph, obtains the related intent requirements, and sends the result to the route planning module for further configuration in S104.
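The translation step can be illustrated with a toy sketch. The patent specifies named entity recognition over natural-language intents but gives no concrete model, so a rule-based extractor stands in for the NER stage here; all field names and patterns are hypothetical.

```python
import re

# Illustrative stand-in for the named-entity-recognition step: simple regex
# patterns extract key entities from a free-form intent sentence and emit a
# normalized intent tuple. Field names and patterns are assumptions.
PATTERNS = {
    "task": r"(reconnaissance|relay|strike|patrol)",
    "source": r"\bfrom\s+(\w+)",
    "destination": r"\bto\s+(\w+)",
    "deadline_ms": r"within\s+(\d+)\s*ms",
}

def translate_intent(text: str) -> dict:
    """Return the intent tuple extracted from one intent sentence."""
    tuple_ = {}
    for field, pat in PATTERNS.items():
        m = re.search(pat, text, re.IGNORECASE)
        if m:
            tuple_[field] = m.group(1)
    return tuple_

intent = translate_intent("Patrol from UAV3 to Ground1 within 200 ms")
# intent == {'task': 'Patrol', 'source': 'UAV3',
#            'destination': 'Ground1', 'deadline_ms': '200'}
```

In a real system this tuple would then be matched against the strategy and network-situation knowledge graphs to produce the message QoS requirements.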
As shown in fig. 4, a specific implementation flow of the situation awareness module in S103 provided by the embodiment of the present invention includes:
The module comprises six links: data acquisition, data processing and refined characterization, information transmission, information fusion, and situation database generation and maintenance. The situation awareness module first collects network-situation data (network bandwidth, delay, packet loss rate and channel load parameters) during network operation, then refines the collected data, writes it into data packets for transmission, sharing and fusion, and forms link situation information.
The centralized fusion of situation information proceeds as follows. Let G = (V, E) denote the network topology model, where V is the set of all network nodes in the model and E is the set of links between network nodes; each link represents a direct path between two adjacent network nodes. Let s (s ∈ V) be the source node and d (d ∈ V) the destination node, with multiple transmission paths P(s,d) = {P1, P2, P3, ..., Pn}, where Pi = {ei1, ei2, ..., eim}. For any transmission path in the network model, four QoS metric functions are given: the delay function delay(P), bandwidth bandwidth(P), packet loss rate loss(P) and load load(P).
Definition 1 (delay): for any transmission path Pi from source node s to destination node d, the delay is computed as
delay(Pi) = Σ delay(eik), eik ∈ Pi,
where delay(eik) is the delay of each link eik on the path.
Definition 2 (bandwidth): for any transmission path Pi from source node s to destination node d, the bandwidth is computed as
bandwidth(Pi) = min{bandwidth(eik), eik ∈ Pi}.
Definition 3 (packet loss rate): for any transmission path Pi from source node s to destination node d, the packet loss rate is computed as
loss(Pi) = 1 − Π (1 − loss(eik)), eik ∈ Pi.
Definition 4 (load): for any transmission path Pi from source node s to destination node d, with Hop(Pi) the hop count of the path, the load is computed as
load(Pi) = (1 / Hop(Pi)) Σ load(eik), eik ∈ Pi.
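Definitions 1-4 can be sketched directly from per-link measurements. Since the explicit loss and load formulas do not survive in this text, the standard end-to-end loss product and the per-hop mean load are assumed here; the per-link values are illustrative.

```python
from math import prod

# Sketch of the four per-path QoS metrics (Definitions 1-4). The packet-loss
# and load aggregation forms are the standard ones and are assumptions where
# the text leaves the formula implicit.
def path_metrics(links):
    """links: list of dicts with 'delay', 'bandwidth', 'loss', 'load' per link."""
    delay = sum(l["delay"] for l in links)             # Definition 1: sum of link delays
    bandwidth = min(l["bandwidth"] for l in links)     # Definition 2: bottleneck bandwidth
    loss = 1 - prod(1 - l["loss"] for l in links)      # Definition 3: end-to-end loss rate
    load = sum(l["load"] for l in links) / len(links)  # Definition 4: mean load over Hop(Pi) links
    return {"delay": delay, "bandwidth": bandwidth, "loss": loss, "load": load}

p = path_metrics([
    {"delay": 5.0, "bandwidth": 20.0, "loss": 0.01, "load": 0.3},
    {"delay": 7.0, "bandwidth": 10.0, "loss": 0.02, "load": 0.5},
])
# p["delay"] == 12.0 and p["bandwidth"] == 10.0 for this two-hop path
```

In the system, these per-path values would be computed over the link situation information held in the resource situation database.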
Finally, a situation database is generated and dynamically maintained; the link situation information in the database is the data basis of the subsequent route planning module.
As shown in fig. 5, the specific implementation flow of the S104 deep reinforcement learning algorithm provided in the embodiment of the present invention includes:
A Deep Q-Network (DQN) in deep reinforcement learning is used for route planning. Considering that the varied QoS requirements of UAV task-network messages make route planning difficult, the transmission efficiency Lr and communication quality Lq are chosen as the routing objectives. The transmission efficiency Lr is measured by the delay, packet loss rate and bandwidth of the message transmission path, and is formed as a weighted combination of those quantities analogous to Lq; the communication quality Lq is measured by the delay, bandwidth, packet loss rate and load of the links. The objective function Lq is computed as:
Lq = w1·delay(Pi) + w2·bandwidth(Pi) + w3·loss(Pi) + w4·load(Pi),
w1 + w2 + w3 + w4 = 1,
where w1, w2, w3, w4 are link-quality weights.
The state space contains the link situation information and the message transmission requirements and is defined as O = [o1, o2, ..., ok]^T, where on = [delay(eik), bandwidth(eik), loss(eik), load(eik), D, B, Loss, Load]^T is the set of situation information of each link eik on transmission path Pi together with the message transmission requirements of the current task intent. The action space contains the next-hop node information of the message transmission and is defined as an = (next_node_n). To improve the transmission efficiency and communication quality of task message transmission, the reward function is defined as:
R(st,at)=α*Lr+β*Lq
α+β=1
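The reward combination can be sketched numerically. The text gives the weighted form of Lq but elides the explicit Lr formula, so in this sketch Lr is assumed to be an analogous weighted combination of path delay, bandwidth and packet loss rate; all weights and metric values are illustrative.

```python
# Sketch of R(s_t, a_t) = alpha*L_r + beta*L_q over one path's metrics.
# L_q follows the weighted form given in the text; the L_r form is an
# assumption (analogous weighted combination), as are all weight values.
def L_q(m, w=(0.25, 0.25, 0.25, 0.25)):
    return w[0]*m["delay"] + w[1]*m["bandwidth"] + w[2]*m["loss"] + w[3]*m["load"]

def L_r(m, v=(1/3, 1/3, 1/3)):  # assumed form; v sums to 1
    return v[0]*m["delay"] + v[1]*m["bandwidth"] + v[2]*m["loss"]

def reward(m, alpha=0.5, beta=0.5):
    assert abs(alpha + beta - 1.0) < 1e-9  # alpha + beta = 1 per the text
    return alpha * L_r(m) + beta * L_q(m)

m = {"delay": 12.0, "bandwidth": 10.0, "loss": 0.03, "load": 0.4}
r = reward(m)
```

In practice the raw metrics would be normalized to a common scale before weighting, a step the text leaves implicit.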
The agent autonomously perceives the situation of the surrounding environment to realize situation awareness, and adaptively generates effective transmission paths according to planning decisions and model training. To obtain an optimal transmission path while ensuring that the task-intent requirements are met, the system must interact dynamically with the network environment. When acquiring the state space, each sampling interval generates a set of state information. A nonlinear neural-network approximation of the Q function is chosen to mitigate the curse of dimensionality. Each intelligent node has a convolutional neural network (Convolutional Neural Network, CNN) with the same structure, and the state-action values in the high-dimensional continuous state are fitted by training the CNN.
The Q function of node n at time t is expressed as Q_t^n(s_t^n, a), denoting the maximum long-term accumulated reward that node n can obtain by executing action a in state s_t^n. The Q-value update equation is expressed as
Q(s_t, a_t) ← Q(s_t, a_t) + α[R(s_t, a_t) + γ·max_{a'} Q(s_{t+1}, a') − Q(s_t, a_t)],
where α is the learning rate and γ is the discount factor.
To minimize the gap between the estimated Q value and the target Q value, the loss function of the DQN is defined as
L(θ_n) = E[(y_t − Q(s_t, a_t; θ_n))²],
where θ_n is the weight coefficient of the prediction CNN of node n.
The target Q value is
y_t = R(s_t, a_t) + γ·max_{a'} Q(s_{t+1}, a'; θ_n⁻),
where θ_n⁻ is the weight coefficient of the target CNN model.
To update the Q value at each time instant, a CNN composed of an input layer, a convolutional layer (Conv), a flattening layer, a fully connected layer (FC) and an output layer is used. Conv has 16 filters of size 10×4 with a stride of 2 and uses tanh as the activation function; FC has 512 units with a sigmoid activation; the activation function of the output layer is linear. The CNN is trained by gradient descent, and the gradient of the loss function is expressed as
∇_{θ_n} L(θ_n) = E[(Q(s_t, a_t; θ_n) − y_t)·∇_{θ_n} Q(s_t, a_t; θ_n)].
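The target-value, loss and gradient computations can be sketched numerically. For compactness, the prediction and target CNNs are replaced here by a linear approximator Q(s, a) = θ[a]·s, which is purely an assumption; only the target/loss/gradient-step structure mirrors the method, not the network itself.

```python
# Minimal numeric sketch of one DQN update step with a linear Q approximator
# standing in for the prediction/target CNNs (an assumption for brevity).
def q_value(theta, state, action):
    return sum(t * s for t, s in zip(theta[action], state))

def td_target(reward, next_state, theta_target, gamma=0.9):
    # y_t = R + gamma * max_a' Q(s_{t+1}, a'; theta^-)
    return reward + gamma * max(q_value(theta_target, next_state, a)
                                for a in range(len(theta_target)))

def dqn_step(theta, theta_target, s, a, r, s_next, lr=0.01, gamma=0.9):
    y = td_target(r, s_next, theta_target, gamma)
    q = q_value(theta, s, a)
    loss = (y - q) ** 2                       # squared TD error
    # gradient descent on the squared error w.r.t. theta[a]
    theta[a] = [t + lr * 2 * (y - q) * si for t, si in zip(theta[a], s)]
    return loss

theta = [[0.0, 0.0], [0.0, 0.0]]      # prediction weights: 2 actions, 2-dim state
target = [[0.1, 0.1], [0.2, 0.0]]     # frozen target-network weights
loss0 = dqn_step(theta, target, s=[1.0, 1.0], a=0, r=1.0, s_next=[1.0, 1.0])
loss1 = dqn_step(theta, target, s=[1.0, 1.0], a=0, r=1.0, s_next=[1.0, 1.0])
# repeating the step moves Q(s, a) toward the target, so the loss shrinks
```

The same structure applies when θ parameterizes the Conv/FC network described above; only the gradient computation is delegated to automatic differentiation.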
Different experiences in the DQN experience pool influence the agent's policy learning differently; to make full use of high-quality experience and improve model convergence, a prioritized experience replay mechanism is introduced. The priority p_j of each experience sample j is determined from the state estimation error (TD error) δ_j at each time step, and experiences with a large TD error are replayed more frequently. To avoid sample j never being sampled when δ_j is 0, a small positive number ζ is added:
p_j = |δ_j| + ζ
The replay probability P(j) of experience j is:
P(j) = p_j^α / Σ_k p_k^α,
where α is the parameter controlling the degree of prioritization; when α is 0, prioritized experience replay degenerates to uniform random sampling. To mitigate the loss of sample diversity, stochastic prioritized sampling is used. Since biased sampling introduces bias, importance-sampling weights
w_j = (N·P(j))^(−β) / max_i w_i
are used to correct it, so that the policy learned from biased sampling matches that of uniform sampling; for stability, the weights are normalized by the maximum weight value. β is an annealing factor that is annealed from its initial value to 1 by the end of learning, acting together with α to correct the bias, and N is the number of samples.
During training, transmission paths are selected at random for the first K time steps to fill the experience pool with data. The agent then selects actions with a greedy strategy based on the state-space information, and experience samples of the agent-environment interaction are continuously stored in the experience pool so that the agent can fully explore its surroundings. The online decision of the CNN is the optimal transmission path for the task message available at the current moment; in the learning stage, a priority weight is chosen at random from segments partitioned by the number of experience samples, and the experience corresponding to that weight is selected.
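The prioritized replay mechanism can be sketched as follows, using p_j = |δ_j| + ζ, P(j) = p_j^α / Σ_k p_k^α, and importance-sampling weights normalized by the maximum weight. Buffer capacity limits and the segment-based weight selection described above are omitted for brevity; parameter values are illustrative.

```python
import random

# Minimal prioritized experience replay sketch following the formulas above.
class PrioritizedReplay:
    def __init__(self, alpha=0.6, beta=0.4, zeta=1e-6):
        self.alpha, self.beta, self.zeta = alpha, beta, zeta
        self.samples, self.priorities = [], []

    def add(self, experience, td_error):
        # p_j = |delta_j| + zeta, so zero-TD-error samples stay reachable
        self.samples.append(experience)
        self.priorities.append(abs(td_error) + self.zeta)

    def sample(self, k):
        # P(j) = p_j^alpha / sum_k p_k^alpha
        scaled = [p ** self.alpha for p in self.priorities]
        total = sum(scaled)
        probs = [s / total for s in scaled]
        idx = random.choices(range(len(self.samples)), weights=probs, k=k)
        n = len(self.samples)
        # w_j = (N * P(j))^(-beta), normalized by the maximum weight
        weights = [(n * probs[i]) ** (-self.beta) for i in idx]
        w_max = max((n * p) ** (-self.beta) for p in probs)
        return [self.samples[i] for i in idx], [w / w_max for w in weights]

buf = PrioritizedReplay()
buf.add(("s0", "a0", 1.0, "s1"), td_error=0.5)   # high-priority experience
buf.add(("s1", "a1", 0.0, "s2"), td_error=0.05)  # low-priority experience
batch, ws = buf.sample(4)
# the high-TD-error sample dominates the batch; all IS weights are <= 1
```

A production buffer would use a sum-tree for O(log n) sampling and anneal β toward 1 over training, as the text describes.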
The technical effects of the present invention will be described in detail with reference to simulation.
The simulation adopts a modular design: a UAV task network simulation platform is built on Qt 5.15.2, and PyCharm is used to implement the intelligent route planning based on the deep reinforcement learning algorithm. Through the interaction of the two, the QoS requirements of task messages and the network situation information are fully considered, dynamic adaptive route planning is realized, and the result is compared with the AODV and OLSR routing protocols.
As shown in fig. 6, the average route generation time of different task messages in the same network scene is counted, and the average route generation time of the route planning method of the invention is relatively stable and basically controlled within 1.2 seconds. The method can rapidly plan the routing strategy in the dynamic unmanned aerial vehicle task network environment, and has good instantaneity. In contrast, the average route generation time of AODV and OLSR routing protocols fluctuates greatly.
1. For the problem of poor dynamic adaptation, the invention introduces an intent-driven network that translates UAV tasks into message QoS requirements and generates a routing strategy based on the task intent translation module and the situation awareness module. The network state is then monitored continuously in real time; when a change in the network situation or a link failure causes the generated routing strategy to no longer meet the QoS requirements of the message, route planning is performed again to adjust and optimize the strategy, reducing the impact of network dynamics or link failures and improving the dynamic adaptability of route planning.
2. Aiming at the problems of complex calculation and poor model training effect, the invention uses deep reinforcement learning to carry out intelligent route planning, combines the neural network into an intelligent agent in a reinforcement learning framework, solves the high-dimensional decision problem of the unmanned aerial vehicle task network, ensures the accuracy of route planning, improves the network message transmission efficiency, and reduces delay and cost.
It should be noted that the embodiments of the present invention can be realized in hardware, software, or a combination of software and hardware. The hardware portions may be implemented using dedicated logic and the software portions may be stored in a memory for execution by a suitable instruction execution system, such as a microprocessor or dedicated design hardware. Those of ordinary skill in the art will appreciate that the apparatus and methods described above may be implemented using computer executable instructions and/or embodied in processor control code, such as provided on a carrier medium such as a magnetic disk, CD or DVD-ROM, a programmable memory such as read only memory (firmware), or a data carrier such as an optical or electronic signal carrier. The device of the present invention and its modules may be implemented by hardware circuitry, such as very large scale integrated circuits or gate arrays, semiconductors such as logic chips, transistors, etc., or programmable hardware devices such as field programmable gate arrays, programmable logic devices, etc., as well as software executed by various types of processors, or by a combination of the above hardware circuitry and software, such as firmware.
The foregoing is merely illustrative of specific embodiments of the present invention, and the scope of the invention is not limited thereto, but any modifications, equivalents, improvements and alternatives falling within the spirit and principles of the present invention will be apparent to those skilled in the art within the scope of the present invention.

Claims (10)

1.基于深度强化学习的无人机任务网络消息传输路由规划系统,其特征在于,包括任务意图转译模块、态势感知模块和路由规划模块;1. A UAV mission network message transmission routing planning system based on deep reinforcement learning, characterized by including a mission intent translation module, a situational awareness module, and a routing planning module; 所述任务意图转译模块通过知识抽取技术从用户输入的无人机业务或无人机任务中得到任务意图关键信息,并结合策略知识图谱和网络态势知识图谱进行网络需求映射,得到相关任务消息的QoS需求;The mission intent translation module obtains key mission intent information from the drone business or drone mission input by the user through knowledge extraction technology, and combines the strategy knowledge graph and network situation knowledge graph to perform network demand mapping to obtain the QoS requirements of related mission messages; 所述态势感知模块采集网络态势信息的关键数据,并对采集到的链路时延、丢包率、带宽和负载相关数据进行集中式处理和管理,综合评估网络中每条路径的情况,并形成资源态势数据库,为后续的路由规划模块提供依据;The situation awareness module collects key data of network situation information and centrally processes and manages the collected link delay, packet loss rate, bandwidth and load-related data, comprehensively evaluates the situation of each path in the network, and forms a resource situation database to provide a basis for the subsequent routing planning module; 所述路由规划模块通过分析不同网络任务意图中的QoS需求和网络链路的态势信息,使用深度强化学习得到适合网络消息从源节点到目的节点的传输路径,以提高网络消息传输效率、降低延迟和成本;同时当网络态势或者节点状态发生变化时进行调整和优化,保证任务消息传输。The routing planning module analyzes the QoS requirements and network link status information in different network task intentions, and uses deep reinforcement learning to obtain a transmission path suitable for network messages from the source node to the destination node, thereby improving network message transmission efficiency, reducing latency and costs; at the same time, it adjusts and optimizes when the network situation or node status changes to ensure the transmission of task messages. 2.根据权利要求1所述的基于深度强化学习的无人机任务网络消息传输路由规划系统,其特征在于,所述任务意图转译模块中,分为知识抽取模块和需求映射两个子模块;所述知识抽取模块对输入意图进行关键知识抽取;2. 
The deep reinforcement learning-based UAV mission network message transmission routing planning system according to claim 1 is characterized in that the mission intent translation module is divided into two submodules: a knowledge extraction module and a requirement mapping module; the knowledge extraction module extracts key knowledge from the input intent; 输入的意图经过命名实体识别,提取无规则中文意图中的关键实体信息,并输出转译后的意图元组,实现规范化表征;The input intent undergoes named entity recognition to extract key entity information from irregular Chinese intents and outputs the translated intent tuple to achieve standardized representation. 需求映射模块将抽取后形成的所述意图元组结合策略知识图谱和网络态势知识图谱进行网络需求映射,得到相关任务消息需求,并将结果发送至路由规划模块为其提供依据。The demand mapping module combines the extracted intention tuples with the strategy knowledge graph and the network situation knowledge graph to perform network demand mapping, obtains relevant task message requirements, and sends the results to the routing planning module to provide a basis for it. 3.根据权利要求1所述的基于深度强化学习的无人机任务网络消息传输路由规划系统,其特征在于,所述态势感知模块执行数据采集、数据处理与细化表征、信息传输、信息融合、态势数据库的生成与维护6个环节;3. 
The UAV mission network message transmission routing planning system based on deep reinforcement learning according to claim 1 is characterized in that the situational awareness module performs six steps: data acquisition, data processing and detailed representation, information transmission, information fusion, and generation and maintenance of the situation database; 态势感知模块首先实现网络运行过程中链路时延、丢包率、带宽和负载相关数据的采集,实现数据采集;The situational awareness module first realizes the collection of data related to link delay, packet loss rate, bandwidth and load during network operation to achieve data collection; 之后对采集到的数据进行数据处理与细化表征,并写入数据包进行信息传输共享与信息融合处理,形成链路态势信息,目的是综合评估每条路径的情况;The collected data is then processed and refined, and written into data packets for information transmission, sharing, and fusion processing to form link situation information, with the goal of comprehensively evaluating the situation of each path. 最后生成态势数据库并进行动态维护,而数据库中的链路态势信息是后续路由规划模块的数据基础。Finally, a situation database is generated and dynamically maintained, and the link situation information in the database is the data basis for the subsequent routing planning module. 4.根据权利要求1所述的基于深度强化学习的无人机任务网络消息传输路由规划系统,其特征在于,所述路由规划模块中,将网络中的节点集合、任务意图转译模块得到的意图元组、任务消息传输QoS需求信息和态势感知模块得到的链路态势信息作为路由规划模块的初始输入;4. 
The deep reinforcement learning-based UAV mission network message transmission routing planning system according to claim 1, characterized in that the routing planning module uses the node set in the network, the intent tuple obtained by the mission intent translation module, the mission message transmission QoS requirement information, and the link situation information obtained by the situation awareness module as initial inputs to the routing planning module; 路由规划模块依据当前所需传输消息的QoS需求和链路态势信息,通过深度强化学习算法在网络拓扑图的源节点s到目的节点d的多条可达路径P(s,d)={P1,P2,P3,...,Pn}中寻找最优的传输路径,所得路径满足任务消息的QoS需求,并且在链路态势发生变化时对不满足QoS需求的传输路径重新进行规划,确保任务消息实时准确传输。The routing planning module uses a deep reinforcement learning algorithm to search for the optimal transmission path among multiple reachable paths P(s,d)={ P1 , P2 , P3 , ..., Pn } from the source node s to the destination node d in the network topology graph based on the QoS requirements of the current message transmission and link status information. The obtained path meets the QoS requirements of the task message and replans the transmission path that does not meet the QoS requirements when the link status changes, ensuring the real-time and accurate transmission of the task message. 5.根据权利要求1-4任一项所述系统的一种基于深度强化学习的无人机任务网络消息传输路由规划方法,其特征在于,包括以下步骤;5. A method for planning network message transmission routes for UAV missions based on deep reinforcement learning according to the system of any one of claims 1 to 4, characterized in that it comprises the following steps: S101,用户输入无人机任务,发送至任务意图转译模块;S101, the user inputs the drone mission and sends it to the mission intention translation module; S102,无人机任务通过任务意图转译模块,经过知识抽取模块的知识抽取实现规范化表征,经过策略映射模块的需求映射将意图输出为相关任务消息需求,并将结果发送至路由规划模块为其提供依据;S102: The UAV mission is translated into a mission intent by the mission intent translation module, and the knowledge extraction module extracts knowledge to achieve a standardized representation. 
The strategy mapping module outputs the intent as a relevant mission message requirement, and the result is sent to the routing planning module to provide a basis for it. S103,态势感知模块采集网络态势信息的关键数据并进行处理与管理形成资源态势数据库,为路由规划模块为其提供数据基础;S103, the situation awareness module collects key data of network situation information and processes and manages it to form a resource situation database, providing a data basis for the routing planning module; S104,路由规划模块依据当前任务所需传输消息的QoS需求和链路态势信息,通过深度强化学习算法在网络拓扑图的源节点s到目的节点d的多条可达路径P(s,d)={P1,P2,P3,...,Pn}中寻找最优的传输路径,所得路径满足任务消息的QoS需求;S104: Based on the QoS requirements of the message transmission required by the current task and the link status information, the routing planning module uses a deep reinforcement learning algorithm to search for the optimal transmission path from the multiple reachable paths P(s,d) = { P1 , P2 , P3 , ..., Pn } from the source node s to the destination node d in the network topology graph. The resulting path meets the QoS requirements of the task message. S105,若网络态势发生变化或链路出现故障时进行调整和优化,降低网络动态变化或链路故障对网络消息传输的影响。S105 , if the network situation changes or a link fails, adjustments and optimizations are performed to reduce the impact of the network dynamic changes or link failures on network message transmission. 6.根据权利要求5所述的一种基于深度强化学习的无人机任务网络消息传输路由规划方法,其特征在于,所述S101中用户通过前端界面输入无人机任务或业务,然后将其放松至任务意图转译模块进行转译处理。6. A method for network message transmission routing planning for UAV missions based on deep reinforcement learning according to claim 5, characterized in that in S101, the user inputs the UAV mission or business through the front-end interface, and then sends it to the mission intent translation module for translation processing. 7.根据权利要求5所述的一种基于深度强化学习的无人机任务网络消息传输路由规划方法,其特征在于,所述S102具体为:7. 
The method for planning network message transmission routes for UAV missions based on deep reinforcement learning according to claim 5, wherein S102 specifically comprises: 用户以自然语言形式输入任务意图;The user inputs the task intention in natural language; 第一阶段,知识抽取模块对输入意图进行关键知识抽取,若给定的意图是完整且有效的,则可经过命名实体识别,提取无规则中文意图中的关键实体信息,并输出转译后的意图元组,实现规范化表征;In the first stage, the knowledge extraction module extracts key knowledge from the input intent. If the given intent is complete and valid, named entity recognition is performed to extract key entity information from the irregular Chinese intent and output the translated intent tuple to achieve standardized representation. 第二阶段,策略映射模块将抽取后形成的意图元组结合策略知识图谱和网络状态知识图谱进行网络需求映射,得到相关意图需求,并将结果发送至S104中路由规划模块进一步配置。In the second stage, the strategy mapping module combines the extracted intent tuples with the strategy knowledge graph and the network status knowledge graph to map the network requirements, obtain relevant intent requirements, and send the results to the routing planning module in S104 for further configuration. 8.根据权利要求5所述的一种基于深度强化学习的无人机任务网络消息传输路由规划方法,其特征在于,所述S103具体为:8. 
The method for planning network message transmission routes for UAV missions based on deep reinforcement learning according to claim 5, wherein S103 specifically comprises: 态势感知模块包括6个环节,即数据采集、数据处理与细化表征、信息传输、信息融合、态势数据库的生成与维护;The situation awareness module includes six links, namely data collection, data processing and detailed representation, information transmission, information fusion, and generation and maintenance of situation database; 态势感知模块首先采集网络运行过程中网络态势相关数据的采集,实现数据采集,实现数据采集,之后对采集到的数据进行数据处理与细化表征并写入数据包进行信息传输共享与融合处理形成链路态势信息;The situation awareness module first collects data related to the network situation during network operation, realizes data collection, and then processes and refines the collected data and writes it into data packets for information transmission, sharing and fusion processing to form link situation information; 信息融合处理如下:The information fusion process is as follows: 设定G=(V,E)表示网络拓扑模型,其中V表示网络模型中所有网络节点组成的集合,E表示网络节点之间的链路组成的集合,每一条链路表示两个相邻网络节点之间的直达路径;设定源节点为s(s∈V),目的节点为d(d∈V),拥有多条传输路径P(s,d)={P1,P2,P3,...,Pn},其中Pi={ei1,ei2,...,eim};对于网络模型中的任意一条传输路径,给出四种QoS度量函数,分别为时延函数delay(P)、带宽bandwidth(P)、丢包率loss(P)、负载load(P);Assume G = (V, E) to represent the network topology model, where V represents the set of all network nodes in the network model, E represents the set of links between network nodes, and each link represents a direct path between two adjacent network nodes. Assume the source node is s (s∈V) and the destination node is d (d∈V), with multiple transmission paths P(s, d) = {P 1 ,P 2 ,P 3 ,...,P n }, where P i = {e i1 ,e i2 ,...,e im }. For any transmission path in the network model, four QoS metric functions are given: delay (P), bandwidth (P), packet loss (P), and load (P). 
时延定义1:对于任意从源节点s至目的节点d的传输路径Pi,时延计算公式如下:Delay Definition 1: For any transmission path P i from source node s to destination node d, the delay is calculated as follows: 其中,delay(eik)为传输路径Pi上每段链路eik的时延;Where, delay(e ik ) is the delay of each link e ik on the transmission path P i ; 带宽定义2:对于任意从源节点s至目的节点d的传输路径Pi,带宽计算公式如下:Bandwidth Definition 2: For any transmission path P i from source node s to destination node d, the bandwidth is calculated as follows: bandwidth(Pi)=min{bandwidth(eik),eik∈Pi},bandwidth(P i )=min{bandwidth(e ik ),e ik ∈P i }, 其中,bandwidth(eik)为传输路径Pi上每段链路eik的带宽;Where, bandwidth(e ik ) is the bandwidth of each link e ik on the transmission path P i ; 丢包率定义3:对于任意从源节点s至目的节点d的传输路径Pi,丢包率计算公式如下:Packet loss rate definition 3: For any transmission path P i from source node s to destination node d, the packet loss rate is calculated as follows: 其中,loss(eik)为传输路径Pi上每段链路eik的丢包率;Where loss(e ik ) is the packet loss rate of each link e ik on the transmission path P i ; 负载定义4:对于任意从源节点s至目的节点d的传输路径Pi,负载计算公式如下:Load Definition 4: For any transmission path P i from source node s to destination node d, the load calculation formula is as follows: 其中,Hop(Pi)为路径Pi的跳数,load(eik)为传输路径Pi上每段链路eik的负载;Where Hop(P i ) is the number of hops of path P i , load(e ik ) is the load of each link e ik on the transmission path P i ; 最后生成态势数据库并进行动态维护,而数据库中的链路态势信息是后续路由规划模块的数据基础。Finally, a situation database is generated and dynamically maintained, and the link situation information in the database is the data basis for the subsequent routing planning module. 9.根据权利要求8所述的一种基于深度强化学习的无人机任务网络消息传输路由规划方法,其特征在于,所述S104具体为:9. 
The method for planning network message transmission routes for UAV missions based on deep reinforcement learning according to claim 8, wherein S104 specifically comprises: 使用深度强化学习中的深度Q网络进行路由规划,选择传输效率Lr和通信质量Lq作为无人机任务网络消息传输路由规划目标;其中,传输效率Lr通过消息传输路径的时延、丢包率和带宽来衡量,通信质量Lq通过链路的时延、带宽、丢包率和负载来衡量;计算目标函数Lr和LqUsing the deep Q-network in deep reinforcement learning for route planning, we select transmission efficiency L r and communication quality L q as the objectives of message transmission routing planning for the UAV mission network. Transmission efficiency L r is measured by the message transmission path delay, packet loss rate, and bandwidth, and communication quality L q is measured by the link delay, bandwidth, packet loss rate, and load. Calculate the objective functions L r and L q as follows: Lq=w1*delay(Pi)+w2*bandwidth(Pi)+w3*loss(Pi)+w4*load(Pi),L q =w 1 *delay(P i )+w 2 *bandwidth(P i )+w 3 *loss(P i )+w 4 *load(P i ), w1+w2+w3+w4=1,w 1 +w 2 +w 3 +w 4 = 1, 其中,loss(Pi)表示传输路径的丢包率,bandwidth(Pi)表示传输路径的带宽,delay(Pi)表示传输路径的时延,load(Pi)表示传输路径的负载,w1,w2,w3,w4表示链路质量的权重;Where loss(P i ) represents the packet loss rate of the transmission path, bandwidth(P i ) represents the bandwidth of the transmission path, delay(P i ) represents the delay of the transmission path, load(P i ) represents the load of the transmission path, w 1 ,w 2 ,w 3 ,w 4 represent the weights of the link quality; 状态空间包含链路态势信息和消息传输需求,定义为O=[o1,o2,...,ok]T,其中on=[delay(eik),bandwidth(eik),loss(eik),load(eik),D,B,Loss,Load]T是传输路径Pi上每段链路eik的态势信息和当前任务意图的消息传输需求的集合;动作空间包含消息传输的下一跳节点信息,定义为an=(next_noden);奖励函数R(st,at)定义为:The state space contains link status information and message transmission requirements, and is defined as O = [o 1 ,o 2 ,…, ok ] T , where o n = [delay(e ik ), bandwidth(e ik ), loss(e ik ), load(e ik ), D, B, Loss, Load] T is the set of status information of each link e ik on the transmission path P i and the message transmission requirements of the current task intention; the action space contains 
the next hop node information of the message transmission, and is defined as a n = (next_node n ); the reward function R(s t ,a t ) is defined as: R(st,at)=α*Lr+β*Lq,R(s t ,a t )=α*L r +β*L q , α+β=1,α+β=1, 其中,α和β为传输效率Lr和通信质量Lq的权重;Among them, α and β are the weights of transmission efficiency L r and communication quality L q ; 智能体自主感知周围环境态势实现态势感知,根据规划决策和模型训练,自适应生成有效的传输路径;在获取状态空间时,每个采样间隔会产生一组状态信息;并选择神经网络的非线性逼近方法来逼近Q函数,每个智能节点有一个具有相同结构的卷积神经网络,通过训练CNN,拟合高维连续状态下的状态动作值;The intelligent agent autonomously perceives the surrounding environment to achieve situational awareness and adaptively generates effective transmission paths based on planning decisions and model training. When acquiring the state space, each sampling interval generates a set of state information. A nonlinear approximation method of a neural network is selected to approximate the Q function. Each intelligent node has a convolutional neural network with the same structure. By training the CNN, the state action value under the high-dimensional continuous state is fitted. 
The Q function of node n at time t is expressed as Q_t^n(s_t^n, a), which denotes the maximum long-term accumulated reward that node n can obtain by performing action a in state s_t^n. The Q-value update equation is expressed as

Q_{t+1}^n(s_t^n, a_t) = Q_t^n(s_t^n, a_t) + α[R(s_t, a_t) + γ*max_{a'} Q_t^n(s_{t+1}^n, a') - Q_t^n(s_t^n, a_t)],

where α is the learning rate and γ is the discount factor.

To minimize the gap between the estimated Q value and the target Q value, the loss function of the DQN is defined as

L(θ_n) = E[(y_t - Q(s_t^n, a_t; θ_n))^2],

where θ_n is the weight coefficient of the prediction CNN of node n. The target Q value is

y_t = R(s_t, a_t) + γ*max_{a'} Q(s_{t+1}^n, a'; θ_n^-),

where θ_n^- is the weight coefficient of the target CNN model.

A CNN consisting of an input layer, convolutional layers (Conv), a flatten layer, fully connected layers (FC), and an output layer is used to update the Q value at each time step; tanh is used as the activation function of the convolutional layers, sigmoid as the activation function of the FC layers, and a linear activation function at the output layer. The CNN is trained by gradient descent, and the gradient of the loss function is expressed as

∇_{θ_n} L(θ_n) = E[(y_t - Q(s_t^n, a_t; θ_n)) * ∇_{θ_n} Q(s_t^n, a_t; θ_n)].

The priority p_j of each experience sample j is determined by the state-estimation error δ_j at each time step, so that experiences with large state-estimation errors are replayed more frequently; a small positive number ζ is added so that an experience sample j with δ_j = 0 can still be sampled:

p_j = |δ_j| + ζ.

The replay probability P(j) of experience j is:

P(j) = p_j^α / Σ_k p_k^α,

where α is a parameter controlling the degree of prioritization; when α = 0, prioritized experience replay degenerates to uniform random sampling. Stochastic priority sampling is used to alleviate the loss of sampling diversity, and importance-sampling weights w_j are used to correct the bias:

w_j = (n*P(j))^(-β) / max_i (n*P(i))^(-β),

normalized by the maximum weight value; β is an annealing factor that is annealed from its initial value to 1 by the end of learning and works together with α to correct the bias, and n is the number of samples.

During training, transmission paths are selected at random in the first K time steps to fill the experience pool with data; thereafter the agent selects actions with a greedy policy based on the state-space information and continuously stores experience samples of its interaction with the environment in the experience pool, so that the agent can fully explore the surrounding environment. The online decision result of the CNN is the optimal transmission path for the mission message obtainable at the current time. In the learning phase, the priority weights are partitioned into segments according to the number of experience samples, a weight is selected at random, and the experience corresponding to that weight is replayed.

10. The method for planning network message transmission routes for UAV missions based on deep reinforcement learning according to claim 9, wherein S105 specifically comprises:

The network state is monitored in real time; when the network situation changes or a link fails so that the transmission path no longer meets the QoS requirements of mission message transmission, the DQN, combined with the current network situation information, re-plans the route and adjusts and optimizes the transmission path, reducing the impact of dynamic network changes or link failures and giving the network real-time decision-making and optimization capability.
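As a minimal, self-contained sketch of the prioritized experience replay mechanism described in claim 9 (the α, β, and ζ values below are illustrative defaults, not taken from the claims):

```python
import numpy as np


def per_probabilities(td_errors, alpha=0.6, zeta=1e-6):
    """Priorities p_j = |delta_j| + zeta and replay probabilities
    P(j) = p_j**alpha / sum_k p_k**alpha (alpha = 0 reduces to uniform sampling)."""
    priorities = np.abs(np.asarray(td_errors, dtype=float)) + zeta
    scaled = priorities ** alpha
    return scaled / scaled.sum()


def importance_weights(probs, beta=0.4):
    """Importance-sampling weights w_j = (n * P(j))**(-beta), normalized by
    the maximum weight; beta is annealed toward 1 over the course of training."""
    n = len(probs)
    w = (n * np.asarray(probs, dtype=float)) ** (-beta)
    return w / w.max()
```

Samples with large state-estimation error δ_j receive higher replay probability, while the importance weights down-weight exactly those over-sampled transitions so that the gradient estimate remains approximately unbiased as β approaches 1.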
CN202511026371.1A 2025-07-24 2025-07-24 UAV mission network message transmission routing planning system and method based on deep reinforcement learning Pending CN120639690A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202511026371.1A CN120639690A (en) 2025-07-24 2025-07-24 UAV mission network message transmission routing planning system and method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202511026371.1A CN120639690A (en) 2025-07-24 2025-07-24 UAV mission network message transmission routing planning system and method based on deep reinforcement learning

Publications (1)

Publication Number Publication Date
CN120639690A true CN120639690A (en) 2025-09-12

Family

ID=96973234

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202511026371.1A Pending CN120639690A (en) 2025-07-24 2025-07-24 UAV mission network message transmission routing planning system and method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN120639690A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120722758A (en) * 2025-08-27 2025-09-30 中国人民解放军国防科技大学 Explainable reinforcement learning decision system and method based on large language model enhancement


Similar Documents

Publication Publication Date Title
Liu et al. Deep reinforcement learning for communication flow control in wireless mesh networks
CN111010294B (en) Electric power communication network routing method based on deep reinforcement learning
Hou et al. Reliable computation offloading for edge-computing-enabled software-defined IoV
Qi et al. Knowledge-driven service offloading decision for vehicular edge computing: A deep reinforcement learning approach
CN109990790B (en) Method and device for path planning of unmanned aerial vehicle
CN116938810A (en) A deep reinforcement learning SDN intelligent routing optimization method based on graph neural network
CN114221691A (en) Software-defined air-space-ground integrated network route optimization method based on deep reinforcement learning
CN117896306A (en) A computing power routing method and system based on deep reinforcement learning and graph neural network
Paul et al. Digital twin-aided vehicular edge network: A large-scale model optimization by quantum-DRL
Jarwan et al. Edge-based federated deep reinforcement learning for IoT traffic management
Liu et al. Green mobility management in UAV-assisted IoT based on dueling DQN
WO2022028926A1 (en) Offline simulation-to-reality transfer for reinforcement learning
Ren et al. End-to-end network SLA quality assurance for C-RAN: A closed-loop management method based on digital twin network
CN120639690A (en) UAV mission network message transmission routing planning system and method based on deep reinforcement learning
Jin et al. A congestion control method of SDN data center based on reinforcement learning
CN116599904A (en) Parallel transmission load balancing device and method
Tao et al. Generative AI-aided vertical handover decision in SAGIN for IoT with integrated sensing and communication
Liu et al. MoEI: Mobility-aware edge inference based on model partition and service migration
Yang et al. Knowledge-defined edge computing networks assisted long-term optimization of computation offloading and resource allocation strategy
KR102537023B1 (en) Method for controlling network traffic based traffic analysis using AI(artificial intelligence) and apparatus for performing the method
Yun et al. Remote estimation for dynamic IoT sources under sublinear communication costs
Li et al. H-BILSTM: a novel bidirectional long short term memory network based intelligent early warning scheme in mobile edge computing (MEC)
CN114520991B (en) Unmanned aerial vehicle cluster-based edge network self-adaptive deployment method
Xiong et al. Deep learning traffic prediction to optimize routing paths and reduce latency in SDN
Meng et al. Intelligent routing orchestration for ultra-low latency transport networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination