CN120639690A

CN120639690A - UAV mission network message transmission routing planning system and method based on deep reinforcement learning

Info

Publication number: CN120639690A
Application number: CN202511026371.1A
Authority: CN
Inventors: 杨春刚; 吴涵; 李彤; 李紫璇; 李�杰
Original assignee: Xidian University
Current assignee: Xidian University
Priority date: 2025-07-24
Filing date: 2025-07-24
Publication date: 2025-09-12

Abstract

A deep reinforcement learning-based network message transmission routing planning system and method for drone missions includes the following steps: S101: A user inputs a drone mission and sends it to a mission intent translation module; S102: The drone mission is extracted and standardized through knowledge extraction in the mission intent translation module, and the intent is output as relevant mission message requirements through demand mapping in the strategy mapping module; S103: The situation awareness module collects key data from network situation information and processes and manages it to form a resource situation database; S104: The routing planning module searches for the optimal transmission path based on the QoS requirements and link situation information of the message transmission required for the current mission; S105: Adjustments and optimizations are performed if the network situation changes or a link fails. This invention improves network message transmission efficiency, reduces latency, and reduces costs. Furthermore, adjustments and optimizations are performed when the network situation or node status changes, ensuring timely and reliable transmission of mission messages.

Description

Unmanned aerial vehicle task network message transmission route planning system and method based on deep reinforcement learning

Technical Field

The invention belongs to the technical field of unmanned aerial vehicle task network communication, and particularly relates to an unmanned aerial vehicle task network message transmission route planning system and method based on deep reinforcement learning.

Background

Rapid advances in aviation, electronic, automatic control and artificial intelligence technology have facilitated rapid growth of unmanned aerial vehicle technology. Unmanned aerial vehicles are widely used in various civil and military fields, such as real-time surveillance, search and rescue, military reconnaissance, and hazardous location inspection. With the development of unmanned aerial vehicle technology and the advent of numerous miniaturized unmanned aerial vehicles, unmanned aerial vehicle applications are increasingly tending to perform work with multiple unmanned aerial vehicles or clusters. The cooperating unmanned aerial vehicles communicate using wireless communication devices and are dynamically networked in an ad hoc manner. Along with the increasingly strict task requirements and the increasingly complex environment, the number of nodes required by the unmanned aerial vehicle ad hoc network has greatly increased, and the characteristics of large network scale and extremely strong dynamic property are formed. The method is mainly characterized in that the number of messages transmitted between nodes is increased sharply, so that the communication burden of a network is increased, the routing overhead is increased remarkably, the efficiency and accuracy of the unmanned aerial vehicle for executing tasks are affected undoubtedly, and meanwhile, higher requirements are put on the high efficiency and dynamic adaptability of a routing algorithm, and the method aims at adapting to a rapidly-changing environment.

Through the above analysis, the problems and defects existing in the prior art are as follows:

(1) The prior art-route planning is performed by a traditional routing algorithm based on static network topology. The traditional routing algorithm mainly depends on a pre-constructed network topological graph, and performs routing selection according to network situation parameters, so that an optimal transmission path can be effectively selected for task messages, and the method has higher reliability and efficiency in a stable network environment. However, the conventional routing algorithm has significant limitations and insufficient adaptability to complex environment changes. Under the scene of unmanned aerial vehicle task network, the topological structure and the communication link moment of network change, and traditional routing algorithm is difficult to effectively catch these changes, can't realize dynamic self-adaptation, and then has influenced task message transmission's reliability and timeliness.

(2) In the second prior art, the network behavior and the response mode are continuously learned through a reinforcement learning algorithm so as to carry out message transmission route planning. When encountering dynamic change of the network or link failure, the message transmission path can be adjusted and optimized in real time, so that the task message can be reliably and timely transmitted, and the influence of the network state change on the message transmission is reduced. However, a limitation of this technique is that the high dynamics of the unmanned mission network results in a high level of complexity in its state, which is difficult to solve with conventional reinforcement learning methods. The dependence of a large amount of real-time data is strong, the calculation is complex, the model training effect is poor, the accuracy of route planning cannot be guaranteed, and therefore the overall task efficiency is affected.

1. The prior art relies on a pre-constructed network topological graph, shows poor adaptability when facing the dynamic change of an unmanned aerial vehicle task network, and is difficult to adjust and optimize paths in real time. Network dynamics and link failures may significantly impact the effectiveness of the original message transmission path, thereby affecting efficient and accurate transmission of task messages.

2. The method has the advantages of complex calculation and poor model training effect, and the prior art is highly dependent on real-time data and complex route planning models. This dependency makes the system error-prone in case of insufficient data or poor data quality, resulting in an erroneous planning of the message transmission path. Under a high dynamic or complex environment, the training of the model has a dimension disaster, the training effect can be reduced, the effect of route planning decision is affected, and the method is not suitable for the requirements of unmanned aerial vehicle task network on high efficiency and accuracy.

Disclosure of Invention

In order to overcome the defects in the prior art, the invention aims to provide an unmanned aerial vehicle task network message transmission route planning system and method based on deep reinforcement learning, which use DQN in the deep reinforcement learning to carry out route planning and solve the problem of 'dimension disaster' caused by high dynamic performance of an unmanned aerial vehicle task network. The method realizes the double driving of the intention situation, and under the condition of limited network resources, in order to ensure the delivery of high-value task information, the transmission path suitable for the network information from the source node to the destination node is obtained by analyzing the QoS requirements and the situation information of links in the task intention of the task network of different unmanned aerial vehicles by using the DQN, so as to improve the transmission efficiency of the network information and reduce the delay and the cost. And meanwhile, when the network situation or the node state changes, the adjustment and the optimization are carried out, so that the task message is ensured to be transmitted timely and reliably.

In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:

the unmanned aerial vehicle task network message transmission route planning system based on deep reinforcement learning comprises a task intention translation module, a situation awareness module and a route planning module;

The task intention translation module obtains task intention key information from unmanned aerial vehicle service or unmanned aerial vehicle task input by a user through a knowledge extraction technology, and performs network demand mapping by combining a strategy knowledge graph and a network situation knowledge graph to obtain QoS demands of related task messages;

the situation awareness module collects key data of network situation information, performs centralized processing and management on the collected link delay, packet loss rate, bandwidth and load related data, comprehensively evaluates the situation of each path in the network, forms a resource situation database and provides a basis for a subsequent route planning module;

The routing planning module obtains a transmission path suitable for network information from a source node to a destination node by analyzing QoS requirements in different network task intents and situation information of network links and using deep reinforcement learning so as to improve the transmission efficiency of the network information and reduce delay and cost, and meanwhile, when the network situation or the node state changes, the routing planning module adjusts and optimizes the routing planning module so as to ensure that the task information is transmitted reliably in time.

The task intention translation module is divided into a knowledge extraction module and a demand mapping module, wherein the knowledge extraction module carries out key knowledge extraction on input intention;

the input intention can be identified through a named entity, key entity information in irregular Chinese intention is extracted, and translated intention tuples are output, so that normalized representation is realized;

And the demand mapping module performs network demand mapping on the extracted intention tuple in combination with a strategy knowledge graph and a network situation knowledge graph to obtain related task message demands, and sends the result to the route planning module to provide basis for the route planning module.

Further, the situation awareness module comprises 6 links of data acquisition, data processing and refinement characterization, information transmission, information fusion and generation and maintenance of a situation database;

the situation awareness module firstly collects network situation related data (parameters such as network bandwidth, time delay, packet loss rate, channel load and the like) in the network operation process, and data collection is realized;

Then, carrying out data processing and refinement characterization on the acquired data, writing the data into a data packet, and carrying out information transmission sharing and information fusion processing to form link situation information, so as to comprehensively evaluate the situation of each path;

And finally, generating a situation database and performing active maintenance, wherein the link situation information in the database is the data base of the subsequent route planning module.

Further, in the route planning module, a node set in the network, an intention tuple obtained by the task intention translation module, task message transmission QoS requirement information and link situation information obtained by the situation awareness module are used as initial inputs of the route planning module;

The route planning module searches an optimal transmission path in a plurality of reachable paths P (s, d) = { P ₁,P₂,P₃,...,P_n } from a source node s to a destination node d of the network topology graph through a deep reinforcement learning algorithm according to the QoS requirement and link situation information of the current required transmission message, the obtained path meets the QoS requirement of the task message, and when the link situation changes, the transmission path which does not meet the QoS requirement is planned again in time through the algorithm, so that the real-time accurate transmission of the task message is ensured.

Further, the deep reinforcement learning algorithm in the routing planning module performs routing planning by using the DQN, selects transmission efficiency and communication quality as planning targets, wherein the transmission efficiency is measured by time delay, packet loss rate and bandwidth of message transmission, and the communication quality is measured by time delay, bandwidth, packet loss rate, load and jitter of a link;

The state space of the deep reinforcement learning algorithm comprises link situation information and task message demands, the action space comprises a next-hop node for message transmission, an intelligent body autonomously perceives the situation of the surrounding environment to realize situation perception, an effective transmission path is adaptively generated according to planning decision and model training, in order to obtain an optimal transmission path, the task intention demands are ensured to be met, the system needs to dynamically interact with a network environment, a group of state information is generated at each sampling interval when the state space is acquired, the state information quantity is huge due to high dynamic mobility of an unmanned plane, and a nonlinear approximation method of a neural network is selected to approximate a Q function, so that dimension disasters are solved. Each intelligent node has a convolutional neural network (Convolutional Neural Networks, CNN) with the same structure, and the state action values under the high-dimensional continuous state are fitted by training the CNN.

Different experiences in the experience pool of the algorithm have different influences on the strategy learning of the intelligent agent, and a priority experience playback mechanism is introduced for fully utilizing high-quality experiences to improve model convergence efficiency. The priority of each empirical sample is determined based on the state estimation error (TD error) of the sample at each time step in the DQN algorithm. The transmission path is randomly selected in the early stage of the training process, and the experience pool is filled with data. And then the intelligent agent selects actions by using a greedy strategy based on the information of the state space, and experience samples of the interaction of the intelligent agent and the environment are continuously stored in an experience pool, so that the environment can be fully explored. The on-line decision result of the CNN is the optimal transmission path of the task message available at the current moment, and in the learning stage, the priority weight is randomly selected after being segmented according to the experience sampling number, and the experience corresponding to the priority weight is selected.

An unmanned aerial vehicle task network message transmission route planning method based on deep reinforcement learning comprises the following steps of;

S101, a user inputs an unmanned aerial vehicle task and sends the unmanned aerial vehicle task to a task intention translation module;

s102, an unmanned aerial vehicle task realizes standardized representation through a task intention translation module and knowledge extraction through a knowledge extraction module, outputs intention as related task message requirement through requirement mapping of a strategy mapping module, and sends a result to a route planning module to provide basis for the task intention;

S103, the situation awareness module collects key data of the network situation information, processes and manages the key data to form a resource situation database, and provides a data basis for the routing planning module;

S104, the route planning module searches an optimal transmission path in a plurality of reachable paths P (S, d) = { P ₁,P₂,P₃,...,P_n } from a source node S to a destination node d of the network topological graph through a deep reinforcement learning algorithm according to QoS requirements and link situation information of a message required to be transmitted by a current task, and the obtained paths meet the QoS requirements of the task message;

and S105, if the network situation changes or the link fails, the adjustment and optimization are carried out, and the influence of the network dynamic change or the link failure on the network message transmission is reduced.

In S101, the user inputs the unmanned aerial vehicle task or service through the front-end interface, and then relaxes the unmanned aerial vehicle task or service to the task intention translation module for translation.

The step S102 specifically includes:

The user inputs task intention in a natural language form;

The first stage, the knowledge extraction module extracts key knowledge of input intents, if the given intents are complete and effective, key entity information in random Chinese intents can be extracted through named entity recognition, and translated intention tuples are output to realize normalized representation;

And in the second stage, the strategy mapping module performs network demand mapping on the extracted intention tuples in combination with the strategy knowledge graph and the network state knowledge graph to obtain related intention demands, and sends the result to the routing planning module for further configuration in S104.

The step S103 specifically includes:

the situation awareness module comprises 6 links, namely data acquisition, data processing and refinement characterization, information transmission, information fusion and generation and maintenance of a situation database;

The situation awareness module firstly collects network situation related data (network bandwidth, time delay, packet loss rate and channel load parameters) in the network operation process, data collection is achieved, then data processing and refinement characterization are carried out on the collected data, and data packets are written into to carry out information transmission sharing and fusion processing to form link situation information;

The information fusion process is as follows:

Setting G= (V, E) to represent a network topology model, wherein V represents a set of all network nodes in the network model, E represents a set of links between network nodes, each link represents a direct path between two adjacent network nodes, setting a source node as s (s epsilon V), a destination node as d (d epsilon V), and having a plurality of transmission paths P (s, d) = { P ₁,P₂,P₃,...,P_n }, wherein P _i＝{e_i1,e_i2,...,e_im }, and giving four QoS metric functions, namely a delay function delay (P), a bandwidth (P), a packet loss rate loss (P) and a load (P), to any one transmission path in the network model;

definition 1 (delay) for any transmission path P _i from source node s to destination node d, the delay calculation formula is as follows:

Delay (e _ik) is the time delay of each segment of link e _ik on the transmission path P _i;

definition 2 (bandwidth) for any transmission path P _i from source node s to destination node d, the bandwidth calculation formula is as follows:

bandwidth(P_i)＝min{bandwidth(e_ik),e_ik∈P_i},

wherein bandwidth (e _ik) is the bandwidth of each segment of link e _ik on transmission path P _i;

definition 3 (packet loss rate) for any transmission path P _i from the source node s to the destination node d, the packet loss rate calculation formula is as follows:

wherein loss (e _ik) is a packet loss rate of each segment of link e _ik on the transmission path P _i;

Definition 4 (load) for any transmission path P _i from source node s to destination node d, the load calculation formula is as follows:

Wherein Hop (P _i) is the Hop count of path P _i, load (e _ik) is the load of each segment of link e _ik on transmission path P _i;

The step S104 specifically includes:

Performing route planning by using a Deep Q-Network (DQN) in Deep reinforcement learning, taking the problem of difficult route planning caused by various QoS requirements of unmanned aerial vehicle task Network task messages into consideration, selecting transmission efficiency Lr and communication quality Lq as unmanned aerial vehicle task Network message transmission route planning targets, wherein the transmission efficiency Lr is measured by time delay, packet loss rate and bandwidth of a message transmission path, the communication quality Lq is measured by time delay, bandwidth, packet loss rate and load of a link, and calculating objective functions L _r and L _q:

L_q＝w₁*delay(P_i)+w₂*bandwidth(P_i)+w₃*loss(P_i)+w₄*load(P_i),

w₁+w₂+w₃+w₄=1,

Where loss (P _i) represents a packet loss rate of a transmission path, bandwidth (P _i) represents a bandwidth of the transmission path, delay (P _i) represents a delay of the transmission path, load (P _i) represents a load of the transmission path, and w ₁,w₂,w₃,w₄ represents a weight of link quality.

The state space contains link situation information and message transmission requirements, which are defined as O= [ O ₁,o₂,...,o_k]^T, wherein o_n＝[delay(e_ik),bandwidth(e_ik),loss(e_ik),load(e_ik),D,B,Loss,Load]^T is a set of situation information of each link e _ik on a transmission path P _i and message transmission requirements of current task intention, the action space contains next hop node information of message transmission, which is defined as a _n＝(next_node_n), and the reward function R (s _t,a_t) is defined as:

R(s_t,a_t)＝α*L_r+β*L_q,

α+β=1,

Wherein α and β are weights of the transmission efficiency L _r and the communication quality L _q;

The intelligent agent autonomously perceives the situation of the surrounding environment to realize situation perception, and generates an effective transmission path in a self-adaptive manner according to planning decision and model training, so as to obtain an optimal transmission path, ensure that the task intention needs are met, the system needs to dynamically interact with the network environment, generate a group of state information at each sampling interval when acquiring a state space, select a nonlinear approximation method of a neural network to approximate a Q function so as to solve a dimensional disaster, and fit a state action value under a high-dimensional continuous state by training the CNN;

the Q function of node n at time t is expressed as Indicating that node n is in stateThe maximum long-term accumulated prize value obtainable by executing action a below, the Q value update equation is expressed as

Wherein α is the learning rate and γ is the discount factor;

to minimize the gap between the estimated Q value and the target Q value, the loss function of DQN is defined as

Wherein, the The weight coefficient of the predicted CNN of the node n;

target Q value is

Wherein, the Is the weight coefficient of the target CNN model;

The Q value at each moment is updated using a CNN consisting of an input layer, a convolutional layer (Conv), a flattening layer, a fully connected layer (FC) and an output layer, wherein Conv has 16 filters of 10 x 4 size and 2 steps, tan h is used as an activation function, FC has 512 units, the activation function is sigmoid, and the activation function of the output layer is a linear activation function. The training process of CNN adopts gradient descent method, and the gradient of loss function is expressed as

Different experiences in the DQN experience pool have different influences on agent strategy learning, and a priority experience playback mechanism is introduced for fully utilizing high-quality experience to improve model convergence efficiency. The priority pj of each empirical sample j is determined from a state estimation error (TD error) δ _j for each time step. Experiences with large TD error will be played back more frequently. To avoid that the empirical sample j cannot be sampled due to delta _j being 0, a small positive number ζ is added to it

p_j＝|δ_j|+ζ

The playback probability p (j) of experience j is:

Wherein, alpha is a parameter for controlling priority, when alpha is 0, priority experience playback is changed into random sampling, a random priority sampling mode is used for relieving the loss of sampling diversity, a bias is introduced for biased sampling, a bias is corrected by using an important sampling weight w _j for ensuring that a biased sampling learning strategy is the same as a uniform sampling strategy, the maximum weight value is used for normalization in consideration of stability, beta is an annealing factor, beta is annealed from an initial value to 1 at the end of learning and acts together with alpha to correct the bias, and n is the sampling number.

The method comprises the steps of training, selecting transmission paths randomly at the first K moments to fill data into an experience pool, selecting actions by an agent based on information of a state space by using a greedy strategy, continuously storing experience samples of interaction between the agent and the environment in the experience pool to enable the agent to fully explore the surrounding environment, and selecting priority weights randomly after being segmented according to experience sampling numbers in a learning stage, wherein an online decision result of CNN is the optimal transmission path of task information available at the current moment.

The specific steps of S105 are as follows:

When the network situation changes or the link fails and the transmission path no longer meets the QoS requirement of task message transmission, the DQN is used for carrying out routing planning again in combination with the current network situation information and adjusting and optimizing the transmission path so as to reduce the influence caused by the dynamic change of the network or the link failure and enable the network to have real-time decision making and optimizing capability.

A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing steps in a deep reinforcement learning based unmanned aerial vehicle task network message transmission route planning method when executing the computer program.

An information data processing system provides a machine learning environment, has self-learning and self-generating capabilities, and provides an intelligent foundation for the method.

The invention has the beneficial effects that:

The invention can realize deep understanding of the task intention of the unmanned aerial vehicle, automatically plan the message transmission path according to the need, realize accurate and efficient distribution of the task message of the unmanned aerial vehicle, adaptively adjust the task message distribution path of the unmanned aerial vehicle according to the network environment which changes in real time, improve the efficiency of the unmanned aerial vehicle task network and meet the high-performance requirement of the task under the dynamic environment. Meanwhile, management personnel or users can be helped to adjust network resource deployment in time, and therefore adaptability and reliability of the unmanned aerial vehicle task network are improved.

The present invention introduces a priority experience playback mechanism. The priority of each empirical sample is determined based on a state estimation error (TD error) for each time step. The transmission path is randomly selected in the early stage of the training process, and the experience pool is filled with data. And then the intelligent agent selects actions by using a greedy strategy based on the information of the state space, and experience samples of the interaction of the intelligent agent and the environment are continuously stored in an experience pool, so that the environment can be fully explored. The on-line decision result of the CNN is the optimal transmission path of the task message available at the current moment, and in the learning stage, the priority weight is randomly selected after being segmented according to the experience sampling number, and the experience corresponding to the priority weight is selected.

Drawings

FIG. 1 is a flow chart of the system of the present invention.

Fig. 2 is a specific implementation of the method of the present invention.

FIG. 3 is a flow chart showing the task intent translation module according to the present invention.

Fig. 4 is a flow chart of an embodiment of the situation awareness module of the present invention.

FIG. 5 is a flow chart of an embodiment of the deep reinforcement learning algorithm of the present invention.

Fig. 6 is a schematic diagram comparing the present invention with AODV and OLSR routing protocols.

Detailed Description

The invention is described in further detail below with reference to the accompanying drawings.

Aiming at the problems in the prior art, the invention provides an unmanned aerial vehicle task network message transmission route planning system and method based on deep reinforcement learning, and the invention is described in detail below with reference to the accompanying drawings.

As shown in fig. 1, the task intent translation module, the situation awareness module and the route planning module are included. The task intention translation module obtains task intention key information through a knowledge extraction technology, and performs network demand mapping by combining a strategy knowledge graph and a network situation knowledge graph to obtain related task message demands. The situation awareness module collects key data of network situation information, performs centralized processing and management on the collected data, comprehensively evaluates the situation of each path in the network, and forms a resource situation database so as to provide basis for a subsequent route planning module. The routing planning module obtains a transmission path suitable for network messages from a source node to a destination node by analyzing QoS requirements and situation information of links in different network task intents and using a deep reinforcement learning algorithm so as to improve the transmission efficiency of the network messages and reduce delay and cost. And meanwhile, when the network situation or the node state changes, the adjustment and the optimization are carried out, so that the task message is ensured to be transmitted timely and reliably.

As shown in fig. 2, the method for planning a task network message transmission route of an unmanned aerial vehicle based on deep reinforcement learning according to the embodiment of the invention specifically includes the following implementation steps:

s101, a user inputs an unmanned aerial vehicle task and sends the unmanned aerial vehicle task to a task intention translation module.

S102, the unmanned aerial vehicle task realizes standardization representation through a task intention translation module and knowledge extraction, outputs intention as related task message requirement through requirement mapping, and sends the result to a route planning module to provide basis for the task intention.

S103, the situation awareness module collects key data of the network situation information, processes and manages the key data to form a resource situation database, and provides a data basis for the routing planning module.

S104, the route planning module searches an optimal transmission path in a plurality of reachable paths P (S, d) = { P ₁,P₂,P₃,...,P_n } from a source node S to a destination node d of the network topological graph through a deep reinforcement learning algorithm according to QoS requirements and link situation information of a message required to be transmitted by a current task, and the obtained paths meet the QoS requirements of the task message.

As shown in fig. 3, the specific implementation flow of the task intent translation module in S101 provided in the embodiment of the present invention includes:

The user inputs task intents in a natural language form, a knowledge extraction module extracts key knowledge of the input intents, if the given intents are complete and effective, the key entity information in the irregular Chinese intents can be extracted through named entity recognition, translated intention tuples are output to realize standardized characterization, and a strategy mapping module carries out network demand mapping on the intention tuples formed after extraction in combination with a strategy knowledge map and a network state knowledge map to obtain related intention demands, and the result is sent to a routing planning module for further configuration in S104.

As shown in fig. 4, a specific implementation flow of the situation awareness module in S103 provided by the embodiment of the present invention includes:

And 6 links, namely data acquisition, data processing and refinement characterization, information transmission, information fusion, and generation and maintenance of a situation database. The situation awareness module firstly collects network situation related data (network bandwidth, time delay, packet loss rate and channel load parameters) in the network operation process, and then performs fine characterization on the collected data and writes the data into a data packet to perform transmission sharing and fusion processing to form link situation information.

The centralized fusion process of the situation information is as follows, setting g= (V, E) to represent a network topology model, wherein V represents a set of all network node components in the network model, E represents a set of link components between network nodes, each link representing a direct path between two adjacent network nodes. Let s (s e V) be the source node and d (d e V) be the destination node, have multiple transmission paths P (s, d) = { P ₁,P₂,P₃,...,P_n }, where P _i＝{e_i1,e_i2,...,e_im }. Four QoS metric functions are given to any one transmission path in the network model, namely delay function delay (P), bandwidth (P), packet loss rate loss (P) and load (P).

bandwidth(P_i)＝min{bandwidth(e_ik),e_ik∈P_i}

Definition 4 (load) for any transmission path P _i from source node s to destination node d, hop count Hop (P) for the path, the load calculation formula is as follows:

As shown in fig. 5, the specific implementation flow of the S104 deep reinforcement learning algorithm provided in the embodiment of the present invention includes:

The Deep Q-Network (DQN) in Deep reinforcement learning is used for route planning, and the transmission efficiency L _r and the communication quality L _q are selected as unmanned aerial vehicle task Network message transmission route planning targets in consideration of the problem that the route planning is difficult due to various QoS requirements of unmanned aerial vehicle task Network task messages. The transmission efficiency L _r is measured by the delay, the packet loss rate and the bandwidth of the message transmission, and the communication quality L _q is measured by the delay, the bandwidth, the packet loss rate and the load of the link. The objective functions L _r and L _q are calculated:

Lq＝w₁*delay(P_i)+w₂*bandwidth(P_i)+w₃*loss(P_i)+w₄*load(P_i)

w₁+w₂+w₃+w₄=1

The state space contains link situation information and message transmission requirements, defined as o= [ O ₁,o₂,...,o_k]^T, where o_n＝[delay(e_ik),bandwidth(e_ik),loss(e_ik),load(e_ik),D,B,Loss,Load]^T is the set of situation information and message transmission requirements for the current task intent for each segment of link e _ik on transmission path P _i. The action space contains the next hop node information for the message transmission, defined as a _n＝(next_node_n). To improve the transmission efficiency and communication quality of task message transmission, the reward function is defined as:

R(s_t,a_t)＝α*L_r+β*L_q

α+β=1

The intelligent body autonomously perceives the situation of the surrounding environment to realize situation perception, and an effective transmission path is adaptively generated according to planning decisions and model training. In order to obtain an optimal transmission path, it is ensured that task intention requirements are met, and the system needs to dynamically interact with a network environment. Each sampling interval generates a set of state information when acquiring a state space. And a nonlinear approximation method of the neural network is selected to approximate the Q function, so that the dimension disaster is solved. Each intelligent node has a convolutional neural network (Convolutional Neural Networks, CNN) with the same structure, and the state action values under the high-dimensional continuous state are fitted by training the CNN.

The Q function of node n at time t is expressed asIndicating that a node is in stateThe maximum length accumulated prize value that can be achieved by executing action a down. The Q value update equation is expressed as

Where α is the learning rate and γ is the discount factor.

Wherein, the The weighting factor of the predicted CNN for node n.

Target Q value is

Wherein, the Is the weighting coefficient of the target CNN model.

To update the Q value at each time instant, CNN composed of an input layer, a convolution layer (Conv), a flattening layer, a full connection layer (FC), and an output layer is used. Wherein Conv has 16 filters with a size of 10×4 and a stride of 2, tan h is used as an activation function, FC has 512 units, the activation function is sigmoid, and the activation function of the output layer is a linear activation function. The training process of CNN adopts gradient descent method, and the gradient of loss function is expressed as

Different experiences in the DQN experience pool have different influences on agent strategy learning, and a priority experience playback mechanism is introduced for fully utilizing high-quality experience to improve model convergence efficiency. The priority p _j of each empirical sample j is determined from the state estimation error (TD error) delta _j for each time step. Experiences with large TD error will be played back more frequently. To avoid that the empirical sample j cannot be sampled due to delta _j being 0, a small positive number ζ is added to it

p_j＝|δ_j|+ζ

The playback probability p (j) of experience j is:

Where α is a parameter controlling priority, when α is 0, the priority empirical playback becomes random sampling. To mitigate the loss of sample diversity, a random priority sampling approach is used. Since the biased sampling introduces the bias, the importance sampling weight w _j is used to correct the bias in order to ensure that the biased sampling learning strategy is the same as the uniform sampling strategy. The normalization is performed using the maximum weight value in consideration of stability. Beta is an annealing factor that will anneal from an initial value to 1 at the end of learning, acting together with alpha to correct the bias, n is the number of samples.

In the training process, transmission paths are randomly selected from the first K moments, and data are filled into an experience pool. And then the intelligent agent selects actions by using a greedy strategy based on the information of the state space, and experience samples of the interaction of the intelligent agent and the environment are continuously stored in an experience pool, so that the intelligent agent can fully explore the surrounding environment. The on-line decision result of the CNN is the optimal transmission path of the task message available at the current moment, and in the learning stage, the priority weight is randomly selected after being segmented according to the experience sampling number, and the experience corresponding to the priority weight is selected.

The technical effects of the present invention will be described in detail with reference to simulation.

The simulation adopts a modularized design, constructs an unmanned aerial vehicle task network simulation platform, simulates in a QT 5.15.2 platform, and uses Pycharm to realize intelligent route planning based on a deep reinforcement learning algorithm. Through interaction of the two, task message QoS requirements and network situation information are fully considered, dynamic self-adaptive route planning is realized, and the dynamic self-adaptive route planning is compared with AODV and OLSR routing protocols.

As shown in fig. 6, the average route generation time of different task messages in the same network scene is counted, and the average route generation time of the route planning method of the invention is relatively stable and basically controlled within 1.2 seconds. The method can rapidly plan the routing strategy in the dynamic unmanned aerial vehicle task network environment, and has good instantaneity. In contrast, the average route generation time of AODV and OLSR routing protocols fluctuates greatly.

1. For the problem of poor dynamic adaptation, the invention introduces an intention driving network, translates unmanned aerial vehicle tasks into message QoS requirements and generates a routing strategy based on a task intention translation module and a situation awareness module. And then continuously monitoring the network state in real time, and when the generated routing strategy no longer meets the QoS requirement of the message due to the change of the network situation or the link fault, re-performing the routing planning to adjust and optimize the routing strategy so as to reduce the influence caused by the dynamic change of the network or the link fault and improve the dynamic adaptability of the routing planning.

2. Aiming at the problems of complex calculation and poor model training effect, the invention uses deep reinforcement learning to carry out intelligent route planning, combines the neural network into an intelligent agent in a reinforcement learning framework, solves the high-dimensional decision problem of the unmanned aerial vehicle task network, ensures the accuracy of route planning, improves the network message transmission efficiency, and reduces delay and cost.

It should be noted that the embodiments of the present invention can be realized in hardware, software, or a combination of software and hardware. The hardware portions may be implemented using dedicated logic and the software portions may be stored in a memory for execution by a suitable instruction execution system, such as a microprocessor or dedicated design hardware. Those of ordinary skill in the art will appreciate that the apparatus and methods described above may be implemented using computer executable instructions and/or embodied in processor control code, such as provided on a carrier medium such as a magnetic disk, CD or DVD-ROM, a programmable memory such as read only memory (firmware), or a data carrier such as an optical or electronic signal carrier. The device of the present invention and its modules may be implemented by hardware circuitry, such as very large scale integrated circuits or gate arrays, semiconductors such as logic chips, transistors, etc., or programmable hardware devices such as field programmable gate arrays, programmable logic devices, etc., as well as software executed by various types of processors, or by a combination of the above hardware circuitry and software, such as firmware.

The foregoing is merely illustrative of specific embodiments of the present invention, and the scope of the invention is not limited thereto, but any modifications, equivalents, improvements and alternatives falling within the spirit and principles of the present invention will be apparent to those skilled in the art within the scope of the present invention.

Claims

1. A UAV mission network message transmission routing planning system based on deep reinforcement learning, characterized by including a mission intent translation module, a situational awareness module, and a routing planning module;

The mission intent translation module obtains key mission intent information from the drone business or drone mission input by the user through knowledge extraction technology, and combines the strategy knowledge graph and network situation knowledge graph to perform network demand mapping to obtain the QoS requirements of related mission messages;

The situation awareness module collects key data of network situation information and centrally processes and manages the collected link delay, packet loss rate, bandwidth and load-related data, comprehensively evaluates the situation of each path in the network, and forms a resource situation database to provide a basis for the subsequent routing planning module;

The routing planning module analyzes the QoS requirements and network link status information in different network task intentions, and uses deep reinforcement learning to obtain a transmission path suitable for network messages from the source node to the destination node, thereby improving network message transmission efficiency, reducing latency and costs; at the same time, it adjusts and optimizes when the network situation or node status changes to ensure the transmission of task messages.

2. The deep reinforcement learning-based UAV mission network message transmission routing planning system according to claim 1 is characterized in that the mission intent translation module is divided into two submodules: a knowledge extraction module and a requirement mapping module; the knowledge extraction module extracts key knowledge from the input intent;

The input intent undergoes named entity recognition to extract key entity information from irregular Chinese intents and outputs the translated intent tuple to achieve standardized representation.

The demand mapping module combines the extracted intention tuples with the strategy knowledge graph and the network situation knowledge graph to perform network demand mapping, obtains relevant task message requirements, and sends the results to the routing planning module to provide a basis for it.

3. The UAV mission network message transmission routing planning system based on deep reinforcement learning according to claim 1 is characterized in that the situational awareness module performs six steps: data acquisition, data processing and detailed representation, information transmission, information fusion, and generation and maintenance of the situation database;

The situational awareness module first realizes the collection of data related to link delay, packet loss rate, bandwidth and load during network operation to achieve data collection;

The collected data is then processed and refined, and written into data packets for information transmission, sharing, and fusion processing to form link situation information, with the goal of comprehensively evaluating the situation of each path.

Finally, a situation database is generated and dynamically maintained, and the link situation information in the database is the data basis for the subsequent routing planning module.

4. The deep reinforcement learning-based UAV mission network message transmission routing planning system according to claim 1, characterized in that the routing planning module uses the node set in the network, the intent tuple obtained by the mission intent translation module, the mission message transmission QoS requirement information, and the link situation information obtained by the situation awareness module as initial inputs to the routing planning module;

The routing planning module uses a deep reinforcement learning algorithm to search for the optimal transmission path among multiple reachable paths P(s,d)={ _P1 , _P2 , _P3 , ..., _Pn } from the source node s to the destination node d in the network topology graph based on the QoS requirements of the current message transmission and link status information. The obtained path meets the QoS requirements of the task message and replans the transmission path that does not meet the QoS requirements when the link status changes, ensuring the real-time and accurate transmission of the task message.

5. A method for planning network message transmission routes for UAV missions based on deep reinforcement learning according to the system of any one of claims 1 to 4, characterized in that it comprises the following steps:

S101, the user inputs the drone mission and sends it to the mission intention translation module;

S102: The UAV mission is translated into a mission intent by the mission intent translation module, and the knowledge extraction module extracts knowledge to achieve a standardized representation. The strategy mapping module outputs the intent as a relevant mission message requirement, and the result is sent to the routing planning module to provide a basis for it.

S103, the situation awareness module collects key data of network situation information and processes and manages it to form a resource situation database, providing a data basis for the routing planning module;

S104: Based on the QoS requirements of the message transmission required by the current task and the link status information, the routing planning module uses a deep reinforcement learning algorithm to search for the optimal transmission path from the multiple reachable paths P(s,d) = { _P1 , _P2 , _P3 , ..., _Pn } from the source node s to the destination node d in the network topology graph. The resulting path meets the QoS requirements of the task message.

S105 , if the network situation changes or a link fails, adjustments and optimizations are performed to reduce the impact of the network dynamic changes or link failures on network message transmission.

6. A method for network message transmission routing planning for UAV missions based on deep reinforcement learning according to claim 5, characterized in that in S101, the user inputs the UAV mission or business through the front-end interface, and then sends it to the mission intent translation module for translation processing.

7. The method for planning network message transmission routes for UAV missions based on deep reinforcement learning according to claim 5, wherein S102 specifically comprises:

The user inputs the task intention in natural language;

In the first stage, the knowledge extraction module extracts key knowledge from the input intent. If the given intent is complete and valid, named entity recognition is performed to extract key entity information from the irregular Chinese intent and output the translated intent tuple to achieve standardized representation.

In the second stage, the strategy mapping module combines the extracted intent tuples with the strategy knowledge graph and the network status knowledge graph to map the network requirements, obtain relevant intent requirements, and send the results to the routing planning module in S104 for further configuration.

8. The method for planning network message transmission routes for UAV missions based on deep reinforcement learning according to claim 5, wherein S103 specifically comprises:

The situation awareness module includes six links, namely data collection, data processing and detailed representation, information transmission, information fusion, and generation and maintenance of situation database;

The situation awareness module first collects data related to the network situation during network operation, realizes data collection, and then processes and refines the collected data and writes it into data packets for information transmission, sharing and fusion processing to form link situation information;

The information fusion process is as follows:

Assume G = (V, E) to represent the network topology model, where V represents the set of all network nodes in the network model, E represents the set of links between network nodes, and each link represents a direct path between two adjacent network nodes. Assume the source node is s (s∈V) and the destination node is d (d∈V), with multiple transmission paths P(s, d) = {P ₁ ,P ₂ ,P ₃ ,...,P _n }, where P _i = {e _i1 ,e _i2 ,...,e _im }. For any transmission path in the network model, four QoS metric functions are given: delay (P), bandwidth (P), packet loss (P), and load (P).

Delay Definition 1: For any transmission path P _i from source node s to destination node d, the delay is calculated as follows:

Where, delay(e _ik ) is the delay of each link e _ik on the transmission path P _i ;

Bandwidth Definition 2: For any transmission path P _i from source node s to destination node d, the bandwidth is calculated as follows:

bandwidth(P _i )=min{bandwidth(e _ik ),e _ik ∈P _i },

Where, bandwidth(e _ik ) is the bandwidth of each link e _ik on the transmission path P _i ;

Packet loss rate definition 3: For any transmission path P _i from source node s to destination node d, the packet loss rate is calculated as follows:

Where loss(e _ik ) is the packet loss rate of each link e _ik on the transmission path P _i ;

Load Definition 4: For any transmission path P _i from source node s to destination node d, the load calculation formula is as follows:

Where Hop(P _i ) is the number of hops of path P _i , load(e _ik ) is the load of each link e _ik on the transmission path P _i ;

9. The method for planning network message transmission routes for UAV missions based on deep reinforcement learning according to claim 8, wherein S104 specifically comprises:

Using the deep Q-network in deep reinforcement learning for route planning, we select transmission efficiency L _r and communication quality L _q as the objectives of message transmission routing planning for the UAV mission network. Transmission efficiency L _r is measured by the message transmission path delay, packet loss rate, and bandwidth, and communication quality L _q is measured by the link delay, bandwidth, packet loss rate, and load. Calculate the objective functions L _r and L _q as follows:

L _q ＝w ₁ *delay(P _i )+w ₂ *bandwidth(P _i )+w ₃ *loss(P _i )+w ₄ *load(P _i ),

w ₁ +w ₂ +w ₃ +w ₄ = 1,

Where loss(P _i ) represents the packet loss rate of the transmission path, bandwidth(P _i ) represents the bandwidth of the transmission path, delay(P _i ) represents the delay of the transmission path, load(P _i ) represents the load of the transmission path, w ₁ ,w ₂ ,w ₃ ,w ₄ represent the weights of the link quality;

The state space contains link status information and message transmission requirements, and is defined as O = [o ₁ ,o ₂ ,…, _ok ] ^T , where o _n = [delay(e _ik ), bandwidth(e _ik ), loss(e _ik ), load(e _ik ), D, B, Loss, Load] ^T is the set of status information of each link e _ik on the transmission path P _i and the message transmission requirements of the current task intention; the action space contains the next hop node information of the message transmission, and is defined as a _n = (next_node _n ); the reward function R(s _t ,a _t ) is defined as:

R(s _t ,a _t )=α*L _r +β*L _q ,

α+β＝1,

Among them, α and β are the weights of transmission efficiency L _r and communication quality L _q ;

The intelligent agent autonomously perceives the surrounding environment to achieve situational awareness and adaptively generates effective transmission paths based on planning decisions and model training. When acquiring the state space, each sampling interval generates a set of state information. A nonlinear approximation method of a neural network is selected to approximate the Q function. Each intelligent node has a convolutional neural network with the same structure. By training the CNN, the state action value under the high-dimensional continuous state is fitted.

The Q function of node n at time t is expressed as Indicates that node n is in state The maximum long-term accumulated reward value that can be obtained by performing action a under the condition of , the Q value update equation is expressed as

Among them, α is the learning rate and γ is the discount factor;

in, is the weight coefficient of the predicted CNN of node n;

The target Q value is

in, is the weight coefficient of the target CNN model;

The CNN consisting of input layer, convolution layer (Conv), flattening layer, fully connected layer (FC) and output layer is used to update the Q value at each moment. Tanh is used as the activation function; the FC activation function is sigmoid; the activation function of the output layer is a linear activation function; the CNN training process uses the gradient descent method, and the gradient of the loss function is expressed as

The priority pj of each experience sample j is determined by the state estimation error _δj at each time step. Experiences with large state estimation errors will be replayed more frequently. A small positive number ζ is added to avoid the failure of experience sample j to be sampled due to _δj being 0.

p _j = |δ _j |+ζ

The replay probability p(j) of experience j is:

Among them, α is the parameter that controls priority. When α is 0, the priority experience replay becomes random sampling. Random priority sampling is used to alleviate the loss of sampling diversity. Importance sampling weights _wj are used to correct bias. The maximum weight value is used for normalization. β is the annealing factor. At the end of learning, β will anneal from the initial value to 1 and work together with α to correct bias. n is the number of samples.

During the training process, transmission paths are randomly selected in the first K moments to fill the experience pool with data. Afterwards, the agent uses a greedy strategy to select actions based on the information in the state space and continuously stores experience samples of the agent's interaction with the environment in the experience pool, allowing the agent to fully explore the surrounding environment. The CNN's online decision result is the optimal transmission path for the task message available at the current moment. During the learning phase, priority weights are randomly selected after being segmented according to the number of experience samples, and the experience corresponding to the weight is selected.

10. The method for planning network message transmission routes for UAV missions based on deep reinforcement learning according to claim 9, wherein the specific steps of S105 are:

The network status is monitored in real time. When the network situation changes or a link fails, causing the transmission path to no longer meet the QoS requirements of task message transmission, DQN is used to re-plan the route and adjust and optimize the transmission path based on the current network situation information to reduce the impact of dynamic network changes or link failures, enabling the network to have real-time decision-making and optimization capabilities.