Disclosure of Invention
In order to overcome the defects in the prior art, the invention aims to provide an unmanned aerial vehicle task network message transmission route planning system and method based on deep reinforcement learning, which use the DQN in deep reinforcement learning to carry out route planning and solve the "curse of dimensionality" caused by the high dynamics of the unmanned aerial vehicle task network. The method realizes dual driving by task intention and network situation: under the condition of limited network resources, in order to ensure the delivery of high-value task messages, the DQN analyzes the QoS requirements contained in the task intents of different unmanned aerial vehicle task networks together with the situation information of the links, and obtains a transmission path suitable for the network message from the source node to the destination node, so as to improve the transmission efficiency of network messages and reduce delay and cost. Meanwhile, when the network situation or the node state changes, the transmission path is adjusted and optimized, so that the task message is transmitted timely and reliably.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
the unmanned aerial vehicle task network message transmission route planning system based on deep reinforcement learning comprises a task intention translation module, a situation awareness module and a route planning module;
The task intention translation module obtains task intention key information from unmanned aerial vehicle service or unmanned aerial vehicle task input by a user through a knowledge extraction technology, and performs network demand mapping by combining a strategy knowledge graph and a network situation knowledge graph to obtain QoS demands of related task messages;
the situation awareness module collects key data of network situation information, performs centralized processing and management on the collected link delay, packet loss rate, bandwidth and load related data, comprehensively evaluates the situation of each path in the network, forms a resource situation database and provides a basis for a subsequent route planning module;
The route planning module obtains a transmission path suitable for the network message from a source node to a destination node by analyzing the QoS requirements in different network task intents and the situation information of the network links and using deep reinforcement learning, so as to improve the transmission efficiency of network messages and reduce delay and cost. Meanwhile, when the network situation or the node state changes, the route planning module adjusts and optimizes the transmission path so as to ensure that the task message is transmitted reliably and in time.
The task intention translation module is divided into a knowledge extraction module and a demand mapping module, wherein the knowledge extraction module carries out key knowledge extraction on input intention;
the knowledge extraction module applies named entity recognition to the input intention, extracts the key entity information in irregular Chinese intents, and outputs translated intention tuples, so that a normalized representation is realized;
And the demand mapping module performs network demand mapping on the extracted intention tuple in combination with a strategy knowledge graph and a network situation knowledge graph to obtain related task message demands, and sends the result to the route planning module to provide basis for the route planning module.
Further, the situation awareness module comprises 6 links: data acquisition, data processing, refinement characterization, information transmission, information fusion, and generation and maintenance of a situation database;
the situation awareness module firstly collects network situation related data (parameters such as network bandwidth, time delay, packet loss rate and channel load) during network operation, thereby realizing data acquisition;
Then the acquired data are processed and refined into characterizations, written into data packets, and subjected to information transmission sharing and information fusion processing to form link situation information, so that the situation of each path is comprehensively evaluated;
And finally, generating a situation database and performing active maintenance, wherein the link situation information in the database is the data base of the subsequent route planning module.
Further, in the route planning module, a node set in the network, an intention tuple obtained by the task intention translation module, task message transmission QoS requirement information and link situation information obtained by the situation awareness module are used as initial inputs of the route planning module;
The route planning module searches for an optimal transmission path among the multiple reachable paths P(s, d) = {P1, P2, P3, ..., Pn} from a source node s to a destination node d of the network topology graph through a deep reinforcement learning algorithm, according to the QoS requirements and link situation information of the message currently to be transmitted; the obtained path meets the QoS requirements of the task message, and when the link situation changes, a transmission path that no longer meets the QoS requirements is re-planned in time by the algorithm, so that the real-time and accurate transmission of the task message is ensured.
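As a concrete illustration of the search space, the reachable-path set P(s, d) can be enumerated on a small topology and filtered against a QoS bound. A minimal Python sketch follows; the node names, link delays and the 50 ms delay budget are invented for the example and are not values fixed by the invention:

```python
# Illustrative only: node names, link delays and the 50 ms budget are
# invented; they are not values fixed by the invention.

topology = {          # adjacency list: node -> list of neighbor nodes
    "s": ["a", "b"],
    "a": ["d"],
    "b": ["a", "d"],
    "d": [],
}
link_delay_ms = {     # per-link delay, summed along a path
    ("s", "a"): 30, ("s", "b"): 10,
    ("a", "d"): 10, ("b", "a"): 10, ("b", "d"): 25,
}

def simple_paths(topo, src, dst, path=None):
    """Yield every loop-free path from src to dst."""
    path = (path or []) + [src]
    if src == dst:
        yield path
        return
    for nxt in topo[src]:
        if nxt not in path:
            yield from simple_paths(topo, nxt, dst, path)

def path_delay(path):
    """End-to-end delay of a path: the sum of its link delays."""
    return sum(link_delay_ms[e] for e in zip(path, path[1:]))

reachable = list(simple_paths(topology, "s", "d"))        # P(s, d)
feasible = [p for p in reachable if path_delay(p) <= 50]  # QoS filter
best = min(feasible, key=path_delay)                      # lowest delay
```

Exhaustive enumeration is only tractable on toy topologies; the invention replaces it with a learned policy precisely because the path set explodes in a highly dynamic network.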
Further, the deep reinforcement learning algorithm in the routing planning module performs routing planning by using the DQN, selects transmission efficiency and communication quality as planning targets, wherein the transmission efficiency is measured by time delay, packet loss rate and bandwidth of message transmission, and the communication quality is measured by time delay, bandwidth, packet loss rate, load and jitter of a link;
The state space of the deep reinforcement learning algorithm comprises link situation information and task message requirements, and the action space comprises the next-hop node for message transmission. The agent autonomously perceives the situation of the surrounding environment to realize situation awareness, and adaptively generates an effective transmission path according to planning decisions and model training. In order to obtain the optimal transmission path and ensure that the task intention requirements are met, the system needs to interact dynamically with the network environment; a group of state information is generated at each sampling interval when the state space is acquired, and because the high dynamic mobility of the unmanned aerial vehicles makes the amount of state information huge, a nonlinear approximation method based on a neural network is selected to approximate the Q function, thereby overcoming the curse of dimensionality. Each intelligent node has a convolutional neural network (Convolutional Neural Networks, CNN) with the same structure, and the state-action values in the high-dimensional continuous state are fitted by training the CNN.
Different experiences in the experience pool of the algorithm have different influences on the strategy learning of the agent, and a priority experience playback mechanism is introduced to make full use of high-quality experiences and improve model convergence efficiency. The priority of each experience sample is determined by the state estimation error (TD error) of the sample at each time step of the DQN algorithm. In the early stage of the training process, transmission paths are selected randomly and the experience pool is filled with data. The agent then selects actions by using a greedy strategy based on the information of the state space, and the experience samples of the interaction between the agent and the environment are continuously stored in the experience pool, so that the environment can be fully explored. The online decision result of the CNN is the optimal transmission path of the task message available at the current moment; in the learning stage, the priority range is segmented according to the number of experience samples to be drawn, a priority weight is randomly selected within each segment, and the experience corresponding to that weight is selected.
An unmanned aerial vehicle task network message transmission route planning method based on deep reinforcement learning comprises the following steps:
S101, a user inputs an unmanned aerial vehicle task and sends the unmanned aerial vehicle task to a task intention translation module;
S102, the unmanned aerial vehicle task is given a standardized representation by the task intention translation module through knowledge extraction by the knowledge extraction module, the intention is output as related task message requirements through the requirement mapping of the strategy mapping module, and the result is sent to the route planning module to provide a basis for route planning;
S103, the situation awareness module collects key data of the network situation information, processes and manages the key data to form a resource situation database, and provides a data basis for the routing planning module;
S104, the route planning module searches for an optimal transmission path among the multiple reachable paths P(s, d) = {P1, P2, P3, ..., Pn} from a source node s to a destination node d of the network topology graph through a deep reinforcement learning algorithm, according to the QoS requirements and link situation information of the message to be transmitted by the current task, and the obtained path meets the QoS requirements of the task message;
and S105, if the network situation changes or the link fails, the adjustment and optimization are carried out, and the influence of the network dynamic change or the link failure on the network message transmission is reduced.
In S101, the user inputs the unmanned aerial vehicle task or service through the front-end interface, and the task or service is then sent to the task intention translation module for translation.
The step S102 specifically includes:
The user inputs task intention in a natural language form;
In the first stage, the knowledge extraction module extracts key knowledge from the input intention; if the given intention is complete and effective, the key entity information in the irregular Chinese intention is extracted through named entity recognition, and translated intention tuples are output to realize a normalized representation;
And in the second stage, the strategy mapping module performs network demand mapping on the extracted intention tuples in combination with the strategy knowledge graph and the network situation knowledge graph to obtain the related intention demands, and sends the result to the route planning module for further use in S104.
The step S103 specifically includes:
the situation awareness module comprises 6 links, namely data acquisition, data processing, refinement characterization, information transmission, information fusion, and generation and maintenance of a situation database;
The situation awareness module firstly collects network situation related data (network bandwidth, time delay, packet loss rate and channel load parameters) during network operation, thereby realizing data acquisition; the acquired data are then processed and refined into characterizations, written into data packets, and subjected to information transmission sharing and fusion processing to form link situation information;
The information fusion process is as follows:
Let G = (V, E) represent the network topology model, wherein V represents the set of all network nodes in the network model and E represents the set of links between network nodes, each link representing a direct path between two adjacent network nodes. Let the source node be s (s ∈ V) and the destination node be d (d ∈ V), with multiple transmission paths P(s, d) = {P1, P2, P3, ..., Pn}, wherein Pi = {ei1, ei2, ..., eim}. Four QoS metric functions are given for any transmission path in the network model, namely the delay function delay(P), the bandwidth function bandwidth(P), the packet loss rate function loss(P) and the load function load(P);
Definition 1 (delay): for any transmission path Pi from source node s to destination node d, the delay calculation formula is as follows:
delay(Pi) = ∑{delay(eik), eik ∈ Pi},
wherein delay(eik) is the time delay of each segment of link eik on the transmission path Pi;
definition 2 (bandwidth) for any transmission path P i from source node s to destination node d, the bandwidth calculation formula is as follows:
bandwidth(Pi)=min{bandwidth(eik),eik∈Pi},
wherein bandwidth (e ik) is the bandwidth of each segment of link e ik on transmission path P i;
Definition 3 (packet loss rate): for any transmission path Pi from source node s to destination node d, the packet loss rate calculation formula is as follows:
loss(Pi) = 1 − ∏{1 − loss(eik), eik ∈ Pi},
wherein loss(eik) is the packet loss rate of each segment of link eik on the transmission path Pi;
Definition 4 (load): for any transmission path Pi from source node s to destination node d, the load calculation formula is as follows:
load(Pi) = (1/Hop(Pi)) · ∑{load(eik), eik ∈ Pi},
wherein Hop(Pi) is the hop count of path Pi and load(eik) is the load of each segment of link eik on the transmission path Pi;
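The four path-level metrics of Definitions 1–4 can be sketched directly in Python; the per-link values below are invented example numbers, not data from the invention:

```python
import math

# Per-link situation data for one path Pi = {ei1, ..., eim}; the numbers
# are invented examples.
links = [  # (delay_ms, bandwidth_mbps, loss_rate, load) for each link eik
    (10.0, 100.0, 0.01, 0.30),
    (20.0, 50.0, 0.02, 0.50),
    (5.0, 80.0, 0.00, 0.20),
]

delay_Pi = sum(d for d, _, _, _ in links)      # Definition 1: delays add
bandwidth_Pi = min(b for _, b, _, _ in links)  # Definition 2: bottleneck
# Definition 3: a packet is delivered only if every hop delivers it
loss_Pi = 1.0 - math.prod(1.0 - l for _, _, l, _ in links)
# Definition 4: average load over the Hop(Pi) links of the path
load_Pi = sum(x for _, _, _, x in links) / len(links)
```

Note the different aggregation per metric: delay is additive, bandwidth is the bottleneck minimum, loss compounds multiplicatively, and load averages over the hop count.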
And finally, generating a situation database and performing active maintenance, wherein the link situation information in the database is the data base of the subsequent route planning module.
The step S104 specifically includes:
Route planning is performed by using the Deep Q-Network (DQN) in deep reinforcement learning. Considering the problem that route planning is made difficult by the various QoS requirements of unmanned aerial vehicle task network task messages, the transmission efficiency Lr and the communication quality Lq are selected as the unmanned aerial vehicle task network message transmission route planning targets, wherein the transmission efficiency Lr is measured by the time delay, packet loss rate and bandwidth of the message transmission path, and the communication quality Lq is measured by the time delay, bandwidth, packet loss rate and load of the links. The objective functions Lr and Lq are calculated as weighted sums of these metrics:
Lr=v1*delay(Pi)+v2*bandwidth(Pi)+v3*loss(Pi),
v1+v2+v3=1,
Lq=w1*delay(Pi)+w2*bandwidth(Pi)+w3*loss(Pi)+w4*load(Pi),
w1+w2+w3+w4=1,
where delay(Pi) represents the delay of the transmission path, bandwidth(Pi) its bandwidth, loss(Pi) its packet loss rate and load(Pi) its load, v1, v2, v3 are the weights of the transmission efficiency terms, and w1, w2, w3, w4 are the weights of the link quality terms.
The state space contains link situation information and task message transmission requirements, and is defined as O = [o1, o2, ..., ok]T, wherein on = [delay(eik), bandwidth(eik), loss(eik), load(eik), D, B, Loss, Load]T is the set of the situation information of each link eik on a transmission path Pi and the message transmission requirements of the current task intention; the action space contains the next-hop node information of message transmission and is defined as an = (next_noden); and the reward function R(st, at) is defined as:
R(st,at)=α*Lr+β*Lq,
α+β=1,
Wherein α and β are weights of the transmission efficiency L r and the communication quality L q;
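A minimal numeric sketch of the reward R(st, at) = α*Lr + β*Lq follows; every metric value and every weight here is an assumption for illustration (the invention fixes only the constraints w1+w2+w3+w4 = 1 and α+β = 1), and the weighted-sum form assumed for Lr simply mirrors the given form of Lq:

```python
# All numeric values and weights below are invented for illustration; the
# invention fixes only the constraints w1+w2+w3+w4 = 1 and alpha+beta = 1.

delay, bandwidth, loss, load = 35.0, 50.0, 0.03, 0.33   # path metrics

w1, w2, w3, w4 = 0.4, 0.3, 0.2, 0.1                     # Lq weights
Lq = w1 * delay + w2 * bandwidth + w3 * loss + w4 * load

# Assumed analogous weighted-sum form for the transmission efficiency Lr
v1, v2, v3 = 0.5, 0.3, 0.2
Lr = v1 * delay + v2 * bandwidth + v3 * loss

alpha, beta = 0.6, 0.4                                  # alpha + beta = 1
R = alpha * Lr + beta * Lq                              # reward R(st, at)
```

In practice the raw metrics would typically be normalized to comparable scales before weighting; the sketch skips that step for brevity.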
The agent autonomously perceives the situation of the surrounding environment to realize situation awareness, and adaptively generates an effective transmission path according to planning decisions and model training. In order to obtain the optimal transmission path and ensure that the task intention requirements are met, the system needs to interact dynamically with the network environment; a group of state information is generated at each sampling interval when the state space is acquired, and a nonlinear approximation method based on a neural network is selected to approximate the Q function so as to overcome the curse of dimensionality, the state-action values in the high-dimensional continuous state being fitted by training the CNN;
The Q function of node n at time t is expressed as Qn(st, a), indicating the maximum long-term accumulated reward value that node n can obtain by executing action a in state st. The Q value update equation is expressed as
Qn(st, at) ← Qn(st, at) + α[R(st, at) + γ·maxa′ Qn(st+1, a′) − Qn(st, at)],
Wherein α is the learning rate and γ is the discount factor;
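A single step of this Q value update can be sketched in tabular form; the states, actions, reward and initial Q values below are made up for the example:

```python
# Toy tabular version of the Q update; states, actions, reward and the
# initial Q values are made up.

alpha, gamma = 0.1, 0.9          # learning rate and discount factor

Q = {("s1", "a"): 2.0, ("s2", "a"): 4.0, ("s2", "b"): 1.0}
state, action, reward, next_state = "s1", "a", 1.0, "s2"

best_next = max(Q[(next_state, act)] for act in ("a", "b"))  # max over a'
td_error = reward + gamma * best_next - Q[(state, action)]   # TD error
Q[(state, action)] += alpha * td_error   # move Q toward the TD target
```

The same TD error δ later determines each sample's priority in the experience pool; the CNN replaces the table when the state space becomes high-dimensional and continuous.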
To minimize the gap between the estimated Q value and the target Q value, the loss function of the DQN is defined as
L(θn) = E[(yt − Q(st, at; θn))^2],
wherein θn is the weight coefficient of the prediction CNN of node n;
The target Q value is
yt = R(st, at) + γ·maxa′ Q(st+1, a′; θn⁻),
wherein θn⁻ is the weight coefficient of the target CNN model;
The Q value at each moment is updated using a CNN consisting of an input layer, a convolutional layer (Conv), a flattening layer, a fully connected layer (FC) and an output layer, wherein Conv has 16 filters of size 10×4 with stride 2 and uses tanh as its activation function, FC has 512 units with a sigmoid activation function, and the activation function of the output layer is a linear activation function. The training process of the CNN adopts the gradient descent method, and the gradient of the loss function is expressed as
∇θn L(θn) = E[(yt − Q(st, at; θn)) · ∇θn Q(st, at; θn)].
Different experiences in the DQN experience pool have different influences on agent strategy learning, and a priority experience playback mechanism is introduced to make full use of high-quality experiences and improve model convergence efficiency. The priority pj of each experience sample j is determined by the state estimation error (TD error) δj at each time step. Experiences with a large TD error are played back more frequently. To avoid that an experience sample j cannot be sampled because δj is 0, a small positive number ζ is added to it:
pj=|δj|+ζ
The playback probability P(j) of experience j is:
P(j) = (pj)^α / ∑i (pi)^α,
wherein α is a parameter controlling the priority; when α is 0, priority experience playback degenerates into random sampling. A random priority sampling mode is used to relieve the loss of sampling diversity. Since biased sampling introduces a bias, the bias is corrected by the importance sampling weight
wj = (1/(N·P(j)))^β / maxi wi
in order to ensure that the learning strategy under biased sampling is the same as that under uniform sampling; the maximum weight value is used for normalization in consideration of stability, β is an annealing factor that is annealed from an initial value to 1 at the end of learning and acts together with α to correct the bias, and N is the number of samples.
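The priority, playback probability and importance sampling weights described above can be sketched as follows; the TD errors, ζ and both exponents are invented example values:

```python
# Invented TD errors and hyperparameters; zeta keeps every priority
# strictly positive so no sample becomes unsamplable.

td_errors = [0.0, 2.0, 0.5, 1.5]
zeta = 0.01
alpha_p = 0.6        # priority exponent (the alpha of P(j))
beta = 0.4           # annealing factor of the importance weights
N = len(td_errors)

p = [abs(d) + zeta for d in td_errors]          # p_j = |delta_j| + zeta
total = sum(pj ** alpha_p for pj in p)
P = [pj ** alpha_p / total for pj in p]         # P(j) = p_j^a / sum p_i^a

w_raw = [(1.0 / (N * Pj)) ** beta for Pj in P]  # bias-correcting IS weight
w_max = max(w_raw)
w = [wj / w_max for wj in w_raw]                # normalize by the maximum
```

The sample with the largest TD error gets the highest playback probability, while the rarest sample receives the largest (normalized) importance weight, which is exactly the bias correction the text describes.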
During training, transmission paths are randomly selected at the first K moments to fill the experience pool with data; the agent then selects actions based on the information of the state space by using a greedy strategy, and the experience samples of the interaction between the agent and the environment are continuously stored in the experience pool so that the agent can fully explore the surrounding environment. In the learning stage, the priority range is segmented according to the number of experience samples to be drawn, a priority weight is randomly selected within each segment, and the experience corresponding to that weight is selected; the online decision result of the CNN is the optimal transmission path of the task message available at the current moment.
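The greedy action selection mentioned above is commonly realized as an ε-greedy rule that balances exploration and exploitation; a hedged sketch, where the Q values, node names and the helper choose_next_hop are illustrative assumptions:

```python
import random

# The Q values, node names and the helper choose_next_hop are invented
# for illustration.

def choose_next_hop(q_values, epsilon, rng):
    """epsilon-greedy: explore with probability epsilon, else exploit."""
    if rng.random() < epsilon:
        return rng.choice(list(q_values))     # random next hop (explore)
    return max(q_values, key=q_values.get)    # argmax Q (exploit)

q = {"node_a": 0.2, "node_b": 0.9, "node_c": 0.5}
rng = random.Random(0)

greedy_hop = choose_next_hop(q, 0.0, rng)     # epsilon = 0: pure greedy
random_hop = choose_next_hop(q, 1.0, rng)     # epsilon = 1: pure explore
```

In practice ε starts high during the pool-filling phase and decays as learning proceeds, matching the random-then-greedy schedule described in the text.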
The specific steps of S105 are as follows:
When the network situation changes or the link fails and the transmission path no longer meets the QoS requirement of task message transmission, the DQN is used for carrying out routing planning again in combination with the current network situation information and adjusting and optimizing the transmission path so as to reduce the influence caused by the dynamic change of the network or the link failure and enable the network to have real-time decision making and optimizing capability.
A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the steps of the deep reinforcement learning based unmanned aerial vehicle task network message transmission route planning method.
An information data processing system provides a machine learning environment, has self-learning and self-generating capabilities, and provides an intelligent foundation for the method.
The invention has the beneficial effects that:
The invention can realize deep understanding of the unmanned aerial vehicle task intention, automatically plan the message transmission path as needed, realize accurate and efficient distribution of unmanned aerial vehicle task messages, adaptively adjust the task message distribution path according to the network environment that changes in real time, improve the efficiency of the unmanned aerial vehicle task network, and meet the high-performance requirements of tasks in a dynamic environment. Meanwhile, it can help managers or users adjust network resource deployment in time, thereby improving the adaptability and reliability of the unmanned aerial vehicle task network.
The present invention introduces a priority experience playback mechanism. The priority of each experience sample is determined by the state estimation error (TD error) at each time step. In the early stage of the training process, transmission paths are selected randomly and the experience pool is filled with data. The agent then selects actions by using a greedy strategy based on the information of the state space, and the experience samples of the interaction between the agent and the environment are continuously stored in the experience pool, so that the environment can be fully explored. The online decision result of the CNN is the optimal transmission path of the task message available at the current moment; in the learning stage, the priority range is segmented according to the number of experience samples to be drawn, a priority weight is randomly selected within each segment, and the experience corresponding to that weight is selected.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings.
Aiming at the problems in the prior art, the invention provides an unmanned aerial vehicle task network message transmission route planning system and method based on deep reinforcement learning, and the invention is described in detail below with reference to the accompanying drawings.
As shown in fig. 1, the system comprises the task intention translation module, the situation awareness module and the route planning module. The task intention translation module obtains task intention key information through a knowledge extraction technology, and performs network demand mapping by combining the strategy knowledge graph and the network situation knowledge graph to obtain the related task message demands. The situation awareness module collects key data of network situation information, performs centralized processing and management on the collected data, comprehensively evaluates the situation of each path in the network, and forms a resource situation database, so as to provide a basis for the subsequent route planning module. The route planning module obtains a transmission path suitable for the network message from a source node to a destination node by analyzing the QoS requirements and link situation information in different network task intents and using a deep reinforcement learning algorithm, so as to improve the transmission efficiency of network messages and reduce delay and cost. Meanwhile, when the network situation or the node state changes, the transmission path is adjusted and optimized, so that the task message is transmitted timely and reliably.
As shown in fig. 2, the method for planning a task network message transmission route of an unmanned aerial vehicle based on deep reinforcement learning according to the embodiment of the invention specifically includes the following implementation steps:
s101, a user inputs an unmanned aerial vehicle task and sends the unmanned aerial vehicle task to a task intention translation module.
S102, the unmanned aerial vehicle task is given a standardized representation by the task intention translation module through knowledge extraction, the intention is output as related task message requirements through requirement mapping, and the result is sent to the route planning module to provide a basis for route planning.
S103, the situation awareness module collects key data of the network situation information, processes and manages the key data to form a resource situation database, and provides a data basis for the routing planning module.
S104, the route planning module searches for an optimal transmission path among the multiple reachable paths P(s, d) = {P1, P2, P3, ..., Pn} from a source node s to a destination node d of the network topology graph through a deep reinforcement learning algorithm, according to the QoS requirements and link situation information of the message to be transmitted by the current task, and the obtained path meets the QoS requirements of the task message.
And S105, if the network situation changes or the link fails, the adjustment and optimization are carried out, and the influence of the network dynamic change or the link failure on the network message transmission is reduced.
As shown in fig. 3, the specific implementation flow of the task intent translation module in S101 provided in the embodiment of the present invention includes:
The user inputs the task intention in natural language form; the knowledge extraction module extracts key knowledge from the input intention, and if the given intention is complete and effective, the key entity information in the irregular Chinese intention is extracted through named entity recognition, and translated intention tuples are output to realize a standardized characterization. The strategy mapping module then performs network demand mapping on the extracted intention tuples in combination with the strategy knowledge graph and the network situation knowledge graph to obtain the related intention demands, and sends the result to the route planning module for further use in S104.
As shown in fig. 4, a specific implementation flow of the situation awareness module in S103 provided by the embodiment of the present invention includes:
The module comprises 6 links, namely data acquisition, data processing, refinement characterization, information transmission, information fusion, and generation and maintenance of a situation database. The situation awareness module firstly collects network situation related data (network bandwidth, time delay, packet loss rate and channel load parameters) during network operation, then processes and refines the collected data into characterizations, writes them into data packets, and performs transmission sharing and fusion processing to form link situation information.
The centralized fusion process of the situation information is as follows. Let G = (V, E) represent the network topology model, wherein V represents the set of all network nodes in the network model and E represents the set of links between network nodes, each link representing a direct path between two adjacent network nodes. Let s (s ∈ V) be the source node and d (d ∈ V) be the destination node, with multiple transmission paths P(s, d) = {P1, P2, P3, ..., Pn}, where Pi = {ei1, ei2, ..., eim}. Four QoS metric functions are given for any transmission path in the network model, namely the delay function delay(P), the bandwidth function bandwidth(P), the packet loss rate function loss(P) and the load function load(P).
Definition 1 (delay): for any transmission path Pi from source node s to destination node d, the delay calculation formula is as follows:
delay(Pi) = ∑{delay(eik), eik ∈ Pi}
definition 2 (bandwidth) for any transmission path P i from source node s to destination node d, the bandwidth calculation formula is as follows:
bandwidth(Pi)=min{bandwidth(eik),eik∈Pi}
Definition 3 (packet loss rate): for any transmission path Pi from source node s to destination node d, the packet loss rate calculation formula is as follows:
loss(Pi) = 1 − ∏{1 − loss(eik), eik ∈ Pi}
Definition 4 (load): for any transmission path Pi from source node s to destination node d, with Hop(Pi) the hop count of the path, the load calculation formula is as follows:
load(Pi) = (1/Hop(Pi)) · ∑{load(eik), eik ∈ Pi}
And finally, generating a situation database and performing active maintenance, wherein the link situation information in the database is the data base of the subsequent route planning module.
As shown in fig. 5, the specific implementation flow of the S104 deep reinforcement learning algorithm provided in the embodiment of the present invention includes:
The Deep Q-Network (DQN) in deep reinforcement learning is used for route planning, and the transmission efficiency Lr and the communication quality Lq are selected as the unmanned aerial vehicle task network message transmission route planning targets, in consideration of the problem that route planning is made difficult by the various QoS requirements of unmanned aerial vehicle task network task messages. The transmission efficiency Lr is measured by the delay, packet loss rate and bandwidth of the message transmission, and the communication quality Lq is measured by the delay, bandwidth, packet loss rate and load of the links. The objective functions Lr and Lq are calculated as weighted sums of these metrics:
Lr=v1*delay(Pi)+v2*bandwidth(Pi)+v3*loss(Pi)
v1+v2+v3=1
Lq=w1*delay(Pi)+w2*bandwidth(Pi)+w3*loss(Pi)+w4*load(Pi)
w1+w2+w3+w4=1
The state space contains link situation information and message transmission requirements, defined as O = [o1, o2, ..., ok]T, where on = [delay(eik), bandwidth(eik), loss(eik), load(eik), D, B, Loss, Load]T is the set of the situation information of each segment of link eik on transmission path Pi and the message transmission requirements of the current task intention. The action space contains the next-hop node information for the message transmission, defined as an = (next_noden). To improve the transmission efficiency and communication quality of task message transmission, the reward function is defined as:
R(st,at)=α*Lr+β*Lq
α+β=1
The agent autonomously perceives the situation of the surrounding environment to realize situation awareness, and adaptively generates an effective transmission path according to planning decisions and model training. In order to obtain the optimal transmission path and ensure that the task intention requirements are met, the system needs to interact dynamically with the network environment. A group of state information is generated at each sampling interval when the state space is acquired. A nonlinear approximation method based on a neural network is selected to approximate the Q function, thereby overcoming the curse of dimensionality. Each intelligent node has a convolutional neural network (Convolutional Neural Networks, CNN) with the same structure, and the state-action values in the high-dimensional continuous state are fitted by training the CNN.
The Q function of node n at time t is expressed as Qn(st, a), indicating the maximum long-term accumulated reward value that node n can obtain by executing action a in state st. The Q value update equation is expressed as
Qn(st, at) ← Qn(st, at) + α[R(st, at) + γ·maxa′ Qn(st+1, a′) − Qn(st, at)]
Where α is the learning rate and γ is the discount factor.
To minimize the gap between the estimated Q value and the target Q value, the loss function of the DQN is defined as
L(θn) = E[(yt − Q(st, at; θn))^2]
where θn is the weight coefficient of the prediction CNN of node n.
The target Q value is
yt = R(st, at) + γ·maxa′ Q(st+1, a′; θn⁻)
where θn⁻ is the weight coefficient of the target CNN model.
To update the Q value at each moment, a CNN composed of an input layer, a convolutional layer (Conv), a flattening layer, a fully connected layer (FC) and an output layer is used, wherein Conv has 16 filters of size 10×4 with stride 2 and uses tanh as its activation function, FC has 512 units with a sigmoid activation function, and the activation function of the output layer is a linear activation function. The training process of the CNN adopts the gradient descent method, and the gradient of the loss function is expressed as
∇θn L(θn) = E[(yt − Q(st, at; θn)) · ∇θn Q(st, at; θn)].
Different experiences in the DQN experience pool have different influences on agent strategy learning, and a priority experience playback mechanism is introduced to make full use of high-quality experiences and improve model convergence efficiency. The priority pj of each experience sample j is determined by the state estimation error (TD error) δj at each time step. Experiences with a large TD error are played back more frequently. To avoid that an experience sample j cannot be sampled because δj is 0, a small positive number ζ is added to it:
pj=|δj|+ζ
The replay probability P(j) of experience j is

$$P(j) = \frac{p_j^{\alpha}}{\sum_k p_k^{\alpha}}$$

where α is a parameter controlling the degree of prioritization; when α is 0, prioritized experience replay degenerates into uniform random sampling. To mitigate the loss of sample diversity, a stochastic priority sampling approach is used. Since biased sampling introduces bias, importance sampling weights $w_j$ are used to correct it, so that the strategy learned under biased sampling matches that learned under uniform sampling:

$$w_j = \frac{\left(N \cdot P(j)\right)^{-\beta}}{\max_i w_i}$$

where N is the number of samples and β is an annealing factor that is annealed from an initial value to 1 by the end of learning, acting together with α to correct the bias. For stability, the weights are normalized by the maximum weight value.
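The prioritized replay quantities defined above can be sketched directly. A minimal NumPy illustration with made-up TD error values (the symbols $p_j$, α, β, ζ, N follow the text; the specific numbers are hypothetical):

```python
import numpy as np

# Prioritized experience replay quantities:
#   p_j  = |delta_j| + zeta
#   P(j) = p_j^alpha / sum_k p_k^alpha
#   w_j  = (N * P(j))^(-beta), normalized by the maximum weight

delta = np.array([0.0, 0.5, 2.0, 0.1])   # example TD errors, one per experience
zeta, alpha, beta = 1e-3, 0.6, 0.4

p = np.abs(delta) + zeta                 # zeta keeps the delta=0 sample samplable
P = p ** alpha / np.sum(p ** alpha)      # replay probability per experience

N = len(delta)
w = (N * P) ** (-beta)                   # importance sampling weights
w /= w.max()                             # normalize by max weight for stability

assert np.isclose(P.sum(), 1.0)          # valid probability distribution
assert P.argmax() == 2                   # largest |TD error| is replayed most often
assert w.argmax() == 0                   # rarest sample gets the largest IS weight
```

Note the inverse relationship: the experience with the largest TD error is sampled most often but carries the smallest importance weight, which is exactly how the bias from non-uniform sampling is cancelled.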
During training, transmission paths are selected at random for the first K time steps to fill the experience pool with data. The agent then selects actions with an ε-greedy strategy based on the state-space information, and the experience samples generated by agent-environment interaction are continuously stored in the experience pool, so that the agent can fully explore its surroundings. The online decision result of the CNN is the optimal transmission path available for the task message at the current moment. In the learning stage, the priority weights are partitioned into segments according to the number of experience samples to be drawn, a priority weight is selected at random within each segment, and the experience corresponding to that priority weight is replayed.
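The training procedure described above can be outlined as a loop skeleton. This is a pure-Python sketch with trivial stand-ins for the environment and the CNN's greedy choice (names such as `env_step`, `best_path`, and the candidate path list are illustrative, not from the patent):

```python
import random

random.seed(0)

PATHS = ["path_a", "path_b", "path_c"]   # candidate transmission paths (placeholder)
K, STEPS, EPSILON = 5, 20, 0.1           # warm-up length, total steps, exploration rate
replay_pool = []

def env_step(path):
    """Placeholder environment: returns (reward, next_state) for a chosen path."""
    return (1.0 if path == "path_b" else 0.0), "state"

def best_path(state):
    """Placeholder for the CNN's greedy Q-value decision."""
    return "path_b"

state = "state"
for t in range(STEPS):
    if t < K or random.random() < EPSILON:
        action = random.choice(PATHS)    # random warm-up / epsilon exploration
    else:
        action = best_path(state)        # greedy exploitation of the learned Q values
    reward, next_state = env_step(action)
    replay_pool.append((state, action, reward, next_state))
    state = next_state

print(len(replay_pool))  # 20: one (s, a, r, s') tuple stored per step
```

In the real system the stored tuples would then be drawn by the segmented priority sampling described above and used for the CNN gradient updates.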
The technical effects of the present invention will be described in detail with reference to simulation.
The simulation adopts a modular design: an unmanned aerial vehicle task network simulation platform is built on the QT 5.15.2 platform, and PyCharm is used to implement intelligent route planning based on the deep reinforcement learning algorithm. Through the interaction of the two, task message QoS requirements and network situation information are fully considered, dynamic adaptive route planning is realized, and the method is compared with the AODV and OLSR routing protocols.
As shown in fig. 6, the average route generation time of different task messages in the same network scene is counted. The average route generation time of the route planning method of the invention is relatively stable and basically kept within 1.2 seconds, showing that the method can rapidly plan routing strategies in a dynamic unmanned aerial vehicle task network environment and has good real-time performance. In contrast, the average route generation times of the AODV and OLSR routing protocols fluctuate greatly.
1. For the problem of poor dynamic adaptability, the invention introduces an intent-driven network that translates unmanned aerial vehicle tasks into message QoS requirements and generates a routing strategy based on the task intention translation module and the situation awareness module. The network state is then continuously monitored in real time; when the generated routing strategy no longer meets the QoS requirements of the message because the network situation changes or a link fails, route planning is performed again to adjust and optimize the routing strategy, thereby reducing the influence of dynamic network changes or link failures and improving the dynamic adaptability of route planning.
2. For the problems of complex computation and poor model training effect, the invention uses deep reinforcement learning to perform intelligent route planning, incorporating the neural network into the agent of the reinforcement learning framework. This solves the high-dimensional decision problem of the unmanned aerial vehicle task network, ensures the accuracy of route planning, improves network message transmission efficiency, and reduces delay and cost.
It should be noted that the embodiments of the present invention can be realized in hardware, software, or a combination of software and hardware. The hardware portions may be implemented using dedicated logic and the software portions may be stored in a memory for execution by a suitable instruction execution system, such as a microprocessor or dedicated design hardware. Those of ordinary skill in the art will appreciate that the apparatus and methods described above may be implemented using computer executable instructions and/or embodied in processor control code, such as provided on a carrier medium such as a magnetic disk, CD or DVD-ROM, a programmable memory such as read only memory (firmware), or a data carrier such as an optical or electronic signal carrier. The device of the present invention and its modules may be implemented by hardware circuitry, such as very large scale integrated circuits or gate arrays, semiconductors such as logic chips, transistors, etc., or programmable hardware devices such as field programmable gate arrays, programmable logic devices, etc., as well as software executed by various types of processors, or by a combination of the above hardware circuitry and software, such as firmware.
The foregoing is merely illustrative of specific embodiments of the present invention, and the scope of the invention is not limited thereto, but any modifications, equivalents, improvements and alternatives falling within the spirit and principles of the present invention will be apparent to those skilled in the art within the scope of the present invention.