
CN113485103A - Aircraft conflict resolution method based on deep reinforcement learning - Google Patents

Aircraft conflict resolution method based on deep reinforcement learning Download PDF

Info

Publication number
CN113485103A
Authority
CN
China
Prior art keywords
module
conflict
aircraft
environment
reinforcement learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110729530.XA
Other languages
Chinese (zh)
Inventor
Han Yunxiang (韩云祥)
Zhang Jianwei (张建伟)
He Aiping (何爱平)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University filed Critical Sichuan University
Priority to CN202110729530.XA priority Critical patent/CN113485103A/en
Publication of CN113485103A publication Critical patent/CN113485103A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05BCONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
    • G05B13/04Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators
    • G05B13/042Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Traffic Control Systems (AREA)

Abstract

The invention provides an aircraft conflict resolution method based on deep reinforcement learning. Built on the deep deterministic policy gradient (DDPG) algorithm, it constructs the agent's components and the conflict scenarios through Gym, the OpenAI open-source reinforcement learning environment interface, and uses the DDPG algorithm to learn a resolution strategy. The resolution actions of the aircraft agent cover heading-angle, flight-speed and altitude adjustments, and its state mainly consists of multi-dimensional descriptions such as position information and speed. The proposed algorithm substantially helps resolve aircraft conflicts in air traffic control and can reduce controller workload.

Description

Aircraft conflict resolution method based on deep reinforcement learning
Technical Field
The invention relates to the technical field of civil aviation intelligent air traffic control, in particular to an aircraft conflict resolution method based on deep reinforcement learning.
Background
In 2019, the annual passenger throughput of Chinese airports exceeded 1.3 billion, reaching 1,351.629 million passengers, an increase of 6.9% over the previous year. According to forecasts by the International Air Transport Association (IATA), the number of global air passengers will reach 8.2 billion by 2037, of which 1.6 billion will be Chinese passengers. To relieve this enormous traffic pressure, various air traffic flow management aids and technologies have emerged, such as airport collaborative decision-making systems, AMAN/DMAN systems, remote towers, and conflict detection and resolution technologies. Efficient conflict detection and resolution is the primary task in safeguarding flight safety, and is especially important for complex, high-density airspace environments. It is of great significance for maintaining flight order, preventing aircraft collisions, relieving air traffic pressure, and ensuring air traffic safety.
Disclosure of Invention
To address the problems of over-simplified models, poor algorithmic adaptivity and low efficiency in the prior art, the invention provides an aircraft conflict resolution method based on deep reinforcement learning. A conflict scenario model is built on the open-source air traffic control platform OpenScope, communication with the aircraft agent is realised through the Gym interface, and the deep deterministic policy gradient (DDPG) algorithm is used to train the aircraft agent to complete the conflict resolution task. Compared with existing heuristic algorithms, the invention accounts for environmental uncertainty, such as errors in the manoeuvres taken during conflict resolution, and builds the simulation environment on the OpenAI reinforcement learning interface Gym, which makes the training process simpler and more efficient.
To achieve this purpose, the invention adopts the following technical scheme: an aircraft conflict resolution method based on deep reinforcement learning, comprising a conflict environment generation module, an agent communication module and a DDPG reinforcement learning algorithm module. The conflict environment generation module comprises an environment modeling sub-module and a conflict scenario design sub-module; the agent communication module comprises a Gym interface communication sub-module and an OpenScope air traffic control sub-module; the DDPG reinforcement learning algorithm module comprises a policy network sub-module (Actor), a value network sub-module (Critic) and a historical data experience pool sub-module.
The environment modeling sub-module is used to model the reinforcement learning environment, including the setting and management of parameters such as the airspace range, flight start point, target point, flight speed and traffic density.
The conflict scenario design sub-module can design different types of preset conflict scenarios for the aircraft agent, including head-on conflicts and lateral crossing conflicts. A head-on conflict is in fact a special case of a crossing conflict in which the heading-angle difference is a straight angle (180 degrees); the sub-module can design various crossing conflicts according to heading-angle differences of different magnitudes. The heading-angle difference is the angle between the heading angles of the two aircraft, where the heading angle is defined as the angle between the projection of the aircraft's longitudinal axis onto the horizontal plane and the geographic meridian, measured from the meridian, positive towards the east, with a value range of plus or minus 180 degrees.
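As a minimal illustration of the heading-angle geometry described above, the following Python sketch classifies an encounter as head-on or crossing from the difference between two heading angles; the function names, the 5-degree tolerance and the printed examples are illustrative assumptions rather than part of the disclosure.

```python
def heading_difference(heading_a, heading_b):
    """Smallest signed difference between two headings, in degrees, in (-180, 180]."""
    return (heading_b - heading_a + 180.0) % 360.0 - 180.0

def classify_conflict(heading_a, heading_b, tol=5.0):
    """Label an encounter by the heading-angle difference.

    A difference close to 180 degrees is treated as a head-on conflict
    (the straight-angle special case described above); anything else is
    a crossing conflict. The tolerance is an assumed value.
    """
    diff = abs(heading_difference(heading_a, heading_b))
    return "head-on" if abs(diff - 180.0) <= tol else "crossing"

# Example: an aircraft flying east (090) against one flying west (270) is head-on.
print(classify_conflict(90.0, 270.0))   # -> head-on
print(classify_conflict(90.0, 150.0))   # -> crossing
```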
The Gym interface communication sub-module handles communication between the aircraft agent and other aircraft, including position and heading information, and also communicates with the DDPG reinforcement learning algorithm module: from a global (God's-eye) view it sends the state information of all aircraft to the algorithm module, so that the agent can be better trained to learn conflict avoidance actions.
The OpenScope air traffic control sub-module provides a human-computer interaction interface and a control interface, and realises flight control of the aircraft agent, such as control of heading, speed and altitude. The air traffic control environment is mainly an approach control airspace, and each aircraft is constrained by the flight performance of its type. The agent continuously interacts with the environment to obtain feedback values from it, and learns the resolution strategy through the DDPG algorithm.
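The interaction loop described above can be sketched with the standard Gym API. The environment id `ConflictResolution-v0` and the classic four-tuple `step` return are assumptions for illustration; in practice the registered OpenScope-backed environment and the trained DDPG policy would take the place of the random action used here.

```python
import gym

# Hypothetical environment id; assumes the OpenScope-backed conflict
# environment has been registered with Gym under this name.
env = gym.make("ConflictResolution-v0")

obs = env.reset()
done = False
total_reward = 0.0
while not done:
    # A trained DDPG actor would map obs to heading, speed and altitude
    # adjustments; random sampling is used here only to show the loop.
    action = env.action_space.sample()
    obs, reward, done, info = env.step(action)  # feedback value from the environment
    total_reward += reward
print("episode return:", total_reward)
```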
The policy network sub-module is part of the DDPG algorithm. It learns the mapping from the agent's state to its action, which is commonly called the policy network; specifically, the output of the network is the optimal action for the state currently fed into the network. In addition, to make the learning process more stable and the network weights update more steadily, a Target policy network is introduced and the original policy network is called the Online policy network; after a fixed number of updates, the weights of the Online policy network are copied to the Target network.
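A minimal sketch of the periodic hard copy from the Online policy network to the Target network, assuming PyTorch modules; the copy interval is an illustrative value, since the text only specifies "a fixed number of updates".

```python
import torch

def hard_update(target_net: torch.nn.Module, online_net: torch.nn.Module) -> None:
    """Copy the Online network weights into the Target network."""
    target_net.load_state_dict(online_net.state_dict())

COPY_INTERVAL = 100  # assumed value; the text only says "fixed updating times"
# Inside the training loop one might write:
# if step % COPY_INTERVAL == 0:
#     hard_update(target_actor, online_actor)
#     hard_update(target_critic, online_critic)
```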
The value network sub-module mainly evaluates the actions produced by the policy network: it learns the mapping from state and action to the Q value, and the policy network then learns the resolution strategy according to this Q value. Consistent with the policy network sub-module, a Target value network and an Online value network are likewise introduced.
The historical data experience pool sub-module mainly performs two functions, storage and sampling. Storage refers to saving the agent's historical trajectory, i.e. the state, the action, the reward and the next state, to which a flag indicating whether the current task has been completed can be added as needed. Sampling refers to the agent drawing historical trajectories in batches of a given size during learning.
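A minimal experience-pool sketch covering the two functions named above, storage and batch sampling; the transition fields follow the description (state, action, reward, next state, optional done flag), while the capacity value is an assumption.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity experience pool: stores transitions, samples mini-batches."""

    def __init__(self, capacity=100_000):        # capacity is an assumed value
        self.buffer = deque(maxlen=capacity)     # oldest samples are dropped first

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```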
The agent learns the optimal resolution strategy by maximising the accumulated reward R_t at time t, formulated as follows:

R_t = Σ_{i=t}^{T} γ^(i−t) · r(s_i, a_i)
where s_i and a_i denote the state and the action respectively, r(s_i, a_i) is a single reward value, and γ is the discount factor that expresses how important future rewards are. In the policy network module the behaviour policy is deterministic, so the action a_t at time t is obtained directly from the function:

a_t = μ(s_t ∣ θ^μ)
where μ denotes the mapping function from state to action, s_t is the state at time t, and θ^μ is the weight parameter of the policy network. The model is solved by iterative optimisation of the Bellman equation to obtain the optimal strategy, expressed as:

Q^μ(s_t, a_t) = E[ r(s_t, a_t) + γ · Q^μ(s_{t+1}, μ(s_{t+1})) ]
where Q^μ(s_t, a_t) is the action-value function of the policy, i.e. an evaluation of how good it is to take action a_t in state s_t, and r(s_t, a_t) is the immediate reward at time t. The value network further evaluates the overall policy through the expectation of the Q value over the states s_t, formulated as:

J_β(μ) = E_{s∼ρ^β}[ Q^μ(s, μ(s)) ]
where ρ^β is the distribution function of the state s, representing the probability of the agent being in a given state at a given moment, and the evaluation of the policy is expressed by the function J_β. The parameters of the policy network are then optimised along the gradient of this objective:

∇_{θ^μ} J ≈ (1/N) Σ_i ∇_a Q(s, a ∣ θ^Q)|_{s=s_i, a=μ(s_i)} · ∇_{θ^μ} μ(s ∣ θ^μ)|_{s=s_i}
where θ^Q and θ^μ are the parameters of the value network and the policy network respectively, and N is the number of samples drawn from the agent's experience. The value network itself is updated by minimising the following loss against the Bellman target:

L = (1/N) Σ_i ( y_i − Q(s_i, a_i ∣ θ^Q) )²,   with   y_i = r(s_i, a_i) + γ · Q′(s_{i+1}, μ′(s_{i+1} ∣ θ^{μ′}) ∣ θ^{Q′})
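The formulas above can be condensed into one DDPG update step. The sketch below assumes PyTorch, an actor network mapping states to actions, a critic taking a (state, action) pair, and pre-built target copies and optimisers; the discount factor and all names are illustrative, not part of the disclosure.

```python
import torch
import torch.nn.functional as F

def ddpg_update(actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, batch, gamma=0.99):
    """One gradient step for the online actor and critic (illustrative values)."""
    states, actions, rewards, next_states, dones = batch  # tensors from the experience pool

    # Critic: minimise (y_i - Q(s_i, a_i))^2 with y_i = r + gamma * Q'(s_{i+1}, mu'(s_{i+1}))
    with torch.no_grad():
        next_actions = target_actor(next_states)
        y = rewards + gamma * (1.0 - dones) * target_critic(next_states, next_actions)
    critic_loss = F.mse_loss(critic(states, actions), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: follow the deterministic policy gradient, i.e. maximise Q(s, mu(s))
    actor_loss = -critic(states, actor(states)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
    return critic_loss.item(), actor_loss.item()
```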
compared with the prior art, the invention has the following beneficial effects:
1. The deep reinforcement learning technique combines the feature-fitting capability of deep learning with the autonomous decision-making capability of reinforcement learning, and can effectively address the problems of the existing models, such as a single, rigid resolution mode and low efficiency.
2. The deep reinforcement learning technique considers conflicts in both the horizontal and vertical directions as well as heading and speed adjustments, designs a variety of practical reward values (an illustrative sketch follows this list), and thereby extends the resolution strategy to various conflict scenarios.
3. The deep reinforcement learning technique does not depend on an accurate aircraft dynamics model; its multi-dimensional state and action conflict resolution approach better matches the actual command habits of air traffic controllers and effectively copes with uncertainty in external conditions and in the aircraft's operating state.
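As an illustration of the reward design mentioned in point 2, a hedged sketch of a shaped reward is given below; the separation thresholds, weights and bonus values are assumptions chosen for readability, not values disclosed by the invention.

```python
def shaped_reward(separation_nm, vertical_sep_ft, altitude_ft,
                  sector_floor_ft, sector_ceiling_ft, reached_exit):
    """Illustrative reward shaping; all thresholds and weights are assumed."""
    reward = -0.1                                      # small step cost: resolve conflicts promptly
    if separation_nm < 5.0 and vertical_sep_ft < 1000.0:
        reward -= 10.0                                 # loss-of-separation (conflict) penalty
    if not sector_floor_ft <= altitude_ft <= sector_ceiling_ft:
        reward -= 2.0                                  # penalty for exceeding sector altitude limits
    if reached_exit:
        reward += 5.0                                  # bonus for completing the resolution task
    return reward
```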
Drawings
FIG. 1 is a schematic diagram of the algorithm of the present invention;
FIG. 2 is a system operation diagram of the sub-modules of the present invention.
Detailed Description
In order to better understand the technical principles of the present invention, the present invention will be further described with reference to the accompanying drawings.
As shown in fig. 1, the aircraft conflict resolution method based on the DDPG algorithm architecture comprises four sub-modules, specifically: the policy network sub-module, the value network sub-module, the historical data experience pool sub-module and the simulation environment module.
The policy network sub-module comprises an online policy network and a copy (target) policy network. The online policy network interacts with the environment and learns in real time: the agent takes the current state as input, and the policy network outputs the corresponding action. The copy policy network is mainly used to stabilise the training process, i.e. its parameters are fixed and refreshed periodically so that the policy network parameters are updated smoothly. The policy network updates its parameters by computing the policy gradient

∇_{θ^μ} J ≈ (1/N) Σ_i ∇_a Q(s, a ∣ θ^Q)|_{s=s_i, a=μ(s_i)} · ∇_{θ^μ} μ(s ∣ θ^μ)|_{s=s_i}

which combines the gradient of the value network with respect to the action and the gradient of the policy network with respect to its own parameters;
the value network sub-module comprises an online value network and a duplicate value network, wherein the online value network is used for evaluating the advantages and disadvantages of the current strategy, and the duplicate value network is used for stabilizing the updating process of the parameters of the value network and is realized by fixing the parameters of the network periodically. The input of the value network is a binary group formed by the current state and the action output by the strategy network, the output is a corresponding state value function V value or action value function Q value, and the network calculates the Bellman equation to obtain a target value yiThe difference with the output value of the value network is used as a loss function, and the network parameters are optimized through the gradient value;
the historical data experience pool submodule is mainly used for storing and updating a sample library, wherein one sample is a quadruple and specifically comprises the state of an agent, the action of the agent, the reward value generated by the interaction of the agent and the environment and the next state of the agent. The capacity of the experience pool is relatively fixed, the upper limit value of the sample capacity is set, the number of samples is continuously increased along with the continuous interaction of the intelligent agent and the environment, and when the number of the samples exceeds the threshold, the samples which are the longest in distance from the current time are automatically removed, so that the updating of the sample library is realized.
The simulation environment module mainly refers to the constructed conflict scenario. The agent's learning environment is realised through the Gym interface, and the air traffic control environment is built on the open-source platform OpenScope. First, the airspace of the airport approach area is mapped, and the longitude and latitude coordinates of the airport fixes are projected into plane coordinates through a coordinate transformation. Next, the internal structure of the agent is built in Gym, including the implementation of components such as the state set, the action space and the state update. Finally, the constructed environment is registered with Gym and a conflict scenario library is defined.
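A skeleton of how such an environment might be defined and registered with Gym; the class name, environment id and space dimensions are hypothetical, and the coupling to OpenScope is reduced to placeholder comments.

```python
import gym
import numpy as np
from gym import spaces
from gym.envs.registration import register

class ConflictEnv(gym.Env):
    """Skeleton conflict-resolution environment (dimensions are assumed)."""

    def __init__(self):
        # Observation: normalised position, speed and heading of own and intruder aircraft.
        self.observation_space = spaces.Box(low=-1.0, high=1.0, shape=(8,), dtype=np.float32)
        # Action: normalised heading-angle, speed and altitude adjustments.
        self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(3,), dtype=np.float32)

    def reset(self):
        # Here the OpenScope-backed conflict scenario would be (re)initialised.
        return np.zeros(8, dtype=np.float32)

    def step(self, action):
        # Here the action would be forwarded to OpenScope and the new state read back.
        obs = np.zeros(8, dtype=np.float32)
        reward, done, info = 0.0, False, {}
        return obs, reward, done, info

# Register the environment so that gym.make("ConflictResolution-v0") works;
# a "module:ClassName" entry-point string is the more common form.
register(id="ConflictResolution-v0", entry_point=ConflictEnv)
```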
As shown in fig. 2, the system working diagram of the present invention comprises a conflict environment generation module, an agent communication module and a DDPG reinforcement learning algorithm module. The conflict environment generation module comprises an environment modeling sub-module and a conflict scenario design sub-module; the agent communication module comprises a Gym interface communication sub-module and an OpenScope air traffic control sub-module; the DDPG reinforcement learning algorithm module comprises a policy network sub-module (Actor), a value network sub-module (Critic) and a historical data experience pool sub-module. The conflict environment generation module communicates with the algorithm module through the agent communication module.

Claims (4)

1. An aircraft conflict resolution method based on deep reinforcement learning, characterized by comprising a conflict environment generation module, an agent communication module and a DDPG reinforcement learning module;
(1) the conflict environment generation module comprises an environment modeling sub-module and a conflict scenario design sub-module;
(2) the agent communication module comprises a Gym interface communication sub-module and an OpenScope air traffic control sub-module;
(3) the DDPG reinforcement learning module comprises a policy network sub-module (Actor), a value network sub-module (Critic) and a historical data experience pool sub-module.
2. The method of claim 1, wherein each module further comprises:
(1) the environment modeling sub-module is used to model the reinforcement learning environment, including the setting and management of parameters such as the airspace range, flight start point, target point, flight speed and traffic density;
(2) the conflict scenario design sub-module can design different types of preset conflict scenarios for the aircraft agent, including head-on conflicts and lateral crossing conflicts; the Gym interface communication sub-module handles communication between the aircraft agent and other aircraft, including position and heading information;
(3) the OpenScope air traffic control sub-module provides a human-computer interaction simulation environment and a control interface, and realises flight control of the aircraft agent, such as control of heading, speed and altitude.
3. The method according to claim 2, wherein the simulation environment module is a constructed conflict scenario; the agent's learning environment is realised through the Gym interface and the air traffic control environment is built on the open-source platform OpenScope; the airspace of the airport approach area is mapped and the longitude and latitude coordinates of the airport fixes are projected into plane coordinates through a coordinate transformation; the internal structure of the agent is built in Gym, including the implementation of components such as the state set, the action space and the state update.
4. The deep-reinforcement-learning aircraft conflict resolution method of claim 1, 2 or 3, comprising:
(1) the simulated airspace is complex, with different altitude limits in each sector; an agent that avoids a conflict by adjusting its altitude but exceeds these limits is penalised to a certain extent;
(2) the action space of the agent comprises heading-angle adjustment, altitude adjustment and flight-speed adjustment, constrained by the performance parameters of the BADA aircraft model;
(3) the state space of the agent comprises several dimensions such as position information, flight speed and heading angle, which are normalised before training to accelerate the convergence of the network.
CN202110729530.XA 2021-06-29 2021-06-29 Aircraft conflict resolution method based on deep reinforcement learning Pending CN113485103A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110729530.XA CN113485103A (en) 2021-06-29 2021-06-29 Aircraft conflict resolution method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110729530.XA CN113485103A (en) 2021-06-29 2021-06-29 Aircraft conflict resolution method based on deep reinforcement learning

Publications (1)

Publication Number Publication Date
CN113485103A true CN113485103A (en) 2021-10-08

Family

ID=77936359

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110729530.XA Pending CN113485103A (en) 2021-06-29 2021-06-29 Aircraft conflict resolution method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN113485103A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104216416A (en) * 2014-08-26 2014-12-17 北京航空航天大学 Aircraft conflict resolution method and equipment
CN108803656A (en) * 2018-06-12 2018-11-13 南京航空航天大学 A kind of flight control method and system based on complicated low latitude
US20200372809A1 (en) * 2019-05-21 2020-11-26 International Business Machines Corporation Traffic control with reinforcement learning
FR3103615A1 (en) * 2019-11-25 2021-05-28 Thales DECISION AID DEVICE AND PROCEDURE FOR THE MANAGEMENT OF AIR CONFLICTS
WO2021105055A1 (en) * 2019-11-25 2021-06-03 Thales Decision assistance device and method for managing aerial conflicts
CN111882027A (en) * 2020-06-02 2020-11-03 东南大学 Robot Reinforcement Learning Training Environment System for RoboMaster AI Challenge
CN111882047A (en) * 2020-09-28 2020-11-03 四川大学 A fast anti-collision method for air traffic control based on reinforcement learning and linear programming

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ANTON BLABERG et al.: "Simulating ADS-B Attacks in Air Traffic Management", 2020 AIAA/IEEE 39th Digital Avionics Systems Conference *
JIANG Bo et al.: "Waypoint flight conflict resolution based on deep reinforcement learning" (基于深度强化学习的航路点飞行冲突解脱), Aeronautical Computing Technique (航空计算技术) *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114373337A (en) * 2022-01-17 2022-04-19 北京航空航天大学 Flight conflict autonomous releasing method under flight path uncertainty condition
CN114373337B (en) * 2022-01-17 2022-11-22 北京航空航天大学 Flight conflict autonomous releasing method under flight path uncertainty condition
CN114415737A (en) * 2022-04-01 2022-04-29 天津七一二通信广播股份有限公司 Implementation method of unmanned aerial vehicle reinforcement learning training system
CN115240475A (en) * 2022-09-23 2022-10-25 四川大学 Method and device for aircraft approach planning by fusing flight data and radar images
CN116822618A (en) * 2023-08-30 2023-09-29 北京汉勃科技有限公司 Deep reinforcement learning exploration method and assembly based on dynamic noise network

Similar Documents

Publication Publication Date Title
CN113485103A (en) Aircraft conflict resolution method based on deep reinforcement learning
CN111536979B (en) A UAV inspection path planning method based on stochastic optimization
CN110502032A (en) A Behavioral Control-Based Method for UAV Swarm Formation Flight
CN108459616B (en) A route planning method for UAV swarm cooperative coverage based on artificial bee colony algorithm
Dong et al. Study on the resolution of multi-aircraft flight conflicts based on an IDQN
CN115060263A (en) Flight path planning method considering low-altitude wind and energy consumption of unmanned aerial vehicle
CN111045445A (en) Aircraft intelligent collision avoidance method, equipment and medium based on reinforcement learning
CN113593308A (en) Intelligent approach method for civil aircraft
CN114791743A (en) A collaborative trajectory planning method for UAV swarms considering communication delay
CN115357044A (en) Method, equipment and medium for planning inspection path of unmanned aerial vehicle cluster distribution network line
Zhu et al. Multi-constrained intelligent gliding guidance via optimal control and DQN
Wu et al. Multi-phase trajectory optimization for an aerial-aquatic vehicle considering the influence of navigation error
Li et al. A warm-started trajectory planner for fixed-wing unmanned aerial vehicle formation
CN114967732A (en) Method and device for formation and aggregation of unmanned aerial vehicles, computer equipment and storage medium
CN116772848A (en) A green real-time planning method for four-dimensional flight trajectories in aircraft terminal areas
CN119088073A (en) A UAV swarm mission planning algorithm based on hierarchical multi-agent deep reinforcement learning and its evaluation method
CN116880541A (en) An adaptive conflict relief method for drones in urban scenes
CN116414149A (en) An online avoidance system of no-fly zone for aircraft based on deep reinforcement learning
CN116795138A (en) A multi-UAV intelligent trajectory planning method for data collection
Li et al. Fast formation transformation and obstacle avoidance control for multi-agent system
CN115774455A (en) Distributed unmanned cluster trajectory planning method for avoiding deadlock in complex obstacle environment
CN114879490A (en) Iterative optimization and control method for unmanned aerial vehicle perching maneuver
CN115686076A (en) Unmanned aerial vehicle path planning method based on incremental development depth reinforcement learning
Huang et al. Optimization of Path Planning Algorithm in Intelligent Air Traffic Management System
Chen et al. Rerouting planning of suborbital debris hazard zone based on reinforcement learning DDPG

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20211008