1 School of Astronautics, Beihang University, Beijing 100191, China
* Correspondence: zhongyuan@buaa.edu.cn
Abstract: Aiming at the attack and defense game problem in the target-missile-defender three-body confrontation scenario, intelligent game strategies based on deep reinforcement learning are proposed, including an attack strategy applicable to the attacking missile and an active defense strategy applicable to the target/defender. First, building on classical three-body adversarial research, the reinforcement learning algorithm is introduced to improve the purposefulness of the training. The action space and the reward and punishment conditions of both the attacking and defending sides are considered in the reward function design. Through analysis of the sign of the action space and design of the reward function in an adversarial form, the combat requirements can be satisfied in both missile and target/defender training. Then, a curriculum-based deep reinforcement learning algorithm is applied to train the agents, and a convergent game strategy is obtained. The simulation results show that with the attack strategy the missile can maneuver according to the battlefield situation and successfully hit the target after evading the defender. The active defense strategy enables the less capable target/defender to achieve an effect similar to a network adversarial attack on the missile agent, shielding the target from an attacking missile with superior maneuverability on the battlefield.
Keywords: target-missile-defender engagement; three-body game; curriculum learning; deep reinforcement learning; intelligent game; active defense
adopt new game strategies so as to gain battlefield advantages. Among them, the target-missile-defender (TMD) three-body engagement triggered by active target defense has attracted increasing research interest [1–7]. In a typical three-body confrontation scenario, three types of vehicles are involved: the target (usually a high-value vehicle such as an aircraft or ballistic missile), an attacking missile that attacks the target, and a defender missile launched to intercept the attacking missile. This combat scenario breaks the traditional pursuit-evasion model with greater complexity and provides more possibilities for battlefield games.
The early classical studies of the three-body confrontation problem mainly started from the spatial-geometric relationship. Researchers achieved the goal of defending the target by designing the spatial position of the defender relative to the target and the attacking missile (e.g., placing the defender between the target and the missile). From the line-of-sight (LOS) guidance perspective, a guidance strategy for a defender guarding a target was investigated that enables a defender at a speed and maneuverability disadvantage to intercept the attacking missile [8]. Triangle intercept guidance is another ingenious guidance law based on the idea of LOS command guidance [9]. To avoid the degradation of system performance, or the need for additional high-resolution radar assistance caused by reduced angular resolution at longer distances, a simpler gain form of the LOS angular rate was derived by optimal control, reducing the capability requirements of the sensing equipment [10,11]. Nonlinear control approaches, such as sliding mode control, can also achieve control of the LOS rotation [12].
A more dominant research approach to the three-body problem is optimal control or differential games. The difference between the two is that a guidance law based on optimal control theory needs to know the opponent's control strategy in advance. Although the reliance on a priori information for one-sided optimization can be reduced by sharing information between the target and the defender [13], problems such as the difficulty of applying numerical optimization algorithms online remain. In contrast, the differential game has received more widespread attention because it does not require additional assumptions about the opponent's strategy [14,15]. The differential game obtains the game strategies of the two opponents by finding the saddle-point solution and, under the condition of accurate modeling, can guarantee the optimality of the strategy against the opponent's arbitrary maneuver [16–18]. Considering that the control of the linear quadratic differential game guidance law may exceed the control boundary, the bounded differential game was proposed and verified in the two-dimensional plane and in three-dimensional space [19,20]. The differential game approach can also be applied to analyze the capture and escape regions, and the Hamilton–Jacobi–Isaacs equation can be solved to demonstrate the consistency of the geometric approach with the optimal control approach [21–24]. Based on the analysis of the capture radius, the game can be divided into different stages with corresponding control strategies, and the conditions of stage switching have been analyzed [25,26]. In addition, in order to be
closer to the actual battlefield environment, recent studies have considered strong constraints on capability boundaries [27], state estimation under imperfect information through Kalman filtering [28], relative intercept angle constraints on the attacking requirements [17,29,30], cooperative multi-vehicle attack against an actively defended target [17,31], weapon-target-allocation strategies [32], and so on.
The existing studies basically rely on linearization and order reduction of the model to derive guidance laws that satisfy certain constraints and performance requirements. To simplify the derivation, the vehicles are often assumed to possess ideal dynamics [33,34]. However, as the participating vehicles adopt more advanced game strategies, the battlefield becomes more complex, and the linearization suffers from significant distortion under intense maneuvering confrontations.
Deep reinforcement learning (DRL), developed in recent years, adapts well to complex nonlinear scenarios and shows strong potential in the aerospace field [35], with applications to attitude control of hypersonic vehicles [36], design of missile guidance laws [37,38], asteroid landing [39,40], vehicle path planning [41], and other issues. In addition, many studies have applied DRL to the pursuit-evasion game or the TMD engagement. The problem of cooperative capture of an advanced evader by multiple pursuers was studied in [42] using DRL, a complex and uncertain environment that is difficult for differential game or optimal control methods. In [43], the researchers applied reinforcement learning algorithms to a particle environment in which the attacker was able to evade the defender and eventually capture the target, showing better performance than traditional guidance algorithms. The agents in [42] and [43] all have ideal dynamics with fewer constraints relative to a real vehicle. In [44], from the perspective of the target, reinforcement learning was applied to study the timing of launching defenders, which has the potential to be solved online. Deep reinforcement learning has also been utilized for ballistic missile maneuvering penetration and attacking stationary targets, which can also be considered a three-body problem [6,45]. In addition, adaptive dynamic programming, which is closely related to DRL, has attracted extensive interest in intelligent adversarial games [46–50]. However, the system models studied so far are relatively simple, and few studies are applicable to complex continuous dynamic systems with multiple vehicles [51,52].
Motivated by the previous discussion, we apply DRL algorithms to a three-body engagement and obtain intelligent game strategies for both offensive and defensive confrontations, so that both the attacking missile and the target/defender can combine evasion and interception performance. The strategy for the attacking missile ensures that the missile avoids the defender and hits the target; the strategy for the target/defender ensures that the defender intercepts the missile before it threatens the target. In addition, the DRL-based approach is highly adaptable to nonlinear scenarios and thus has outstanding advantages in further solving more complex multi-body adversarial problems in the future. However, there also exists a gap between the simulation
environment and the real world when applying DRL approaches. Simulation environments can improve sampling efficiency and alleviate safety issues, but difficulties caused by the reality gap are encountered when transferring agent policies to real devices. To address this issue, research applying DRL approaches to the aerospace domain should focus on the following aspects. On the one hand, sim-to-real (Sim2Real) research is used to close the reality gap and thus achieve more effective strategy transfer. The main methods currently utilized for Sim2Real transfer in DRL include domain randomization, domain adaptation, imitation learning, meta-learning, and knowledge distillation [53]. On the other hand, in the simulation phase, the robustness and generalization of the proposed methods should be fully verified, and in the practical application phase, hardware-in-the-loop simulation should be conducted to gradually improve the reliability of applying the proposed method to real devices.
To help the DRL algorithm converge more stably, we introduce curriculum learning into the agent training. The concept of curriculum learning was first introduced at the International Conference on Machine Learning (ICML) in 2009 and caused a great sensation in the field of machine learning [54]. In the following decade, numerous studies on curriculum learning and self-paced learning have been proposed.
The main contributions of this paper are summarized as follows.
(1) Combining the findings of differential games in the traditional three-body game with DRL algorithms gives the agent training a clearer direction, avoids inaccuracies due to model linearization, and better adapts to complex battlefield environments with stronger nonlinearity.
(2) The three-body adversarial game model is constructed as a Markov decision process suitable for reinforcement learning training. Through analysis of the sign of the action space and design of the reward function in an adversarial form, the combat requirements of evasion and attack can be balanced in both missile and target/defender training.
(3) The missile agent and the target/defender agent are trained with a curriculum learning approach to obtain intelligent game strategies for both attack and defense.
(4) The intelligent attack strategy enables the missile to avoid the defender and hit the target in various battlefield situations and to adapt to the complex environment.
(5) The intelligent active defense strategy enables the less capable target/defender to achieve an effect similar to a network adversarial attack on the missile agent. The defender intercepts the attacking missile before it hits the target.
The paper is structured as follows. Section 2 introduces the TMD three-body engagement model and presents the differential game solutions obtained on the basis of linearization and order reduction. In Section 3, the three-body game is constructed as a Markov decision process with training curricula. In Section 4, the intelligent game strategy for the attacking missile and the intelligent game strategy for the target/defender are solved separately by curriculum-based DRL. Simulation results and discussion are given in Section 5, analyzing the advantages of the proposed approach. Finally, some concluding remarks are given in Section 6.
Figure 1. Three-body confrontation engagement geometry.
where the state vector represents the internal state variables of each vehicle, and the control input is the corresponding guidance command.
where the zero-effort miss between the missile and the target and the zero-effort miss between the defender and the missile appear, each multiplied by the corresponding effective navigation gain. The times-to-go of the missile/target pair and the defender/missile pair are denoted separately and are calculated as the difference between the corresponding interception time and the current time, where the interception time is defined in Equation (6).
We assume that the engagement of the attacking missile M with the defender D precedes the engagement of the attacking missile M with the target T, i.e., the two interception times satisfy this ordering on the timeline. This is because once the missile hits or misses the target, the game is over and the defender no longer needs to continue the engagement.
To derive the expression for the zero-effort miss in Eq. 5, we assume that the dynamics of each vehicle in Eq. 4 is modeled as a first-order system with a given time constant. Choosing appropriate state variables, the equation of motion for the missile engaging the target or the defender can be expressed in linear state-space form (Eq. 7).
In Eq. 8, the angles between the accelerations and the lines of sight can be expressed in terms of the flight path angles and the line-of-sight angles. Using the terminal projection transformation of the linear system, the zero-effort misses of the missile/target pair and the defender/missile pair can be expressed in terms of a coefficient matrix and the state transition matrix (Eq. 9). The derived zero-effort miss will be used for the training of the DRL agents, the details of which are presented in Section 3. Up to this point, the only undetermined quantities left in Eq. 5 are the effective navigation gains, which will become the optimization variables for DRL.
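The terminal-projection form of the zero-effort miss can be illustrated with a short numerical sketch. The snippet below is a minimal example, not the paper's implementation: the state vector, system matrix, and time constant are illustrative assumptions for a linearized engagement with first-order missile lateral dynamics, and the projection matrix simply extracts the predicted relative displacement.

```python
# Minimal sketch (illustrative assumptions) of a terminal-projection zero-effort miss.
import numpy as np
from scipy.linalg import expm

def zero_effort_miss(x, A, t_go, D=None):
    """ZEM = D * Phi(t_go) * x, with Phi the state transition matrix of the
    linearized relative dynamics dx/dt = A x under zero future control."""
    if D is None:
        D = np.zeros((1, A.shape[0]))
        D[0, 0] = 1.0                        # project onto the relative displacement
    Phi = expm(A * t_go)                     # state transition matrix over the time-to-go
    return (D @ Phi @ x).item()

# Illustrative 3-state model [y, y_dot, a_M] with first-order missile lag tau (assumed values).
tau = 0.2
A = np.array([[0.0, 1.0,  0.0],
              [0.0, 0.0, -1.0],              # relative acceleration contribution of a_M
              [0.0, 0.0, -1.0 / tau]])       # first-order lateral acceleration dynamics
x = np.array([50.0, -10.0, 30.0])            # current y, y_dot and achieved missile acceleration
print(zero_effort_miss(x, A, t_go=2.0))
```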
Applying the deep reinforcement learning algorithm to the TMD engagement scenario consists of the following steps. First, the engagement environment is constructed based on the dynamics model, which has been done in Section 2. Next, the environment is formulated as a Markov decision process, which includes action selection, reward shaping, and observation selection. This needs to be carefully designed, taking full account of the dynamics of the missile and the target/defender. Finally, a learning curriculum is arranged to ensure training stability.
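As a structural illustration of these steps, the skeleton below sketches how the engagement could be wrapped as a Markov decision process with the usual reset/step interface. The class name, the contents of the observation, and the choice of the effective navigation gains as the action are assumptions for illustration; the helper methods stand in for the full engagement model of Section 2.

```python
# A minimal structural sketch (not the paper's code) of the TMD engagement as an MDP.
import numpy as np

class TMDEngagementMDP:
    """Markov decision process wrapper around the three-body engagement dynamics."""

    def __init__(self, dt=0.01):
        self.dt = dt

    def reset(self, rng=None):
        rng = rng or np.random.default_rng()
        self.state = self._random_initial_state(rng)      # positions/velocities of T, M, D
        return self._observe()

    def step(self, action):
        # action: effective navigation gains of the differential-game-form guidance law
        accel_cmd = self._guidance_command(action)
        self._integrate(accel_cmd)                         # propagate the nonlinear dynamics by dt
        obs, done = self._observe(), self._is_terminal()
        return obs, self._reward(done), done

    # Placeholders standing in for the engagement model of Section 2:
    def _random_initial_state(self, rng): ...
    def _guidance_command(self, gains): ...
    def _integrate(self, accel_cmd): ...
    def _observe(self): ...                                # e.g. ZEMs, times-to-go, LOS rates
    def _is_terminal(self): ...
    def _reward(self, done): ...                           # dense shaping plus terminal reward
```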
according to the difficulty of the samples. Initially, the highest weights are given to the easy samples, and as the training process continues, the weights of the harder samples are gradually increased. Such a process of dynamically assigning weights to samples is called a curriculum. Curriculum learning can accelerate training and reduce the number of training iterations required to achieve the same performance. In addition, curriculum learning enables the model to obtain better generalization performance, i.e., it allows the model to be trained to a better local optimum. We will start with simple missions in our training, so that the agent can easily obtain the sparse reward at the end of an episode. Then the random range of the initial conditions will be gradually expanded so that the agent can eventually cope with the complex environment.
In the following, we construct the MDP framework for the TMD engagement, consisting of action selection, reward shaping, and observation selection. The formulation requires adequate consideration of the dynamic model properties, as these have a significant impact on the results.
where an intermediate reward or penalty is assigned during an episode and a terminal reward or penalty is assigned near the end of an episode. The intermediate term takes an exponential form that rises sharply as the chosen engagement variable approaches zero, with two parameters regulating the rate of growth of the exponential function. The general idea is to obtain a continuously varying dense reward through the exponential function. However, this dense form results in poor differentiation of the cumulative rewards between different policies and thus affects policy updates. We eventually set the reward to vary significantly only as the variable approaches zero, meaning that it effectively acts as a sparse reward. The differential game formulation used as a basis reduces the difficulty of training and ensures that the agent can complete training with sparse rewards. For both the missile and the target/defender, the engagement variable can be chosen as either the relative distance or the zero-effort miss. Note that using the zero-effort miss in the reward function imposes no additional requirements on the hardware equipment of the guidance system, as it is only used for offline training. The terminal term takes a stair-step form defined by quantities associated with the kill radius.
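A minimal sketch of this reward structure is given below, assuming an exponential dense term and a stair-step terminal term. The constants a and b, the step thresholds, and the step values are illustrative assumptions rather than the values used in the paper.

```python
import numpy as np

def intermediate_reward(z, a=10.0, b=0.05):
    """Dense shaping term: significant only as the engagement variable z
    (relative distance or zero-effort miss) approaches zero."""
    return a * np.exp(-abs(z) / b)

def terminal_reward(miss_distance, kill_radius=5.0):
    """Stair-step terminal term tied to the kill radius (illustrative levels)."""
    if miss_distance <= kill_radius:
        return 100.0        # decisive success
    if miss_distance <= 3.0 * kill_radius:
        return 10.0         # near miss
    return -10.0            # clear miss

def reward(z, done=False, miss_distance=None):
    r = intermediate_reward(z)
    if done and miss_distance is not None:
        r += terminal_reward(miss_distance)
    return r
```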
Further analysis of Eq. 5 reveals that each term in the control law is precisely in the form of the classical proportional navigation guidance law [58]. Thus, each of the effective navigation gains has the meaning given in Table 1. Beyond the effective time, that is, after the engagement between the missile and the defender, the corresponding gains are set to zero.
To further improve the efficiency and stability of the training, we analyze the signs of the effective navigation gains. From the control point of view, the proportional navigation guidance law can be considered a feedback control system that regulates the zero-effort miss to zero. Therefore, only a negative feedback system can be used to avoid divergence, as shown in Figure 2(a). The simple step maneuver is often utilized to analyze the performance of a guidance system, and the conclusion that the miss distance converges to zero with increasing flight time is provided in [58].
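The role of the gain sign can be illustrated with a small sketch. Written in zero-effort-miss form, each term behaves like proportional navigation, and the sign of the effective navigation gain decides whether the corresponding ZEM loop acts as negative feedback (driving the ZEM to zero) or as positive feedback (driving it away from zero, which is what an evading vehicle wants for the pair it is trying to escape). The command form and the numerical values below are illustrative assumptions.

```python
def pn_command(zem, t_go, gain):
    """Proportional-navigation-form acceleration command computed from the zero-effort miss."""
    return gain * zem / max(t_go, 1e-3) ** 2

# gain > 0: the command works to null the ZEM, i.e. a negative feedback loop.
# gain < 0: the command reinforces the ZEM, i.e. the loop diverges, which favors evasion.
print(pn_command(zem=50.0, t_go=2.0, gain=4.0))    # corrective command
print(pn_command(zem=50.0, t_go=2.0, gain=-4.0))   # divergent (evasive) command
```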
Figure 2. Block diagram of the proportional navigation guidance system. (a) Original guidance system; (b) Adjoint guidance system.
The adjoint system of the negative feedback guidance system is then established, as shown in Figure 2(b). For convenience, a change of time variable is made in the adjoint construction, and the adjoint response can be obtained from the convolution integral.
uncertainty and complexity of the environment. If the initial conditions are generated from a completely random range at the beginning, it is difficult to stabilize the training of the agent. The curricula are therefore set up to start training from a smaller range of random initial conditions and to gradually expand the randomness of the initial conditions.
Assume that an initial-condition variable belongs to a given interval. At a given total training step, the random range of the variable is given by Eq. 20, where the scheduling variable regulates the curricula difficulty. The training scheduler is depicted in Figure 3, from which it can be seen that the random range keeps expanding, and by the time the training step reaches the scheduled threshold, the random range essentially coincides with the complete environment.
The growth rate of the range of random initial conditions is related to the difficulty of the environment. For more difficult environments, the scheduling threshold needs to be larger, which involves a trade-off between training stability and training time. For scenarios with difficult initial conditions, the probability distribution of the random numbers can also be designed to adjust the curricula. In the following training, we choose the uniform distribution for initialization.
Figure 3. Curricula training scheduler curve.
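A minimal sketch of such a scheduler is shown below, assuming a linear growth of the sampling interval with the total training step until it covers the full range; the linear schedule, the interval centre, and the minimum width are illustrative assumptions.

```python
import numpy as np

def curriculum_range(step, n_c, x_min, x_max, min_frac=0.1):
    """Current sampling interval of one initial-condition variable: a fraction of the full
    range that grows with the training step and reaches [x_min, x_max] at step n_c."""
    frac = min(1.0, min_frac + (1.0 - min_frac) * step / n_c)
    centre, half = 0.5 * (x_min + x_max), 0.5 * (x_max - x_min)
    return centre - frac * half, centre + frac * half

def sample_initial_condition(step, n_c, x_min, x_max, rng=None):
    rng = rng or np.random.default_rng()
    lo, hi = curriculum_range(step, n_c, x_min, x_max)
    return rng.uniform(lo, hi)                  # uniform distribution for initialization

# Example: the random range of an initial condition widens as training proceeds.
for step in (0, 100_000, 200_000):
    print(step, curriculum_range(step, n_c=200_000, x_min=-30.0, x_max=30.0))
```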
received wide attention include the TD3 algorithm (Twin Delayed Deep Deterministic Policy Gradient) [59], the SAC algorithm (Soft Actor Critic) [60], and the PPO algorithm (Proximal Policy Optimization) [61]. In this study, we adopt the PPO algorithm, which is insensitive to hyperparameters, stable in the training process, and suitable for training in dynamic environments with continuous action spaces.
At any moment, the agents take actions based on the current observation from the sensors and the embedded trained policy, driving the dynamic system to the next state and receiving the corresponding reward. This interaction process continues until the end of the three-body game, which is called an episode. The agent and the environment concurrently give rise to a sequence of observations, actions, and rewards, which is defined as a trajectory.
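The interaction loop just described can be written compactly as follows; env and policy are assumed objects with the interfaces shown, not the paper's implementation.

```python
def collect_trajectory(env, policy, max_steps=10_000):
    """Run one episode and return the (observation, action, reward) sequence."""
    trajectory, obs, done, step = [], env.reset(), False, 0
    while not done and step < max_steps:
        action = policy(obs)                      # a_t from the embedded trained policy
        next_obs, reward, done = env.step(action)
        trajectory.append((obs, action, reward))
        obs, step = next_obs, step + 1
    return trajectory
```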
The goal of the agent is to find the optimal policy that maximizes the expected cumulative discounted reward, which is usually formalized by the state-value function and the state-action value function

V^{\pi}(s_t) = \mathbb{E}_{\pi}\Big[\sum_{k=0}^{\infty}\gamma^{k} r_{t+k}\Big], \qquad Q^{\pi}(s_t,a_t) = \mathbb{E}_{\pi}\Big[\sum_{k=0}^{\infty}\gamma^{k} r_{t+k} \,\Big|\, a_t\Big]. \qquad (21)

The advantage function is also calculated to estimate how advantageous an action is relative to the expected action under the current policy

A^{\pi}(s_t,a_t) = Q^{\pi}(s_t,a_t) - V^{\pi}(s_t). \qquad (22)

In the PPO algorithm, the objective function expected to be maximized is represented as

L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\big[\min\big(r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta),1-\varepsilon,1+\varepsilon)\hat{A}_t\big)\big], \qquad (23)

where \varepsilon is a hyperparameter that restricts the size of policy updates, and the probability ratio is r_t(\theta) = \pi_{\theta}(a_t \mid s_t)/\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t). Equation (23) implies that the advantage estimate is clipped if the probability ratio between the new policy and the old policy falls outside the range [1-\varepsilon, 1+\varepsilon]. The probability ratio measures how different the two policies are. The clipped objective function ensures that excessive policy updates are avoided by clipping the estimated advantage.
To further improve the performance of the algorithm, a value function loss term measuring the estimation accuracy of the critic network and an entropy bonus encouraging exploration are introduced into the surrogate objective

L(\theta) = \mathbb{E}_t\big[L^{\mathrm{CLIP}}(\theta) - c_1 L^{\mathrm{VF}}(\theta) + c_2 S[\pi_{\theta}](s_t)\big], \qquad (24)

where c_1 and c_2 are the corresponding coefficients. The purpose of the algorithm is to update the parameters of the neural network to maximize this surrogate objective with respect to \theta.
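For concreteness, a minimal PyTorch sketch of this surrogate is given below; the coefficient values and the convention of returning the negative objective as a loss to minimize are assumptions, while the clipping value 0.3 follows Table 3.

```python
import torch
import torch.nn.functional as F

def ppo_surrogate_loss(new_logp, old_logp, advantage, value_pred, value_target,
                       entropy, clip_eps=0.3, c1=0.5, c2=0.01):
    """Clipped PPO surrogate of Eqs. (23)-(24), returned as a loss to minimize."""
    ratio = torch.exp(new_logp - old_logp)                              # r_t(theta)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    policy_objective = torch.min(unclipped, clipped).mean()             # L_CLIP
    value_loss = F.mse_loss(value_pred, value_target)                   # L_VF (critic accuracy)
    entropy_bonus = entropy.mean()                                      # S[pi_theta]
    surrogate = policy_objective - c1 * value_loss + c2 * entropy_bonus
    return -surrogate
```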
Figure 4. Reward shaping for the missile agent.
With representative values of the shaping parameters, the first two terms of the reward function can be plotted as in Figure 4. It can be seen that when the defender is far from the missile, the reward function encourages the missile to shorten its distance to the target. As the defender closes in, the overall reward decreases and becomes negative, which means that the penalty dominates, so the missile's mission at this time is mainly to evade the defender. Furthermore, with the chosen terminal parameters, the agent receives a decisive penalty when the defender is close to intercepting the missile.
replaced by the zero-effort miss between the missile and the target. When this quantity is small, the target and the missile are already in the intercept triangle, which is extremely detrimental to the target's survival, so the overall reward decreases to a negative value. When it is relatively large, the target is safe, and the purpose of training is to improve the accuracy of the defender in intercepting the missile. Since the zero-effort miss converges faster than the distance, a larger growth-rate parameter is taken, while the remaining parameters are the same as in the training of the missile agent.
The training algorithm adopts the PPO algorithm with an actor-critic architecture; the relevant hyperparameters and the neural network structure are listed in Table 3.
Table 3. Hyperparameters of the training algorithm and the network structures.

Hyperparameter            Value
Ratio clipping            0.3
Learning rate
Discount rate             0.99
Buffer size
Actor network for M       8-16-16-16-2
Critic network for M      8-16-16-16-1
Actor network for T/D     8-16-16-16-4
Critic network for T/D    8-16-16-16-1
First, the missile agent is trained with the curriculum-based learning approach. As the randomness of the initial conditions increases, the complexity of the environment and the difficulty of the task grow. The target opposing the missile agent takes a constant maneuver of random magnitude, and the defender employs a proportional navigation guidance law. Then, based on the obtained attack strategy of the missile, the missile agent is utilized to train the active defense strategy of the target/defender. That is, from the beginning of its training the target/defender is confronted with an intelligent missile that is able to evade the defender and attack the target. All training and simulations are carried out on a computer with an Intel Xeon Gold 6152 processor and an NVIDIA GeForce RTX 2080 Ti GPU. The environment and algorithm are programmed in Python 3.7, and the neural networks are built using the PyTorch framework. Both the actor network and the critic network for the missile and the target/defender contain three hidden layers with 16 neurons each, and the activation function of the networks is the ReLU function. If the number of multiplication and addition operations in network propagation is taken to characterize the time complexity, the complexity of the actor network for the missile can be calculated as 672, the complexity of the actor network for the target/defender as 704, and the complexity of the two critic networks as 656. These networks have relatively simple architectures, occupy little storage space, and are fast to evaluate (each computation taking about 0.4-0.5 ms on average on a 2.1 GHz CPU), and therefore have the potential to be employed onboard.
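The network sizes quoted above can be reproduced with a few lines of PyTorch; the sketch below builds multilayer perceptrons with the reported layer widths and ReLU activations. The plain linear output head is an assumption, since the exact policy parameterization is not detailed here.

```python
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=(16, 16, 16)):
    """Fully connected network with ReLU hidden layers."""
    layers, prev = [], in_dim
    for width in hidden:
        layers += [nn.Linear(prev, width), nn.ReLU()]
        prev = width
    layers.append(nn.Linear(prev, out_dim))
    return nn.Sequential(*layers)

actor_missile = mlp(8, 2)      # 8-16-16-16-2
critic_missile = mlp(8, 1)     # 8-16-16-16-1
actor_td = mlp(8, 4)           # 8-16-16-16-4 (target/defender)
critic_td = mlp(8, 1)          # 8-16-16-16-1

# The multiply-add count quoted in the text counts the weight matrices only,
# e.g. 8*16 + 16*16 + 16*16 + 16*2 = 672 for the missile actor:
print(sum(p.numel() for name, p in actor_missile.named_parameters() if "weight" in name))
```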
It should be noted that the rise of the cumulative reward curve is not accepted as a criterion for training success, since the evolution of the curricula from easy to difficult means that the agent can already complete easy missions, and thus earns a high return, at the beginning of the training.
In order to verify the effectiveness of the attack strategy of the trained missile agent, we set up different scenarios to assess the agent, with the simulation conditions presented in Table 4. The defender adopts a proportional navigation guidance law with an effective navigation ratio of 4. The target adopts a constant maneuver with random direction and magnitude. The simulation results for different target positions and different defender positions are presented in Figure 5.
Table 4. Initial parameters for training.
The relative positions of the target and the defender cover most typical scenarios, so the simulation results are representative. Regardless of whether the missile faces a target at high altitude, at low altitude, or at a comparable altitude, and regardless of the direction from which the defender intercepts, the missile can avoid the defender and eventually hit the target. The missile with the intelligent attack strategy aims at the target in the primary direction, but maneuvers rapidly when the defender threatens it, causing the defender to fail to intercept the missile.
missile fails. As can be seen from the zero-effort miss in Figure 8(b), the defender's zero-effort miss with respect to the missile increases at the last moment due to the missile's maneuver. The missile then rapidly changes its acceleration direction after evading the defender, thus compensating for the deviation in aiming at the target caused by the previous maneuver. The zero-effort miss of the missile with respect to the target eventually converges to zero, although it fluctuates due to the missile's maneuvers.
Figure 6. Engagement process in Figure 5(a). (a) The overload of each vehicle; (b) The zero-effort misses.
Figure 8. Block diagram of proportional navigation guidance system. (a) Original guidance system; (b) Adjoint guidance system.
Table 6. Comparison of the C-DRL approach and the CLQDG guidance law under observation noise.

Observation noise        C-DRL                                  CLQDG
                                  3rd order            3rd order            3rd order
5%                   98.4%   87.2%      93.5%   82.2%      70.0%   67.3%
15%                  94.5%   86.6%      94.0%   81.0%      68.0%   66.7%
25%                  95.0%   87.0%      92.5%   79.8%      53.3%   50.1%
35%                  93.4%   85.7%      93.7%   79.6%      38.2%   37.3%
In addition, Table 6 compares the curriculum-based DRL approach (C-DRL) with the cooperative linear quadratic differential game (CLQDG) guidance law, a classical guidance law in the TMD scenario [55]. The gains of the CLQDG guidance law do not involve the response time, so we only analyze the effect of input noise and system order. Since ideal dynamics is assumed in the derivation of the gains, the effect of the order of the system is more pronounced. In the face of input noise, the performance of the C-DRL approach decreases only slightly, and the robustness of CLQDG is not as strong as that of the C-DRL approach. In addition, for complex three-dimensional multi-body game problems, the differential game approach may fail to yield an analytic guidance law, so the reinforcement learning approach has greater potential for development.
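Robustness figures of this kind can be obtained with a simple Monte-Carlo evaluation loop such as the sketch below; env, policy, the multiplicative Gaussian noise model, and the success criterion (positive terminal reward) are assumptions for illustration.

```python
import numpy as np

def success_rate(env, policy, noise_level, episodes=1000, rng=None):
    """Fraction of episodes ending successfully when observations carry multiplicative noise."""
    rng, successes = rng or np.random.default_rng(), 0
    for _ in range(episodes):
        obs, done, reward = env.reset(), False, 0.0
        while not done:
            noisy_obs = obs * (1.0 + noise_level * rng.standard_normal(np.shape(obs)))
            obs, reward, done = env.step(policy(noisy_obs))
        successes += int(reward > 0.0)      # positive terminal reward counted as success
    return successes / episodes

# e.g. success_rate(env, trained_missile_policy, noise_level=0.15)
```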
6. Conclusions
For the target-missile-defender three-body offensive and defensive confrontation scenario, intelligent game strategies using curriculum-based deep reinforcement learning are proposed, including an attack strategy for the attacking missile and an active defense strategy for the target/defender. The results of the differential game are combined with deep reinforcement learning algorithms to give the agent training a clearer direction and to adapt better to complex environments with stronger nonlinearity. The three-body adversarial game is constructed as an MDP suitable for reinforcement learning training by analyzing the sign of the action space and designing the reward function in an adversarial form. The missile agent and the target/defender agent are trained with a curriculum learning approach to obtain the intelligent game strategies. Through simulation verification we can draw the following conclusions.
(1) In simulations that validate the attack strategy, the missile employing the curriculum-based DRL-trained attack strategy is able to avoid the defender and hit the target in various situations.
(2) In simulations that validate the active defense strategy, the less capable target/defender employing the curriculum-based DRL-trained defense strategy is able to achieve an effect similar to a network adversarial attack against the missile agent. The defender intercepts the missile before it hits the target.
(3) The intelligent game strategies are able to maintain robustness in the face of disturbances from input noise and modeling inaccuracies.
In future research, three-dimensional scenarios with multiple attacking missiles, multiple defenders, and multiple targets will be considered. As the battlefield environment becomes more complicated, traditional differential game and weapon-target assignment methods will show more obvious limitations, while the intelligent game strategy based on DRL has better adaptability to complex scenarios. Motion analysis in three dimensions can be conducted using vector guidance laws, or decomposed into two perpendicular channels and solved in the plane, as has been proven possible in previous research. Combined with DRL, more complex multi-body game problems are expected to be solved. Technologies such as self-play and adversarial attack will also be applied to the generation and analysis of game strategies. In addition, considering the difficulty of obtaining battlefield observations, the training algorithm needs to be improved to adapt to scenarios with imperfect information.
Author Contributions: The contributions of the authors are the following: Conceptualization, W.C. and X.G.; methodology, X.G. and Z.C.; validation, Z.C.; formal analysis, X.G.; investigation, X.G.; resources, X.G. and Z.C.; writing—original draft preparation, X.G.; writing—review and editing, W.C. and Z.C.; visualization, X.G.; supervision, W.C.; project administration, W.C.; funding acquisition, Z.C. All authors have read and agreed to the published version of the manuscript.
Funding: This research was funded by the China Postdoctoral Science Foundation (Grant No. 2021M700321).
Data Availability Statement: All data used during the study appear in the submitted article.
Acknowledgments: The study described in this paper was supported by the China Postdoctoral Science Foundation (Grant No. 2021M700321). The authors fully appreciate the financial support.
Conflicts of Interest: The authors declare no conflict of interest.
References
698 1. Li, C.; Wang, J.; Huang, P. Optimal Cooperative Line-of-Sight Guidance for Defending a Guided Missile. Aerospace 2022, 9,
699 232, doi:10.3390/aerospace9050232.
700 2. Li, Q.; Yan, T.; Gao, M.; Fan, Y.; Yan, J. Optimal Cooperative Guidance Strategies for Aircraft Defense with Impact Angle
701 Constraints. Aerospace 2022, 9, 710, doi:10.3390/aerospace9110710.
702 3. Liang, H.; Li, Z.; Wu, J.; Zheng, Y.; Chu, H.; Wang, J. Optimal Guidance Laws for a Hypersonic Multiplayer Pursuit-Evasion
703 Game Based on a Differential Game Strategy. Aerospace 2022, 9, 97, doi:10.3390/aerospace9020097.
704 4. Shi, H.; Chen, Z.; Zhu, J.; Kuang, M. Model predictive guidance for active aircraft protection from a homing missile. IET Con-
705 trol Theory & Applications 2022, 16, 208–218, doi:10.1049/cth2.12218.
706 5. Kumar, S.R.; Mukherjee, D. Cooperative Active Aircraft Protection Guidance Using Line-of-Sight Approach. IEEE Transac-
707 tions on Aerospace and Electronic Systems 2021, 57, 957–967, doi:10.1109/TAES.2020.3046328.
708 6. Yan, M.; Yang, R.; Zhang, Y.; Yue, L.; Hu, D. A hierarchical reinforcement learning method for missile evasion and guidance.
709 Sci. Rep. 2022, 12, 18888, doi:10.1038/s41598-022-21756-6.
710 7. Liang, H.; Wang, J.; Wang, Y.; Wang, L.; Liu, P. Optimal guidance against active defense ballistic missiles via differential
711 game strategies. Chin. J. Aeronaut. 2020, 33, 978–989, doi:10.1016/j.cja.2019.12.009.
712 8. Ratnoo, A.; Shima, T. Line-of-Sight Interceptor Guidance for Defending an Aircraft. J. Guid. Control Dyn. 2011, 34, 522–532,
713 doi:10.2514/1.50572.
714 9. Yamasaki, T.; Balakrishnan, S. Triangle Intercept Guidance for Aerial Defense. In AIAA Guidance, Navigation, and Control Con-
715 ference; American Institute of Aeronautics and Astronautics, 2010.
716 10. Yamasaki, T.; Balakrishnan, S.N.; Takano, H. Modified Command to Line-of-Sight Intercept Guidance for Aircraft Defense. J.
717 Guid. Control Dyn. 2013, 36, 898–902, doi:10.2514/1.58566.
718 11. Yamasaki, T.; Balakrishnan, S.N. Intercept Guidance for Cooperative Aircraft Defense against a Guided Missile. IFAC Proceed-
719 ings Volumes 2010, 43, 118–123, doi:10.3182/20100906-5-JP-2022.00021.
720 12. Liu, S.; Wang, Y.; Li, Y.; Yan, B.; Zhang, T. Cooperative guidance for active defence based on line-of-sight constraint under a
721 low-speed ratio. The Aeronautical Journal 2022, 1–19, doi:10.1017/aer.2022.62.
722 13. Shaferman, V.; Oshman, Y. Stochastic Cooperative Interception Using Information Sharing Based on Engagement Staggering.
723 J. Guid. Control Dyn. 2016, 39, 2127–2141, doi:10.2514/1.G000437.
724 14. Prokopov, O.; Shima, T. Linear Quadratic Optimal Cooperative Strategies for Active Aircraft Protection. J. Guid. Control Dyn.
725 2013, 36, 753–764, doi:10.2514/1.58531.
726 15. Shima, T. Optimal Cooperative Pursuit and Evasion Strategies Against a Homing Missile. J. Guid. Control Dyn. 2011, 34, 414–
727 425, doi:10.2514/1.51765.
728 16. Alkaher, D.; Moshaiov, A. Game-Based Safe Aircraft Navigation in the Presence of Energy-Bleeding Coasting Missile. J. Guid.
729 Control Dyn. 2016, 39, 1539–1550, doi:10.2514/1.G001676.
730 17. Liu, F.; Dong, X.; Li, Q.; Ren, Z. Cooperative differential games guidance laws for multiple attackers against an active defense
731 target. Chinese Journal of Aeronautics 2022, 35, 374–389, doi:10.1016/j.cja.2021.07.033.
732 18. Chen, W.; Cheng, C.; Jin, B.; Xu, Z. Research on differential game guidance law for intercepting hypersonic vehicles. In 6th In-
733 ternational Workshop on Advanced Algorithms and Control Engineering (IWAACE 2022). 6th International Workshop on Ad-
734 vanced Algorithms and Control Engineering (IWAACE 2022), Qingdao, China, 2022/7/8 - 2022/7/10; Qiu, D., Ye, X., Sun, N.,
735 Eds.; SPIE, 2022; p 94, ISBN 9781510657755.
736 19. Rubinsky, S.; Gutman, S. Three-Player Pursuit and Evasion Conflict. J. Guid. Control Dyn. 2014, 37, 98–110,
737 doi:10.2514/1.61832.
738 20. Rubinsky, S.; Gutman, S. Vector Guidance Approach to Three-Player Conflict in Exoatmospheric Interception. J. Guid. Control
739 Dyn. 2015, 38, 2270–2286, doi:10.2514/1.G000942.
740 21. Garcia, E.; Casbeer, D.W.; Pachter, M. Pursuit in the Presence of a Defender. Dyn Games Appl 2019, 9, 652–670, doi:10.1007/
741 s13235-018-0271-9.
742 22. Garcia, E.; Casbeer, D.W.; Pachter, M. The Complete Differential Game of Active Target Defense. J Optim Theory Appl 2021,
743 191, 675–699, doi:10.1007/s10957-021-01816-z.
744 23. Garcia, E.; Casbeer, D.W.; Fuchs, Z.E.; Pachter, M. Cooperative Missile Guidance for Active Defense of Air Vehicles. IEEE
745 Trans. Aerosp. Electron. Syst. 2018, 54, 706–721, doi:10.1109/TAES.2017.2764269.
746 24. Garcia, E.; Casbeer, D.W.; Pachter, M. Design and Analysis of State-Feedback Optimal Strategies for the Differential Game of
747 Active Defense. IEEE Trans. Autom. Control 2018, 64, 1, doi:10.1109/TAC.2018.2828088.
748 25. Liang, L.; Deng, F.; Lu, M.; Chen, J. Analysis of Role Switch for Cooperative Target Defense Differential Game. IEEE Trans.
749 Autom. Control 2021, 66, 902–909, doi:10.1109/TAC.2020.2987701.
750 26. Liang, L.; Deng, F.; Peng, Z.; Li, X.; Zha, W. A differential game for cooperative target defense. Automatica 2019, 102, 58–71,
751 doi:10.1016/j.automatica.2018.12.034.
752 27. Qi, N.; Sun, Q.; Zhao, J. Evasion and pursuit guidance law against defended target. Chin. J. Aeronaut. 2017, 30, 1958–1973,
753 doi:10.1016/j.cja.2017.06.015.
754 28. Shaferman, V.; Shima, T. Cooperative Multiple-Model Adaptive Guidance for an Aircraft Defending Missile. J. Guid. Control
755 Dyn. 2010, 33, 1801–1813, doi:10.2514/1.49515.
756 29. Shaferman, V.; Shima, T. Cooperative Differential Games Guidance Laws for Imposing a Relative Intercept Angle. J. Guid.
757 Control Dyn. 2017, 40, 2465–2480, doi:10.2514/1.G002594.
758 30. Saurav, A.; Kumar, S.R.; Maity, A. Cooperative Guidance Strategies for Aircraft Defense with Impact Angle Constraints. In
759 AIAA Scitech 2019 Forum. AIAA Scitech 2019 Forum, San Diego, California, 2019/01/07; American Institute of Aeronautics and
760 Astronautics: Reston, Virginia, 2019, ISBN 978-1-62410-578-4.
761 31. Liang, H.; Wang, J.; Liu, J.; Liu, P. Guidance strategies for interceptor against active defense spacecraft in two-on-two engage-
762 ment. Aerosp. Sci. Technol. 2020, 96, 105529, doi:10.1016/j.ast.2019.105529.
763 32. Shalumov, V.; Shima, T. Weapon–Target-Allocation Strategies in Multiagent Target–Missile–Defender Engagement. J. Guid.
764 Control Dyn. 2017, 40, 2452–2464, doi:10.2514/1.G002598.
765 33. Sun, Q.; Qi, N.; Xiao, L.; Lin, H. Differential game strategy in three-player evasion and pursuit scenarios. J. Syst. Eng. Electron.
766 2018, 29, 352–366, doi:10.21629/JSEE.2018.02.16.
767 34. Sun, Q.; Zhang, C.; Liu, N.; Zhou, W.; Qi, N. Guidance laws for attacking defended target. Chin. J. Aeronaut. 2019, 32, 2337–
768 2353, doi:10.1016/j.cja.2019.05.011.
769 35. Chai, R.; Tsourdos, A.; Savvaris, A.; Chai, S.; Xia, Y.; Philip Chen, C.L. Review of advanced guidance and control algorithms
770 for space/aerospace vehicles. Progress in Aerospace Sciences 2021, 122, 100696, doi:10.1016/j.paerosci.2021.100696.
771 36. Liu, Y.; Wang, H.; Wu, T.; Lun, Y.; Fan, J.; Wu, J. Attitude control for hypersonic reentry vehicles: An efficient deep reinforce-
772 ment learning method. Appl. Soft Comput. 2022, 123, 108865, doi:10.1016/j.asoc.2022.108865.
773 37. Gaudet, B.; Furfaro, R.; Linares, R. Reinforcement learning for angle-only intercept guidance of maneuvering targets. Aerosp.
774 Sci. Technol. 2020, 99, 1–10, doi:10.1016/j.ast.2020.105746.
775 38. He, S.; Shin, H.-S.; Tsourdos, A. Computational Missile Guidance: A Deep Reinforcement Learning Approach. Journal of Aero-
776 space Information Systems 2021, 18, 571–582, doi:10.2514/1.I010970.
777 39. Furfaro, R.; Scorsoglio, A.; Linares, R.; Massari, M. Adaptive generalized ZEM-ZEV feedback guidance for planetary landing
778 via a deep reinforcement learning approach. Acta Astronaut. 2020, 171, 156–171, doi:10.1016/j.actaastro.2020.02.051.
779 40. Gaudet, B.; Linares, R.; Furfaro, R. Adaptive guidance and integrated navigation with reinforcement meta-learning. Acta As-
780 tronaut. 2020, 169, 180–190, doi:10.1016/j.actaastro.2020.01.007.
781 41. He, L.; Aouf, N.; Song, B. Explainable Deep Reinforcement Learning for UAV autonomous path planning. Aerosp. Sci. Technol.
782 2021, 118, 107052, doi:10.1016/j.ast.2021.107052.
783 42. Wang, Y.; Dong, L.; Sun, C. Cooperative control for multi-player pursuit-evasion games with reinforcement learning. Neuro-
784 computing 2020, 412, 101–114, doi:10.1016/j.neucom.2020.06.031.
785 43. English, J.T.; Wilhelm, J.P. Defender-Aware Attacking Guidance Policy for the Target–Attacker–Defender Differential Game.
786 Journal of Aerospace Information Systems 2021, 18, 366–376, doi:10.2514/1.I010877.
787 44. Shalumov, V. Cooperative online Guide-Launch-Guide policy in a target-missile-defender engagement using deep reinforce-
788 ment learning. Aerosp. Sci. Technol. 2020, 104, 105996, doi:10.1016/j.ast.2020.105996.
789 45. Qiu, X.; Gao, C.; Jing, W. Maneuvering penetration strategies of ballistic missiles based on deep reinforcement learning. Proc.
790 Inst. Mech. Eng., Part G: J. Aerosp. Eng. 2022, 095441002210883, doi:10.1177/09544100221088361.
791 46. Radac, M.-B.; Lala, T. Robust Control of Unknown Observable Nonlinear Systems Solved as a Zero-Sum Game. IEEE Access
792 2020, 8, 214153–214165, doi:10.1109/ACCESS.2020.3040185.
793 47. Zhao, M.; Wang, D.; Ha, M.; Qiao, J. Evolving and Incremental Value Iteration Schemes for Nonlinear Discrete-Time Zero-
794 Sum Games. IEEE Trans. Cybern. 2022, PP, doi:10.1109/TCYB.2022.3198078.
795 48. Xue, S.; Luo, B.; Liu, D. Event-Triggered Adaptive Dynamic Programming for Zero-Sum Game of Partially Unknown Contin-
796 uous-Time Nonlinear Systems. IEEE Trans. Syst. Man Cybern, Syst. 2020, 50, 3189–3199, doi:10.1109/TSMC.2018.2852810.
797 49. Wei, Q.; Liu, D.; Lin, Q.; Song, R. Adaptive Dynamic Programming for Discrete-Time Zero-Sum Games. IEEE Trans. Neural
798 Networks Learn. Syst. 2018, 29, 957–969, doi:10.1109/TNNLS.2016.2638863.
799 50. Zhu, Y.; Zhao, D.; Li, X. Iterative Adaptive Dynamic Programming for Solving Unknown Nonlinear Zero-Sum Game Based
800 on Online Data. IEEE Trans. Neural Networks Learn. Syst. 2017, 28, 714–725, doi:10.1109/TNNLS.2016.2561300.
801 51. Jiang, H.; Zhang, H.; Han, J.; Zhang, K. Iterative adaptive dynamic programming methods with neural network implementa-
802 tion for multi-player zero-sum games. Neurocomputing 2018, 307, 54–60, doi:10.1016/j.neucom.2018.04.005.
803 52. Wang, W.; Chen, X.; Du, J. Model-free finite-horizon optimal control of discrete-time two-player zero-sum games. Interna-
804 tional Journal of Systems Science 2023, 54, 167–179, doi:10.1080/00207721.2022.2111236.
805 53. Zhao, W.; Queralta, J.P.; Westerlund, T. Sim-to-Real Transfer in Deep Reinforcement Learning for Robotics: a Survey. In 2020
806 IEEE Symposium Series on Computational Intelligence (SSCI). 2020 IEEE Symposium Series on Computational Intelligence
807 (SSCI), Canberra, ACT, Australia, 2020/12/1 - 2020/12/4; IEEE, 2020; pp 737–744, ISBN 978-1-7281-2547-3.
808 54. Bengio, Y.; Louradour, J.; Collobert, R.; Weston, J. Curriculum learning. In Proceedings of the 26th Annual International Confer-
809 ence on Machine Learning - ICML '09. the 26th Annual International Conference, Montreal, Quebec, Canada, 2009/6/14 -
810 2009/6/18; Danyluk, A., Bottou, L., Littman, M., Eds.; ACM Press: New York, New York, USA, 2009; pp 1–8, ISBN
811 9781605585161.
812 55. Perelman, A.; Shima, T.; Rusnak, I. Cooperative Differential Games Strategies for Active Aircraft Protection from a Homing
813 Missile. J. Guid. Control Dyn. 2011, 34, 761–773, doi:10.2514/1.51611.
814 56. Wang, X.; Chen, Y.; Zhu, W. A Survey on Curriculum Learning. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 4555–4576,
815 doi:10.1109/TPAMI.2021.3069908.
816 57. Soviany, P.; Ionescu, R.T.; Rota, P.; Sebe, N. Curriculum Learning: A Survey. Int. J. Comput. Vis. 2022, 130, 1526–1565,
817 doi:10.1007/s11263-022-01611-x.
818 58. Zarchan, P. Tactical and strategic missile guidance, 6th ed.; American Institute of Aeronautics and Astronautics: Reston, Va.,
819 2012, ISBN 978-1-60086-894-8.
820 59. Fujimoto, S.; van Hoof, H.; Meger, D. Addressing Function Approximation Error in Actor-Critic Methods. In Proceedings of the
821 35th International Conference on Machine Learning; Dy, J., Krause, A., Eds.; PMLR, 2018; pp 1587–1596.
822 60. Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning
823 with a Stochastic Actor. In Proceedings of the 35th International Conference on Machine Learning; Dy, J., Krause, A., Eds.; PMLR,
824 2018; pp 1861–1870.
825 61. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms, 2017. Available online:
826 https://arxiv.org/abs/1707.06347v2.
827 62. Liu, F.; Dong, X.; Li, Q.; Ren, Z. Robust multi-agent differential games with application to cooperative guidance. Aerosp. Sci.
828 Technol. 2021, 111, 106568, doi:10.1016/j.ast.2021.106568.
829 63. Wei, X.; Yang, J. Optimal Strategies for Multiple Unmanned Aerial Vehicles in a Pursuit/Evasion Differential Game. J. Guid.
830 Control Dyn. 2018, 41, 1799–1806, doi:10.2514/1.G003480.
831 64. Shaferman, V.; Shima, T. Cooperative Optimal Guidance Laws for Imposing a Relative Intercept Angle. J. Guid. Control Dyn.
832 2015, 38, 1395–1408, doi:10.2514/1.G000568.
833 65. Ilahi, I.; Usama, M.; Qadir, J.; Janjua, M.U.; Al-Fuqaha, A.; Hoang, D.T.; Niyato, D. Challenges and Countermeasures for Ad-
834 versarial Attacks on Deep Reinforcement Learning. IEEE Trans. Artif. Intell. 2022, 3, 90–109, doi:10.1109/TAI.2021.3111139.
835 66. Qiu, S.; Liu, Q.; Zhou, S.; Wu, C. Review of Artificial Intelligence Adversarial Attack and Defense Technologies. Appl. Sci.
836 (Basel) 2019, 9, 909, doi:10.3390/app9050909.