Article

Intelligent Game Strategies in Target-Missile-Defender Engagement using Curriculum-based Deep Reinforcement Learning

Xiaopeng Gong 1, Wanchun Chen 1 and Zhongyuan Chen 1,*

1 School of Astronautics, Beihang University, Beijing 100191, China
* Correspondence: zhongyuan@buaa.edu.cn

Abstract: Aiming at the attack and defense game problem in the target-missile-defender three-body confrontation scenario, intelligent game strategies based on deep reinforcement learning are proposed, including an attack strategy applicable to the attacking missile and an active defense strategy applicable to the target/defender. First, building on classical three-body adversarial research, the reinforcement learning algorithm is introduced to improve the purposefulness of the training. The action space and the reward and punishment conditions of both sides of the confrontation are considered in the reward function design. Through analysis of the sign of the action space and design of the reward function in an adversarial form, the combat requirements can be satisfied in both missile and target/defender training. Then, a curriculum-based deep reinforcement learning algorithm is applied to train the agents, and a convergent game strategy is obtained. The simulation results show that the attack strategy allows the missile to maneuver according to the battlefield situation and to hit the target after evading the defender. The active defense strategy enables the less capable target/defender to achieve an effect on the missile agent similar to a network adversarial attack, shielding the target from a missile with superior maneuverability on the battlefield.

Keywords: target-missile-defender engagement; three-body game; curriculum learning; deep reinforcement learning; intelligent game; active defense

1. Introduction

In recent years, with the development of weapons technology, offensive and defensive confrontation scenarios have become increasingly complex. The traditional one-to-one game problem also struggles to keep up with the trend toward battlefield intelligence. In various new studies, both sides of the confrontation continuously


31 adopt new game strategies so as to gain battlefield advantages. Among them, the target-
32 missile-defender (TMD) three-body engagement triggered by active target defense has
33 attracted increasing research interest [1–7]. In a typical three-body confrontation
34 scenario, three types of vehicles are involved: the target (usually a high-value vehicle
35 such as an aircraft or ballistic missile), an attacking missile to attack the target, and a
36 defender missile to intercept the attacking missile. This combat scenario breaks the
37 traditional pursuit-evasion model with greater complexity, and provides more
38 possibilities for battlefield games.
39 The early classical studies of the three-body confrontation problem mainly started
40 from the spatial-geometric relationship. The researchers achieved the goal of defending
41 the target by designing the spatial position of the defender with the target and the
42 attacking missile (e.g., in the middle of the target and the missile). From the line-of-sight
43 (LOS) guidance perspective, a guidance strategy for a defender guarding a target was
44 investigated that enables the defender to intercept an attacking missile at a speed and
45 maneuverability disadvantage [8]. Triangle intercept guidance is also an ingenious
46 guidance law based on the idea of LOS command guidance [9]. In order to avoid the
47 degradation of system performance or the need for additional high-resolution radar
48 assistance due to reduced angular resolution at longer distances, a simpler gain form of
49 the LOS angular rate was derived by optimal control, reducing the capability
50 requirements of the sensing equipment [10,11]. Nonlinear control approaches, such as
51 sliding mode control, can also achieve the control of LOS rotation [12].
The more prevalent approach to the three-body problem is based on optimal control or differential games. The difference between the two is that a guidance law based on optimal control theory needs to know the opponent's control strategy in advance. Although the reliance on a priori information for one-sided optimization can
56 be reduced by sharing information between the target and the defender [13], there are
57 problems such as difficulties in applying numerical optimization algorithms online. In
58 contrast, differential game has received more widespread attention as it does not require
59 additional assumptions about the opponent's strategy [14,15]. The differential game can
60 obtain the game strategy of the two opponents by finding the saddle point solution, and
61 under the condition of accurate modeling, it can guarantee the optimality of the strategy
against the opponent's arbitrary maneuver [16–18]. Considering the drawback that the commands of the linear quadratic differential game guidance law may exceed the control bounds, the bounded differential game has been proposed and verified in the two-dimensional plane and in three-dimensional space [19,20]. The differential game approach can also be applied to
66 analyze the capture and escape regions, and the Hamilton–Jacobi–Isaacs equation can be
67 solved to demonstrate the consistency of the geometric approach with the optimal
68 control approach [21–24]. Based on the analysis of the capture radius, the game can be
69 divided into different stages and the corresponding control strategies can be proposed,
70 and the conditions of stage switching are analyzed [25,26]. In addition, in order to be


71 closer to the actual battlefield environment, recent studies have considered the existence
72 of strong constraint limits on capability boundaries [27], state estimation under
73 imperfect information through Kalman filtering [28], the existence of relative intercept
74 angle constraints on attacking requirements [17,29,30], cooperative multi-vehicle against
75 an active defense target [17,31], weapon-target-allocation strategies [32], and so on.
The existing studies generally rely on linearization and order reduction of the model to derive a guidance law that satisfies certain constraints and
78 performance requirements. To simplify the derivation, the vehicles are often assumed to
79 possess ideal dynamics [33,34]. However, as participating vehicles adopt more advanced
80 game strategies, the battlefield becomes more complex, and the linearization suffers
81 from significant distortion under intense maneuvering confrontations.
82 Deep reinforcement learning (DRL) developed in recent years has good adaptability
83 to complex nonlinear scenarios and shows strong potential in the aerospace field [35],
84 such as applying DRL to attitude control of hypersonic vehicles [36], design of missile
85 guidance laws [37,38], asteroid landing [39,40], vehicle path planning [41], and other
86 issues. In addition, there have been many studies applying DRL to the pursuit-evasion
game or TMD engagement. The problem of cooperative capture of an advanced evader by multiple pursuers was studied in [42] using DRL; such a complex, uncertain environment is difficult to handle with differential game or optimal control methods. In [43], the
90 researchers applied reinforcement learning algorithms to a particle environment where
91 the attacker was able to evade the defender and eventually capture the target, showing
92 better performance than traditional guidance algorithms. The agents in [42] and [43] all
have ideal dynamics with fewer constraints relative to real vehicles. In [44], from the perspective of the target, reinforcement learning was applied to study the timing of the target launching its defender, which has the potential to be solved online. Deep reinforcement
96 learning was also utilized for the ballistic missile maneuvering penetration and attacking
97 stationary targets, which can also be considered as a three-body problem [6,45]. In
98 addition, adaptive dynamic programming, which is closely related to DRL, has also
attracted extensive interest in intelligent adversarial games [46–50]. However, the system models studied so far are relatively simple, and few studies are applicable to complex continuous dynamic systems with multiple vehicles [51,52].
102 Motivated by the previous discussion, we apply DRL algorithms to a three-body
103 engagement, and obtain intelligent game strategies for both offensive and defensive
104 confrontations, so that both attacking missile and target/defender can combine evasion
105 and interception performance. The strategy for the attacking missile ensures that the
106 missile avoids the defender and hits the target; the strategy for the target/defender
107 ensures that the defender intercepts the missile before it threatens the target. In addition,
108 the DRL-based approach is highly adaptable to nonlinear scenarios, and thus has
109 outstanding advantages in further solving more complex multi-body adversarial
110 problems in the future. However, there also exists a gap between the simulation


111 environment and the real world when applying DRL approaches. Simulation
112 environments can improve sampling efficiency and alleviate security issues, but
113 difficulties caused by the reality gap are encountered when transferring agent policies to
114 real devices. To address this issue, research applying DRL approaches to the aerospace
115 domain should focus on the following aspects. On the one hand, sim-to-real (Sim2Real)
116 research is used to close the reality gap and thus achieve more effective strategy transfer.
117 The main methods currently being utilized for Sim2Real transfer in DRL include domain
118 randomization, domain adaptation, imitation learning, meta-learning and knowledge
119 distillation [53]. On the other hand, in the simulation phase, the robustness and
120 generalization of the proposed methods should be fully verified. In the practical
121 application phase, the hardware-in-the-loop simulation should be conducted to
122 gradually improve the reliability of applying the proposed method to real devices.
To help the DRL algorithm converge more stably, we introduce curriculum learning into the agent training. The concept of curriculum learning was first introduced at the International Conference on Machine Learning (ICML) in 2009 and attracted great attention in the field of machine learning [54]. In the following decade, numerous studies on curriculum learning and self-paced learning have been proposed.
129 The main contributions of this paper are summarized as follows.
(1) Combining the findings of differential games in the traditional three-body game with DRL algorithms gives agent training a clearer direction, avoids inaccuracies due to model linearization, and adapts better to complex, strongly nonlinear battlefield environments.
134 (2) The three-body adversarial game model is constructed as a Markov Decision
135 Process suitable for reinforcement learning training. Through analysis of the sign of the
136 action space and design of the reward function in the adversarial form, the combat
137 requirements of evasion and attack can be balanced in both missile and target/defender
138 training.
139 (3) The missile agent and target/defender agent are trained in a curriculum learning
140 approach to obtain intelligent game strategies for both attack and defense.
141 (4) The intelligent attack strategy enables the missile to avoid the defender and hit
142 the target in various battlefield situations and adapt to the complex environment.
143 (5) The intelligent active defense strategy enables the less capable target/defender to
144 achieve the effect similar to network adversarial attack on the missile agent. The
145 defender intercepts the attacking missile before it hits the target.
146 The paper is structured as follows. Section 2 introduces the TMD three-body
147 engagement model and presents the differential game solutions solved on the basis of
148 linearization and order reduction. In Section 3, the three-body game is constructed as a
149 Markov Decision Process with training curricula. In Section 4, the intelligent game
150 strategy for attacking missile and the intelligent game strategy for target/defender are


151 solved separately by curriculum-based DRL. Simulation results and discussion are given
152 in Section 5, analyzing the advantages of the proposed approach. Finally, some final
153 remarks are given as conclusion in Section 6.

154 2. Dynamic Model of TMD Engagement

155 2.1. Nonlinear Engagement Model


156 The TMD three-body engagement involves an offensive and defensive
157 confrontation, including an attacking missile (M) on one side, and a target (T) and a
158 defender (D) on the other side. The mission of the missile is to attack the target, but the
159 defender will be launched by the target or other platforms to intercept the missile, so the
160 missile is required to evade the defender by maneuvering before attempting to hit the
161 target. The mission of the target/defender is the opposite. In general, the target is weak
162 in maneuvering and has difficulty avoiding being hit by the missile through traditional
maneuvering strategies, so the target adopts an active defense strategy of firing a defender to intercept the missile in order to survive on the battlefield. The engagement geometry of the
165 TMD three-body confrontation in the inertial coordinate system is shown in
166 Figure 1.

Figure 1. Three-body confrontation engagement geometry.

169 As shown in Figure 1, the nonlinear engagement model of missile-target and


170 missile-defender can be represented as


(2)

where M stands for the missile and the other subscript represents the target or defender, i.e., T or D.
174 The rate of change of flight path angle can be expressed as

(3)


176 The dynamics model of each vehicle can be represented by a linear equation of
177 arbitrary order as

(4)

where the internal state variables of each vehicle and the corresponding control input form the state and input of the linear model.
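For reference, planar engagement relations of the kind described by Eqs. (2)-(3) are commonly written in the following form; the symbols r, λ, γ, V, and a are introduced here purely for illustration, and sign conventions may differ from those used by the authors:

\dot{r}_{Mi} = -V_M \cos(\gamma_M - \lambda_{Mi}) - V_i \cos(\gamma_i + \lambda_{Mi}), \qquad i \in \{T, D\}

\dot{\lambda}_{Mi} = \frac{-V_M \sin(\gamma_M - \lambda_{Mi}) + V_i \sin(\gamma_i + \lambda_{Mi})}{r_{Mi}}

\dot{\gamma}_j = \frac{a_j}{V_j}, \qquad j \in \{M, T, D\}

Here r_{Mi} and λ_{Mi} are the range and line-of-sight angle between the missile and vehicle i, while V_j, γ_j, and a_j are the speed, flight path angle, and lateral acceleration of vehicle j.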

2.2. Linearization and Zero-Effort-Miss


181 The three-body engagement generally occurs in the end-game phase, when the
182 relative speeds of the offensive and defensive confrontations are large, the engagement
183 time is short, and the speed of each vehicle can be approximated as a constant.
184 According to [55], by linearizing the engagement geometry near the initial lines of sight
185 and applying differential game theory, the optimal control of each vehicle under a
186 quadratic cost function can be found as

(5)

where the two zero-effort-miss variables correspond to the missile/target pair and the defender/missile pair, respectively, and the coefficients multiplying them represent the effective navigation gains. The time-to-go of the missile/target pair and of the defender/missile pair are denoted separately, and each is calculated as the difference between the corresponding interception time and the current time, where the interception time is defined as


(6)
We assume that the engagement of the attacking missile M with the defender D precedes the engagement of the attacking missile M with the target T, i.e., the defender/missile interception time is earlier than the missile/target interception time. This is because once the missile hits or misses the target, the game is over and the defender no longer needs to continue the engagement.
To derive the expression for the zero-effort-miss in Eq. (5), we assume that the vehicle dynamics in Eq. (4) are modeled as a first-order system with a fixed time constant. Choosing the state variables accordingly, the equation of motion for the missile engaging the target or defender can be expressed as

(7)

where

(8)

In Eq. (8), the angles between the acceleration vectors and the lines of sight, which can be expressed through the flight path angles and the line-of-sight angles, enter the input matrices. Using the terminal projection transformation of the linear system, the two zero-effort-miss variables can be expressed as

(9)

where the terminal projection involves the coefficient matrix selecting the terminal miss and the state transition matrix of the linear system. The derived zero-effort-miss will be used for training the DRL agents; the details are presented in Section 3. Up to this point, the only undetermined quantities left in Eq. (5) are the effective navigation gains, which become the optimization variables for DRL.
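As a worked illustration of Eq. (9), the zero-effort-miss of a pursuer/evader pair with first-order dynamics is commonly found to take the closed form below; the notation is introduced here for illustration, and the exact grouping used in the paper may differ:

Z(t) = y + \dot{y}\, t_{go} + a_E \tau_E^{2}\, \psi\!\left(\frac{t_{go}}{\tau_E}\right) - a_P \tau_P^{2}\, \psi\!\left(\frac{t_{go}}{\tau_P}\right), \qquad \psi(x) = e^{-x} + x - 1

where y and \dot{y} are the relative displacement and velocity normal to the initial line of sight, a_E and a_P are the lagged accelerations of the evader and pursuer with time constants τ_E and τ_P, and t_{go} is the time-to-go of the corresponding pair.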

215 3. Curriculum-Based DRL Algorithm


216 Applying the deep reinforcement learning algorithm to the TMD engagement
217 scenario consists of the following steps. First, the engagement environment is
218 constructed based on the dynamics model, which has been done in Section 2. Next, the
environment is formulated as a Markov decision process, which includes action selection, reward shaping, and observation selection. This needs to be carefully
221 designed taking full account of the dynamics of the missile and the target/defender.
222 Finally, there is a learning curriculum to ensure training stability.

223 3.1. Deep Reinforcement Learning and Curriculum Learning


Reinforcement learning, as a branch of machine learning, has received a great deal of attention from researchers in various fields in recent years. Classical reinforcement learning solves a Markov Decision Process (MDP) describing the dynamic interaction between an agent and its environment, which consists of a five-tuple (S, A, P, R, γ), where S and A denote the state space and action space, P denotes the state transition probability, R denotes the immediate reward, and γ denotes the reward discount factor. In an MDP, the immediate reward and the next state depend only on the current state and action, which is called the Markov property. Propagating a dynamic system through numerical integration is essentially consistent with the MDP formulation.
234 Benefiting from the rapid development of deep learning, reinforcement learning has
made remarkable progress in recent years and developed into deep reinforcement
236 learning. However, DRL is often plagued by reward sparsity and excessive action-state
237 space in training. In the TMD engagement, we are concerned with the terminal miss
238 distance and not with the intermediate processes. Therefore, the terminal reward in the
239 reward function dominates absolutely, which is similar to the terminal performance
240 index in the optimal control problem. Thus, the reward function in the guidance
241 problem is typically sparse, otherwise the dense intermediate reward may lead to
242 speculative strategies that the designer does not expect. Furthermore, despite the clear
problem definition and optimization goals, the nearly infinite action-state space and the wide range of random initial conditions still pose obvious difficulties for agent training. In
245 particular, random conditions such as the position, speed, and heading error of each
246 vehicle at the beginning of the engagement add uncertainty to the training.
247 To solve this problem, we use a curriculum learning approach to ensure the steady
248 progress of training. The learning process of humans and animals generally follows a
249 sequence from easy to difficult, and curriculum learning draws on this learning idea. In
250 contrast to the general paradigm of indiscriminate machine learning, curriculum
251 learning mimics the process of human learning by proposing that models start with easy
tasks and gradually progress to complex samples and knowledge [56,57]. Curriculum learning assigns different weights to the training samples


254 according to the difficulty of the samples. Initially, the highest weights are given to the
255 easy samples, and as the training process continues, the weights of the harder samples
256 will be gradually increased. Such a process of dynamically assigning weights to samples
is called a curriculum. Curriculum learning can accelerate training and reduce the number of training
258 iteration steps while achieving the same performance. In addition, curriculum learning
259 enables the model to obtain better generalization performance, i.e., it allows the model to
260 be trained to a better local optimum state. We will start with simple missions in our
261 training, so that the agent can easily obtain the sparse reward at the end of an episode.
262 Then the random range of initial conditions will be gradually expanded to enable the
263 agent to eventually cope with the complex environment.
264 In the following, we will construct MDP framework for the TMD engagement,
265 consisting of action selection, reward shaping, and observation selection. The
266 formulation requires adequate consideration of the dynamic model properties, as these
267 have a significant impact on the results.

268 3.2. Reward Shaping


269 For the design of the reward function, consider the engagement as the process of
270 confrontation game between the missile and the target/defender, and the advantage of
271 one side on the battlefield is correspondingly expressed as the disadvantage of the other
272 side. Therefore, the reward function should reflect the combat intention of both sides of
the game, including positive rewards and negative penalties; accordingly, the rewards and penalties of one side mirror the penalties and rewards of the other side.
275 We design the following two forms of reward functions

(10)
where the first form is assigned to indicate an intermediate reward or penalty within an episode and the second is assigned to indicate a terminal reward or penalty near the end of an episode. The first function takes an exponential form that rises sharply as its argument approaches zero, with two parameters regulating the growth rate of the exponential. The general idea is to obtain continuously varying dense rewards through the exponential function. However, a dense reward results in poor differentiation of the cumulative rewards between different policies and thus hampers policy updates. We eventually set the reward to vary significantly only as the argument approaches zero, meaning that this is effectively a sparse reward. The underlying differential game formulation reduces the difficulty of training and ensures that the agent completes training with sparse rewards. For both the missile and the target/defender, the argument can be chosen as either the relative distance or the zero-effort-miss. Note that using the zero-effort-miss in the reward function imposes no additional requirements on the hardware of the guidance system, as it is used only for offline training. The second function takes a stairs form, with thresholds associated with the kill radius.
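As a concrete illustration of the two reward forms described around Eq. (10), the following Python sketch implements one exponential shaping term and one stairs-shaped terminal term; the parameter names and numerical values (k1, k2, the kill-radius thresholds) are illustrative assumptions rather than the values used in the paper.

import math

def shaping_reward(z, k1=1.0, k2=100.0):
    # Exponential form: nearly flat for large |z| (distance or zero-effort-miss)
    # and rising sharply only as z approaches zero, so it acts as a near-sparse reward.
    return k1 * math.exp(-abs(z) / k2)

def terminal_reward(miss, r_kill=5.0, band=10.0, c_hit=10.0, c_miss=-10.0):
    # Stairs form evaluated near the end of an episode: a decisive bonus inside
    # the kill radius, neutral in a transition band, a decisive penalty beyond it.
    if miss <= r_kill:
        return c_hit
    if miss <= r_kill + band:
        return 0.0
    return c_miss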


289 3.3. Action Selection


According to the derived Eq. (5), when training the missile agent, the action is chosen as the two-dimensional vector of the missile's effective navigation gains; when training the target/defender agent, the action is chosen as the four-dimensional vector of the target's and the defender's effective navigation gains.
Table 1. Meanings of the effective navigation gains.

Gain (vehicle)      Meaning
Missile, gain 1     Responsible for pursuing the target, i.e., decreasing the missile/target zero-effort-miss
Missile, gain 2     Responsible for avoiding the defender, i.e., increasing the defender/missile zero-effort-miss
Target, gain 1      Responsible for avoiding the missile, i.e., increasing the missile/target zero-effort-miss
Target, gain 2      Responsible for assisting the defender in pursuing the missile, i.e., decreasing the defender/missile zero-effort-miss
Defender, gain 1    Responsible for assisting the target in avoiding the missile, i.e., increasing the missile/target zero-effort-miss
Defender, gain 2    Responsible for pursuing the missile, i.e., decreasing the defender/missile zero-effort-miss

294 Further analysis of Eq. 5 reveals that each term in the control law is precisely in the
295 form of the classical proportional navigation guidance law [58]. Thus, each of the
296 effective navigation gains has the meaning in Table 1. Beyond the effective time, that is,
297 after the engagement between the missile and the defender, the corresponding gains are
298 set to zero.
299 To further improve the efficiency and stability of the training, we further analyze
the signs of the effective navigation gains. From the control point of view, the proportional navigation guidance law can be considered as a feedback control
302 system that regulates the zero-effort-miss to zero. Therefore, only a negative feedback
303 system can be used to avoid the divergence, as shown in Figure 2 (a). The simplest step
304 maneuver is often utilized to analyze the performance of a guidance system, and the
305 conclusion that the miss distance converges to zero with increasing flight time is
306 provided in [58].






30
31 Aerospace 2022, 9, x FOR PEER REVIEW 11 of 31
32

Figure 2. Block diagram of the proportional navigation guidance system: (a) original guidance system; (b) adjoint guidance system.

We establish the adjoint system of the negative-feedback guidance system, as shown in Figure 2(b). For convenience, the adjoint time variable replaces the original time variable, and from the convolution integral we can get

(11)
Converting Eq. (11) from the time domain to the frequency domain, we obtain

(12)


316 Next, integrating the preceding equation yields

(13)


When the guidance system is a single-lag system, which means that

(14)


321 we can finally obtain the expression for the miss distance of the negative feedback
322 guidance system in the frequency domain as

(15)
325 Applying the final value theorem, it can be analyzed that when the flight time
326 increases, the miss distance will tend to zero

(16)


328 which means that the guidance system is stable and controllable. Similarly, we can find the expression of the miss
329 distance for the positive feedback guidance system in the frequency domain as follows

(17)


331 Again, applying the final value theorem, it can be found that the miss distance does
332 not converge with increasing flight time, but instead diverges to infinity


(18)


This conclusion is obvious from the control point of view, since positive feedback systems are generally not adopted because of their divergence characteristics. Therefore, positive feedback is never used in proportional navigation guidance systems, and the effective navigation gain is never set to be negative. However, we now face a situation where some gains are meant to decrease a zero-effort-miss (the missile gain pursuing the target, the target gain assisting the defender, and the defender gain pursuing the missile), while the others are meant to increase a zero-effort-miss (the missile gain avoiding the defender, the target gain avoiding the missile, and the defender gain assisting the target). Therefore, combining the properties of negative and positive feedback systems, we constrain the pursuit-related gains to be positive and the evasion-related gains to be negative.
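These sign constraints can be enforced directly on the raw actor outputs. The sketch below shows one way to do this for the missile agent and to assemble a proportional-navigation-style command from the two zero-effort-miss terms of Eq. (5); the scaling range and the exact command form are assumptions for illustration.

def missile_gains(raw_action, n_max=5.0):
    # raw_action is the 2-D actor output in [-1, 1].
    # The pursuit gain (toward the target) must be positive (negative feedback),
    # the evasion gain (against the defender) must be negative (positive feedback).
    n_pursue = n_max * (raw_action[0] + 1.0) / 2.0    # in [0, n_max]
    n_evade = -n_max * (raw_action[1] + 1.0) / 2.0    # in [-n_max, 0]
    return n_pursue, n_evade

def missile_command(n_pursue, n_evade, zem_mt, zem_dm, tgo_mt, tgo_dm):
    # PN-style acceleration command built from the two zero-effort-miss terms.
    return (n_pursue * zem_mt / max(tgo_mt, 1e-3) ** 2
            + n_evade * zem_dm / max(tgo_dm, 1e-3) ** 2)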

342 3.4. Observation Selection


343 During the flight of a vehicle, not all states are meaningful for the design of the
344 guidance law, nor all states can be accurately obtained by sensors. Redundant
345 observations not only complicate the structure of the network, thus increasing the
346 training difficulty, but also ignore the prior knowledge of the designer. Through radar
347 and filtering technology, information such as distance, closing speed, line-of-sight angle,
348 and line-of-sight angle rate can be obtained, which are also commonly required in
349 classical guidance laws. Therefore, the observation of the agent is eventually selected as
(19)
It should be noted that both in training the missile agent and in training the target/defender agent, the selected observation is the vector in Eq. (19). The observation does not impose additional hardware requirements on the vehicle and is capable of interfacing with existing weapon systems.
In addition, although the TMD engagement is divided into two phases, before and after the missile-defender encounter, the observations associated with the defender are not set to zero during the second phase, in order to ensure the stability of the network updates.
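A minimal sketch of the observation vector described above is given below; the ordering of the components and the pairing (missile/target first, missile/defender second) are assumptions consistent with the eight-dimensional network input listed in Table 3.

import numpy as np

def build_observation(r_mt, vc_mt, lam_mt, lamdot_mt,
                      r_md, vc_md, lam_md, lamdot_md):
    # Range, closing speed, line-of-sight angle and line-of-sight rate for the
    # missile/target pair and for the missile/defender pair (8 components total).
    return np.array([r_mt, vc_mt, lam_mt, lamdot_mt,
                     r_md, vc_md, lam_md, lamdot_md], dtype=np.float32)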

358 3.5. Curricula for Steady Training


Given the difficulty of training directly on the full environment, the curriculum learning
360 approach was adopted to delineate environments of varying difficulty, thus allowing the
361 agent to start with simple tasks and gradually adapt to the complex environment. The
362 curricula are set to a different range of randomness for the initial conditions. The
363 randomness of the initial conditions is reflected in the position of the vehicle (both
364 lateral and longitudinal ), the velocity , and the flight path angle including the
365 pointing error. The greater randomness of the initial conditions implies greater


366 uncertainty and complexity of the environment. If the initial conditions are generated
367 from a completely random range at the beginning, it will be difficult to stabilize the
368 training of the agent. The curricula are set up to start training from a smaller range of
369 random initial conditions and gradually expand the randomness of the initial
370 conditions.
Assuming that a variable belongs to a given full interval, when the total training step reaches a given value, the random range of the variable is

(20)

where the scheduling variable controls the curriculum difficulty. The training scheduler is depicted in Figure 3, from which it can be seen that the random range keeps expanding, and by the time the training step reaches its final scheduled value, the random range essentially coincides with the complete environment.
The growth rate of the range of random initial conditions is related to the difficulty of the environment; for more difficult environments, the scheduling step threshold needs to be larger. This involves a trade-off between training stability and training time. For scenarios with difficult initial conditions, the probability distribution of the random numbers can also be designed to adjust the curricula. In the following training, we choose the uniform distribution for initialization.

Figure 3. Curriculum training scheduler curve.
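One plausible realization of the scheduler behind Eq. (20) and Figure 3 is sketched below; the linear expansion law and the parameter names are assumptions, since only the qualitative behavior (the random range keeps widening until it covers the full interval) is described in the text.

import random

def curriculum_range(step, schedule_steps, center, full_half_width):
    # The scheduling variable grows from 0 to 1 as training progresses, so the
    # sampling interval expands from the nominal value to the full range.
    difficulty = min(step / schedule_steps, 1.0)
    half_width = difficulty * full_half_width
    return center - half_width, center + half_width

# Example: the target's lateral position has the full range [3000, 4000] m (Table 2).
low, high = curriculum_range(step=200_000, schedule_steps=1_000_000,
                             center=3500.0, full_half_width=500.0)
x_target = random.uniform(low, high)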

386 3.6. Strategy Update Algorithm


387 With the MDP constructed, the reinforcement learning algorithm applied to train
388 the agents is selected. In recent years, along with the development of deep learning,
389 reinforcement learning has evolved into deep reinforcement learning and has made
390 breakthroughs in a series of interactive decision problems. Algorithms that have


391 received wide attention include the TD3 algorithm (Twin Delayed Deep Deterministic
392 Policy Gradient) [59], the SAC algorithm (Soft Actor Critic) [60], and the PPO algorithm
393 (Proximal Policy Optimization) [61]. In this study, we adopt the PPO algorithm, which is
394 insensitive to hyperparameters, stable in the training process, and suitable for training in
395 dynamic environments with continuous action spaces.
At any moment, the agent takes an action based on the current observation from the sensors and the embedded trained policy, driving the dynamic system to the next state and receiving the corresponding reward. The interaction process continues until the end of the three-body game, which is called an episode. The agent and environment jointly give rise to a sequence of observations, actions, and rewards, which is defined as a trajectory.
The goal of the agent is to find the optimal policy that maximizes the expected cumulative discounted reward, which is usually formalized by the state-value function and the state-action value function

(21)
407 The advantage function is also calculated to estimate how advantageous an action
408 is relative to the expected optimal action under the current policy
(22)
410 In the PPO algorithm, the objective function expected to be maximized is
411 represented as

(23)

where the clipping parameter is a hyperparameter that restricts the size of policy updates, and the probability ratio is the ratio of the new policy to the old policy evaluated at the sampled action. Equation (23) implies that the advantage term is clipped if the probability ratio falls outside the clipping range. The probability ratio measures how different the two policies are, and the clipped objective ensures that excessive policy updates are avoided.


To further improve the performance of the algorithm, a value-function loss term accounting for the estimation accuracy of the critic network and an entropy bonus encouraging exploration are introduced into the surrogate objective

(24)
where the two weighting coefficients balance the value loss and the entropy bonus. The purpose of the algorithm is to update the parameters of the neural network to maximize the surrogate objective.
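In the standard notation of [61], the probability ratio and clipped surrogate described around Eq. (23), and the augmented objective of Eq. (24), take the following form, where ε is the clipping hyperparameter, \hat{A}_t the estimated advantage, and c_1, c_2 the weighting coefficients:

r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}, \qquad
L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[ \min\!\left( r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t \right) \right]

L(\theta) = \mathbb{E}_t\!\left[ L^{\mathrm{CLIP}}_t(\theta) - c_1\, L^{\mathrm{VF}}_t(\theta) + c_2\, S[\pi_\theta](s_t) \right]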

426 4. Intelligent Game Strategies

427 4.1. Attack Strategy for the Missile


428 The mission of an attacking missile is to evade the defender in flight and ultimately
429 hit the target. Therefore, it is important to balance the needs of both evasion and attack
430 during the training process, during which favoring either side will result in mission
431 failure.
The reward function for training the missile agent is designed as

(25)
where the closing speed between the missile and the defender is monitored; when it becomes negative, the distance is increasing, meaning that the confrontation between missile and defender is over. Accordingly, the penalty due to the defender's proximity to the missile is no longer applied after that moment.
The exponential form in Eq. (10) is employed as a reward to guide the missile agent to drive the zero-effort-miss with respect to the target to zero. The vehicle cannot directly measure the zero-effort-miss during flight, but choosing it as the variable for the reward function does not impose additional requirements on the detection hardware, because the reward function is only used for offline training and not for online implementation. In addition, the stairs form is combined as a penalty to guide the missile agent to avoid the defender's interception. The variable chosen for the penalty is the distance between the missile and the defender rather than the corresponding zero-effort-miss, because the timing of the maneuver has a direct effect on the terminal miss. The missile does not have to start maneuvering prematurely whenever that zero-effort-miss is close to zero, which tends to waste energy while creating additional difficulties for the later attack on the target. Better evasion is achieved by maneuvering when the defender is approaching.


Figure 4. Reward shaping for the missile agent.

With representative parameter values, the first two terms of the reward function can be plotted as in Figure 4. It can be seen that when the defender is far from the missile, the reward function encourages the missile to shorten its distance to the target. As the missile-defender distance decreases, the overall reward also decreases and becomes negative, which means that the penalty dominates, so the missile's mission at this time is mainly to evade the defender. Furthermore, with the stairs-form thresholds tied to the kill radius, the agent receives a decisive penalty when the defender is close to intercepting the missile.
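Putting the pieces of Eq. (25) together, the reward for the missile agent can be sketched as below; the gating on the missile-defender closing speed follows the description above, while all numerical values and the exact combination are illustrative assumptions.

import math

def missile_reward(zem_mt, r_md, vc_md, episode_done, miss_mt,
                   k1=1.0, k2=100.0, r_kill=5.0):
    # Attraction term: drive the missile/target zero-effort-miss to zero.
    reward = k1 * math.exp(-abs(zem_mt) / k2)
    # Evasion penalty: applied only while the defender is still closing (vc_md > 0).
    if vc_md > 0.0:
        reward -= k1 * math.exp(-r_md / k2)
        if r_md <= r_kill:
            reward -= 10.0          # intercepted by the defender: decisive penalty
    # Terminal term: decisive reward or penalty at the end of the episode.
    if episode_done:
        reward += 10.0 if miss_mt <= r_kill else -5.0
    return reward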

4.2. Active Defense Strategy for the Target/Defender


462 The target and defender share the common mission of intercepting the incoming
463 missile with the defender, thus ensuring the target’s successful survival. The target-
464 defender cooperative strategy also consists of two parts: on the one hand, the defender
465 attempts to hit the incoming attacking missile, and on the other hand, the target attempts
466 to cause the incoming missile to miss the target as much as possible by maneuvering.
The reward function for training the target/defender agent is designed as

(26)
For the target/defender, the zero-effort-miss is more appropriate for the reward function than the distance. For the defender, the purpose of guidance is to drive the defender/missile zero-effort-miss to zero, while for the target, the purpose of evasion is to maximize the missile/target zero-effort-miss. With the same parameter values, the first two terms of the reward function are the same as in Figure 4, except that the reward and penalty variables are


replaced by their target/defender counterparts. When the missile/target zero-effort-miss is small, the target and the missile are already on the intercept triangle, which is extremely detrimental to the target's survival, so the overall reward decreases to a negative value. When it is relatively large, the target is safe and the purpose of training is to improve the accuracy of the defender in intercepting the missile. Since the zero-effort-miss converges faster than the distance, a different growth-rate parameter is taken, while the terminal thresholds are the same as in the case of training the missile agent.

481 5. Simulation Results and Analysis

482 5.1. Training Setting


The random initial conditions for training are set as listed in Table 2. The initial positions of the target and defender are randomly generated within a certain airspace, and the defender is closer to the missile than the target, thus satisfying the timeline assumption that the missile-defender encounter precedes the missile-target encounter. The initial position of the defender is in front of the target, so the defender can be regarded as launched by another platform or as launched by the target at long range and having entered its terminal guidance phase. The initial position of the missile is fixed because the absolute position of each vehicle is not directly involved in the training; only the relative positions are considered. The initial conditions are sampled from uniform distributions, i.e., any value within a given interval is equally likely to be drawn. We implement sampling via the uniform function in Python's random library. In terms of the capabilities of each vehicle, the attacking missile has the largest available overload and the shortest response time, while the defender and the target have weaker maneuvering capabilities than the missile.
Table 2. Initial parameters for training.

Parameters                  Missile       Target         Defender
Lateral position/m          0             [3000, 4000]   [1500, 2500]
Longitudinal position/m     1000          [500, 1500]    [500, 1500]
Max load/g                  15            5              10
Time constant/s             0.1           0.2            0.3
Flight path angle/(°)       0
Velocity/(m/s)              [250, 300]    [150, 200]     [250, 300]
Kill radius/m               5             —              5

The training algorithm adopts the PPO algorithm with an actor-critic architecture; the relevant hyperparameters and neural network structures are listed in Table 3.


499 Table 3. Hyperparameters for training.

Hyperparameters Value
Ratio clipping 0.3
Learning rate
Discount-rate 0.99
Buffer size
Actor network for M 8-16-16-16-2
Critic network for M 8-16-16-16-1
Actor network for T/D 8-16-16-16-4
Critic network for T/D 8-16-16-16-1

500 First, the missile agent is trained with a curriculum-based learning approach. As the
501 randomness of the initial conditions increases, the complexity of the environment and
502 the difficulty of the task grows. The target against the missile agent takes a constant
503 maneuver of random size, and the defender employs proportional navigation guidance
504 law. Then, based on the obtained attack strategy of the missile, the missile agent is
505 utilized to train the active defense strategy of the target/defender. That is, the
506 target/defender is confronted with an intelligent missile that has the ability to evade the
507 defender and attack the target from the beginning of the training. All training and
simulations are carried out on a computer with an Intel Xeon Gold 6152 processor and an NVIDIA GeForce RTX 2080 Ti GPU. The environment and algorithm are programmed in Python 3.7, and the neural networks are built using the PyTorch framework. Both the actor network and the critic network for the missile and the target/defender contain three
512 hidden layers with 16 neurons each. The activation function of the network adopts the
513 ReLU function. If the number of multiplication and addition operations in network
514 propagation is taken to characterize the time complexity, the complexity of the actor
515 network for the missile can be calculated as 672, the complexity of the actor network for
516 the target/defender as 704 and the complexity of the two critic networks as 656. It can be
517 seen that these networks have relatively simple architectures, occupy little storage space,
are fast to evaluate (each forward pass takes about 0.4-0.5 ms on average on a 2.1 GHz CPU), and therefore have the potential to be deployed onboard.
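The actor network of the missile agent in Table 3 (architecture 8-16-16-16-2, ReLU activations) can be sketched in PyTorch as follows; the tanh output squashing for bounded actions is an assumption not stated in the paper.

import torch
import torch.nn as nn

class Actor(nn.Module):
    # 8-16-16-16-2 architecture for the missile agent; use act_dim=4 for the
    # target/defender actor, and act_dim=1 without squashing for the critics.
    def __init__(self, obs_dim=8, act_dim=2, hidden=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh(),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)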
It should be noted that the rise of the cumulative reward curve alone is not accepted as a criterion for training success, since the curriculum's progression from easy to difficult means that the agent can complete easy missions, and thus earn a high return, from the very beginning of the training.

524 5.2. Simulation Analysis of Attack Strategy for Missile


525 5.2.1. Engagement in Different Scenarios


526 In order to verify the effectiveness of the attack strategy of the trained missile agent,
we set up different scenarios to assess the agent, with the simulation conditions presented in Table 4. The defender adopts a proportional navigation guidance law with
529 an effective navigation ratio of 4. The target adopts a constant maneuver with random
530 direction and magnitude. The simulation results for different target positions and
531 different defender positions are presented in Figure 5.
Table 4. Initial parameters for the assessment scenarios.

Parameters                  Missile   Target           Defender
Lateral position/m          0         3500             2500
Longitudinal position/m     1000      1300/1000/700    1300/1000/700
Velocity/(m/s)              250       150              250
Kill radius/m               5         —                5

533 The relative positions of the target and defender cover most typical scenarios, so the
534 simulation results are representative. Regardless of whether the missile faces a target at
535 high altitude, a target at low altitude, or a target at a comparable altitude, and regardless
536 of the direction from which the defender intercepts, the missile can avoid the defender
and eventually hit the target. The missile with the intelligent attack strategy aims at the target as its primary objective, but maneuvers rapidly when the defender threatens it, causing the defender's interception attempt to fail.





Figure 5. Engagement trajectories when assessing the missile agent under different simulation conditions, i.e., the nine combinations (a)-(i) of target longitudinal position and defender longitudinal position listed in Table 4.

547 5.2.2. Analysis of Typical Engagement Process


By further analyzing the engagement process in Figure 5(a), we can obtain more insight into the intelligent attack strategy obtained by DRL. The moment when the defender misses the missile is marked in Figure 5(a) as 5.16 s, with a miss distance of 40.20 m, which is safe for the missile. The missile finally hits the target at 9.77 s with a miss distance of 1.13 m, completing the combat mission.
As shown in Figure 6(a), the missile quickly switches between its capability boundaries as the defender approaches and poses a threat, which is a bang-bang form of control. The sudden, drastic maneuver of the missile does not allow the defender enough time to change its flight direction, and thus the interception of the

missile fails. As can be seen from the zero-effort-miss curves in Figure 6(b), the defender's zero-effort-miss with respect to the missile increases at the last moment due to the missile's maneuver.
559 Then, the missile rapidly changes its acceleration direction after evading the defender,
560 thus compensating for the deviation in aiming at the target caused by the previous
maneuver. The zero-effort-miss of the missile with respect to the target eventually converges to zero, although it fluctuates due to the missile's maneuvers.

Figure 6. Engagement process in Figure 5(a): (a) the overload of each vehicle; (b) the zero-effort-miss curves.

565 5.2.3. Performance under Uncertainty Disturbances


566 When transferring the trained policy to the practical system, it faces various
567 uncertainty disturbances, among which the observation noise and the inaccurate model
used for training can negatively affect the performance of the agent. We evaluate the
569 success rate of the missile agent in the face of disturbances, and the results are listed in
570 Table 5. The initial conditions are initialized stochastically and noise disturbances are
571 added to the observation. As for the model uncertainty, the response time constant bias
572 and the higher-order system bias are considered. The higher-order system takes the
573 binomial form that is commonly adopted in guidance system evaluation, i.e., the third-
574 order system is represented as

(27)
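Assuming Eq. (27) follows the standard binomial convention for higher-order guidance dynamics (identical cascaded lags sharing the total time constant τ), the third-order transfer function reads

G(s) = \frac{1}{\left(1 + \dfrac{s\tau}{3}\right)^{3}}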


As the simulation results show, the trained policy is able to maintain a high success rate over a certain range of disturbances. The response-time error has a larger effect than the observation noise and the order of the model.


579 Table 5. Success rate under uncertainty disturbances.

Observation Noise 3rd order 3rd order


5% 89.2% 79.1% 88.4% 76.5%
15% 89.0% 76.4% 86.5% 75.3%
25% 82.5% 75.5% 78.5% 75.1%
35% 75.0% 74.1% 79.0% 74.2%
580

581 5.3. Simulation Analysis of Active Defense Strategy for Target/Defender


582 5.3.1. Engagement in Different Scenarios
In the same scenarios as for validating the missile agent, we further assess the effectiveness of the intelligent active defense policy for the target/defender agent obtained by DRL training. Considering that the maximum overload and dynamic response of the missile, i.e., its maneuverability and agility, are far superior to those of the target, it is difficult for the target to survive through its own maneuvering if the defender fails to intercept the attacking missile in time. If the defender fails to intercept, the target-missile engagement becomes a one-to-one pursuit-evasion game problem, which has been studied extensively in the literature [62–64].
591 The simulation results for different engagement scenarios are illustrated in Figure 7.
592 The missile utilizes the DRL-based intelligent attack strategy and the target/defender
593 adopts the intelligent active defense strategy trained with the missile agent. In all
scenarios, the defender successfully intercepts the missile. Unlike when it faces a defender employing the proportional navigation guidance law, the missile in these cases does not take timely and effective maneuvers to evade the defender. The attack strategy of the missile agent, which is essentially a neural network, appears to be paralyzed. This suggests that the cooperative actions of the target/defender produce an effect similar to a network adversarial attack, a widely noted phenomenon in deep learning classifiers in which adding a small perturbation to the network input can disable a trained deep network [65,66].
602 The reinforcement learning agent relies on observation to output action, so the
603 target/defender can maneuver to influence the missile agent's observation, thus making
the missile agent output an invalid action. In the simulation results of Figure 7, case (e) is rather special: the target does not maneuver, resulting in a direct head-on attack by the missile, and consequently the defender intercepts the missile easily. This is because the network inputs to the missile agent are all zero or constant values, which reflects an
608 unexpected flaw of the intelligent strategy obtained by DRL training. Intelligent
609 strategies based on neural networks may be tricked and defeated by very simple
610 adversaries, which should attract sufficient attention in future research.




Figure 7. Engagement trajectories when assessing the target/defender agent under different simulation conditions, i.e., the nine combinations (a)-(i) of target longitudinal position and defender longitudinal position listed in Table 4.

619 5.3.2. Analysis of Typical Engagement Process


By further analyzing the engagement process in Figure 7(a), we can obtain more insight into the intelligent active defense strategy obtained by DRL. The moment when the defender intercepts the missile is marked in Figure 7(a) as 5.09 s, with a miss distance of 1.83 m.
As shown in Figure 8, the missile does not maneuver to its maximum capability as the defender approaches, its maneuver timing lags, and it is eventually intercepted by the defender. Moreover, the zero-effort-miss curves show that the defender has locked the missile onto the intercept triangle at about 2 s, while the missile is late in locking onto the target, which also reflects the failure of the missile's attack strategy.

Figure 8. Engagement process in Figure 7(a): (a) the overload of each vehicle; (b) the zero-effort-miss curves.

631 5.3.3. Performance under Uncertainty Disturbances


632 As in Section 5.2.3, we validate the robustness of the target/defender agent’s policy
633 in the face of observation noise and model inaccuracy. As shown in Table 6, the
634 target/defender agent’s policy is more robust to uncertainty disturbances and achieves a
higher success rate in general than the missile agent. This is because attack is inherently more difficult than defense, and the target/defender's policy is trained as a targeted adversarial attack on the missile's policy.
638 Table 6. Success rate under uncertainty disturbances.


C-DRL CLQDG
Observation
Noise
3rd order 3rd order 3rd order
5% 98.4% 87.2% 93.5% 82.2% 70.0% 67.3%
15% 94.5% 86.6% 94.0% 81.0% 68.0% 66.7%
25% 95.0% 87.0% 92.5% 79.8% 53.3% 50.1%
35% 93.4% 85.7% 93.7% 79.6% 38.2% 37.3%

In addition, in Table 6 we compare the curriculum-based DRL approach (C-DRL) with the cooperative linear quadratic differential game (CLQDG) guidance law, a classical guidance law for the TMD scenario [55]. The gains of the CLQDG guidance law do not involve the response time, so we only analyze the effect of input noise and system order. Since ideal dynamics are assumed in the derivation of the gains, the effect of the system order is more pronounced. Facing input noise, the performance of the C-DRL approach degrades only slightly, and the robustness of CLQDG is not as strong as that of the C-DRL approach. In addition, for complex three-dimensional multi-body game problems, deriving an analytic guidance law by the differential game approach may not be feasible, so the reinforcement learning approach has greater potential for development.

650 6. Conclusions
651 For the scenario of target-missile-defender three-body offensive and defensive
652 confrontation, intelligent game strategies using curriculum-based deep reinforcement
653 learning are proposed, including attack strategy for attacking missile and active defense
strategy for the target/defender. The results of the differential game are combined with deep reinforcement learning algorithms to give the agent training a clearer direction and better adaptability to complex, strongly nonlinear environments. The three-body
657 adversarial game is constructed as MDP suitable for reinforcement learning training by
658 analyzing the sign of the action space and designing the reward function in the
659 adversarial form. The missile agent and target/defender agent are trained with a
660 curriculum learning approach to obtain the intelligent game strategies. Through
661 simulation verification we can draw the following conclusions.
(1) In simulations that validate the attack strategy of the missile, the curriculum-based DRL trained missile agent is able to avoid the defender and hit the target in various situations.
(2) In simulations that validate the active defense strategy of the target/defender, the less capable target/defender is able to achieve an effect similar to a network adversarial attack


against the missile agent. The defender intercepts the missile before it hits the target.
670 (3) The intelligent game strategies are able to maintain robustness in the face of
671 disturbances from input noise and modeling inaccuracies.
672 In future research, three-dimensional scenarios with multiple attacking missiles,
673 multiple defenders, and multiple targets will be considered. The battlefield environment
674 is becoming more complicated, and the traditional differential game and weapon-target
675 assignment methods will show more obvious limitations, while the intelligent game
676 strategy based on DRL has better adaptability for complex scenarios. Motion analysis in
677 three dimensions can be conducted utilizing vector guidance laws, or decomposed into
678 two perpendicular channels and solved in the plane, as has been proven to be possible in
679 previous research. Combined with DRL, more complex multi-body game problems are
680 expected to be solved. Technologies such as self-play and adversarial attack will also be
681 applied to the generation and analysis of game strategies. In addition, considering the
682 difficulty of obtaining battlefield observations, the training algorithm needs to be
683 improved to adapt to the scenarios with imperfect information.

Author Contributions: The contributions of the authors are the following: Conceptualization, W.C. and X.G.; methodology, X.G. and Z.C.; validation, Z.C.; formal analysis, X.G.; investigation, X.G.; resources, X.G. and Z.C.; writing—original draft preparation, X.G.; writing—review and editing, W.C. and Z.C.; visualization, X.G.; supervision, W.C.; project administration, W.C.; funding acquisition, Z.C. All authors have read and agreed to the published version of the manuscript.
Funding: This research was funded by the China Postdoctoral Science Foundation (Grant No. 2021M700321).
Data Availability Statement: All data used during the study appear in the submitted article.
Acknowledgments: The study described in this paper was supported by the China Postdoctoral Science Foundation (Grant No. 2021M700321). The authors fully appreciate the financial support.
Conflicts of Interest: The authors declare no conflict of interest.



697 References
698 1. Li, C.; Wang, J.; Huang, P. Optimal Cooperative Line-of-Sight Guidance for Defending a Guided Missile. Aerospace 2022, 9,
699 232, doi:10.3390/aerospace9050232.
700 2. Li, Q.; Yan, T.; Gao, M.; Fan, Y.; Yan, J. Optimal Cooperative Guidance Strategies for Aircraft Defense with Impact Angle
701 Constraints. Aerospace 2022, 9, 710, doi:10.3390/aerospace9110710.
702 3. Liang, H.; Li, Z.; Wu, J.; Zheng, Y.; Chu, H.; Wang, J. Optimal Guidance Laws for a Hypersonic Multiplayer Pursuit-Evasion
703 Game Based on a Differential Game Strategy. Aerospace 2022, 9, 97, doi:10.3390/aerospace9020097.
704 4. Shi, H.; Chen, Z.; Zhu, J.; Kuang, M. Model predictive guidance for active aircraft protection from a homing missile. IET Con-
705 trol Theory & Applications 2022, 16, 208–218, doi:10.1049/cth2.12218.
706 5. Kumar, S.R.; Mukherjee, D. Cooperative Active Aircraft Protection Guidance Using Line-of-Sight Approach. IEEE Transac-
707 tions on Aerospace and Electronic Systems 2021, 57, 957–967, doi:10.1109/TAES.2020.3046328.
708 6. Yan, M.; Yang, R.; Zhang, Y.; Yue, L.; Hu, D. A hierarchical reinforcement learning method for missile evasion and guidance.
709 Sci. Rep. 2022, 12, 18888, doi:10.1038/s41598-022-21756-6.
710 7. Liang, H.; Wang, J.; Wang, Y.; Wang, L.; Liu, P. Optimal guidance against active defense ballistic missiles via differential
711 game strategies. Chin. J. Aeronaut. 2020, 33, 978–989, doi:10.1016/j.cja.2019.12.009.
712 8. Ratnoo, A.; Shima, T. Line-of-Sight Interceptor Guidance for Defending an Aircraft. J. Guid. Control Dyn. 2011, 34, 522–532,
713 doi:10.2514/1.50572.
714 9. Yamasaki, T.; Balakrishnan, S. Triangle Intercept Guidance for Aerial Defense. In AIAA Guidance, Navigation, and Control Con-
715 ference; American Institute of Aeronautics and Astronautics, 2010.
716 10. Yamasaki, T.; Balakrishnan, S.N.; Takano, H. Modified Command to Line-of-Sight Intercept Guidance for Aircraft Defense. J.
717 Guid. Control Dyn. 2013, 36, 898–902, doi:10.2514/1.58566.
718 11. Yamasaki, T.; Balakrishnan, S.N. Intercept Guidance for Cooperative Aircraft Defense against a Guided Missile. IFAC Proceed-
719 ings Volumes 2010, 43, 118–123, doi:10.3182/20100906-5-JP-2022.00021.
720 12. Liu, S.; Wang, Y.; Li, Y.; Yan, B.; Zhang, T. Cooperative guidance for active defence based on line-of-sight constraint under a
721 low-speed ratio. The Aeronautical Journal 2022, 1–19, doi:10.1017/aer.2022.62.
722 13. Shaferman, V.; Oshman, Y. Stochastic Cooperative Interception Using Information Sharing Based on Engagement Staggering.
723 J. Guid. Control Dyn. 2016, 39, 2127–2141, doi:10.2514/1.G000437.
724 14. Prokopov, O.; Shima, T. Linear Quadratic Optimal Cooperative Strategies for Active Aircraft Protection. J. Guid. Control Dyn.
725 2013, 36, 753–764, doi:10.2514/1.58531.
726 15. Shima, T. Optimal Cooperative Pursuit and Evasion Strategies Against a Homing Missile. J. Guid. Control Dyn. 2011, 34, 414–
727 425, doi:10.2514/1.51765.


728 16. Alkaher, D.; Moshaiov, A. Game-Based Safe Aircraft Navigation in the Presence of Energy-Bleeding Coasting Missile. J. Guid.
729 Control Dyn. 2016, 39, 1539–1550, doi:10.2514/1.G001676.
730 17. Liu, F.; Dong, X.; Li, Q.; Ren, Z. Cooperative differential games guidance laws for multiple attackers against an active defense
731 target. Chinese Journal of Aeronautics 2022, 35, 374–389, doi:10.1016/j.cja.2021.07.033.
732 18. Chen, W.; Cheng, C.; Jin, B.; Xu, Z. Research on differential game guidance law for intercepting hypersonic vehicles. In 6th In-
733 ternational Workshop on Advanced Algorithms and Control Engineering (IWAACE 2022). 6th International Workshop on Ad-
734 vanced Algorithms and Control Engineering (IWAACE 2022), Qingdao, China, 2022/7/8 - 2022/7/10; Qiu, D., Ye, X., Sun, N.,
735 Eds.; SPIE, 2022; p 94, ISBN 9781510657755.
736 19. Rubinsky, S.; Gutman, S. Three-Player Pursuit and Evasion Conflict. J. Guid. Control Dyn. 2014, 37, 98–110,
737 doi:10.2514/1.61832.
738 20. Rubinsky, S.; Gutman, S. Vector Guidance Approach to Three-Player Conflict in Exoatmospheric Interception. J. Guid. Control
739 Dyn. 2015, 38, 2270–2286, doi:10.2514/1.G000942.
740 21. Garcia, E.; Casbeer, D.W.; Pachter, M. Pursuit in the Presence of a Defender. Dyn Games Appl 2019, 9, 652–670, doi:10.1007/
741 s13235-018-0271-9.
742 22. Garcia, E.; Casbeer, D.W.; Pachter, M. The Complete Differential Game of Active Target Defense. J Optim Theory Appl 2021,
743 191, 675–699, doi:10.1007/s10957-021-01816-z.
744 23. Garcia, E.; Casbeer, D.W.; Fuchs, Z.E.; Pachter, M. Cooperative Missile Guidance for Active Defense of Air Vehicles. IEEE
745 Trans. Aerosp. Electron. Syst. 2018, 54, 706–721, doi:10.1109/TAES.2017.2764269.
746 24. Garcia, E.; Casbeer, D.W.; Pachter, M. Design and Analysis of State-Feedback Optimal Strategies for the Differential Game of
747 Active Defense. IEEE Trans. Autom. Control 2018, 64, 1, doi:10.1109/TAC.2018.2828088.
748 25. Liang, L.; Deng, F.; Lu, M.; Chen, J. Analysis of Role Switch for Cooperative Target Defense Differential Game. IEEE Trans.
749 Autom. Control 2021, 66, 902–909, doi:10.1109/TAC.2020.2987701.
750 26. Liang, L.; Deng, F.; Peng, Z.; Li, X.; Zha, W. A differential game for cooperative target defense. Automatica 2019, 102, 58–71,
751 doi:10.1016/j.automatica.2018.12.034.
752 27. Qi, N.; Sun, Q.; Zhao, J. Evasion and pursuit guidance law against defended target. Chin. J. Aeronaut. 2017, 30, 1958–1973,
753 doi:10.1016/j.cja.2017.06.015.
754 28. Shaferman, V.; Shima, T. Cooperative Multiple-Model Adaptive Guidance for an Aircraft Defending Missile. J. Guid. Control
755 Dyn. 2010, 33, 1801–1813, doi:10.2514/1.49515.
756 29. Shaferman, V.; Shima, T. Cooperative Differential Games Guidance Laws for Imposing a Relative Intercept Angle. J. Guid.
757 Control Dyn. 2017, 40, 2465–2480, doi:10.2514/1.G002594.


758 30. Saurav, A.; Kumar, S.R.; Maity, A. Cooperative Guidance Strategies for Aircraft Defense with Impact Angle Constraints. In
759 AIAA Scitech 2019 Forum. AIAA Scitech 2019 Forum, San Diego, California, 2019/01/07; American Institute of Aeronautics and
760 Astronautics: Reston, Virginia, 2019, ISBN 978-1-62410-578-4.
761 31. Liang, H.; Wang, J.; Liu, J.; Liu, P. Guidance strategies for interceptor against active defense spacecraft in two-on-two engage-
762 ment. Aerosp. Sci. Technol. 2020, 96, 105529, doi:10.1016/j.ast.2019.105529.
763 32. Shalumov, V.; Shima, T. Weapon–Target-Allocation Strategies in Multiagent Target–Missile–Defender Engagement. J. Guid.
764 Control Dyn. 2017, 40, 2452–2464, doi:10.2514/1.G002598.
765 33. Sun, Q.; Qi, N.; Xiao, L.; Lin, H. Differential game strategy in three-player evasion and pursuit scenarios. J. Syst. Eng. Electron.
766 2018, 29, 352–366, doi:10.21629/JSEE.2018.02.16.
767 34. Sun, Q.; Zhang, C.; Liu, N.; Zhou, W.; Qi, N. Guidance laws for attacking defended target. Chin. J. Aeronaut. 2019, 32, 2337–
768 2353, doi:10.1016/j.cja.2019.05.011.
769 35. Chai, R.; Tsourdos, A.; Savvaris, A.; Chai, S.; Xia, Y.; Philip Chen, C.L. Review of advanced guidance and control algorithms
770 for space/aerospace vehicles. Progress in Aerospace Sciences 2021, 122, 100696, doi:10.1016/j.paerosci.2021.100696.
771 36. Liu, Y.; Wang, H.; Wu, T.; Lun, Y.; Fan, J.; Wu, J. Attitude control for hypersonic reentry vehicles: An efficient deep reinforce-
772 ment learning method. Appl. Soft Comput. 2022, 123, 108865, doi:10.1016/j.asoc.2022.108865.
773 37. Gaudet, B.; Furfaro, R.; Linares, R. Reinforcement learning for angle-only intercept guidance of maneuvering targets. Aerosp.
774 Sci. Technol. 2020, 99, 1–10, doi:10.1016/j.ast.2020.105746.
775 38. He, S.; Shin, H.-S.; Tsourdos, A. Computational Missile Guidance: A Deep Reinforcement Learning Approach. Journal of Aero-
776 space Information Systems 2021, 18, 571–582, doi:10.2514/1.I010970.
777 39. Furfaro, R.; Scorsoglio, A.; Linares, R.; Massari, M. Adaptive generalized ZEM-ZEV feedback guidance for planetary landing
778 via a deep reinforcement learning approach. Acta Astronaut. 2020, 171, 156–171, doi:10.1016/j.actaastro.2020.02.051.
779 40. Gaudet, B.; Linares, R.; Furfaro, R. Adaptive guidance and integrated navigation with reinforcement meta-learning. Acta As-
780 tronaut. 2020, 169, 180–190, doi:10.1016/j.actaastro.2020.01.007.
781 41. He, L.; Aouf, N.; Song, B. Explainable Deep Reinforcement Learning for UAV autonomous path planning. Aerosp. Sci. Technol.
782 2021, 118, 107052, doi:10.1016/j.ast.2021.107052.
783 42. Wang, Y.; Dong, L.; Sun, C. Cooperative control for multi-player pursuit-evasion games with reinforcement learning. Neuro-
784 computing 2020, 412, 101–114, doi:10.1016/j.neucom.2020.06.031.
785 43. English, J.T.; Wilhelm, J.P. Defender-Aware Attacking Guidance Policy for the Target–Attacker–Defender Differential Game.
786 Journal of Aerospace Information Systems 2021, 18, 366–376, doi:10.2514/1.I010877.
787 44. Shalumov, V. Cooperative online Guide-Launch-Guide policy in a target-missile-defender engagement using deep reinforce-
788 ment learning. Aerosp. Sci. Technol. 2020, 104, 105996, doi:10.1016/j.ast.2020.105996.


789 45. Qiu, X.; Gao, C.; Jing, W. Maneuvering penetration strategies of ballistic missiles based on deep reinforcement learning. Proc.
790 Inst. Mech. Eng., Part G: J. Aerosp. Eng. 2022, 095441002210883, doi:10.1177/09544100221088361.
791 46. Radac, M.-B.; Lala, T. Robust Control of Unknown Observable Nonlinear Systems Solved as a Zero-Sum Game. IEEE Access
792 2020, 8, 214153–214165, doi:10.1109/ACCESS.2020.3040185.
793 47. Zhao, M.; Wang, D.; Ha, M.; Qiao, J. Evolving and Incremental Value Iteration Schemes for Nonlinear Discrete-Time Zero-
794 Sum Games. IEEE Trans. Cybern. 2022, PP, doi:10.1109/TCYB.2022.3198078.
795 48. Xue, S.; Luo, B.; Liu, D. Event-Triggered Adaptive Dynamic Programming for Zero-Sum Game of Partially Unknown Contin-
796 uous-Time Nonlinear Systems. IEEE Trans. Syst. Man Cybern, Syst. 2020, 50, 3189–3199, doi:10.1109/TSMC.2018.2852810.
797 49. Wei, Q.; Liu, D.; Lin, Q.; Song, R. Adaptive Dynamic Programming for Discrete-Time Zero-Sum Games. IEEE Trans. Neural
798 Networks Learn. Syst. 2018, 29, 957–969, doi:10.1109/TNNLS.2016.2638863.
799 50. Zhu, Y.; Zhao, D.; Li, X. Iterative Adaptive Dynamic Programming for Solving Unknown Nonlinear Zero-Sum Game Based
800 on Online Data. IEEE Trans. Neural Networks Learn. Syst. 2017, 28, 714–725, doi:10.1109/TNNLS.2016.2561300.
801 51. Jiang, H.; Zhang, H.; Han, J.; Zhang, K. Iterative adaptive dynamic programming methods with neural network implementa-
802 tion for multi-player zero-sum games. Neurocomputing 2018, 307, 54–60, doi:10.1016/j.neucom.2018.04.005.
803 52. Wang, W.; Chen, X.; Du, J. Model-free finite-horizon optimal control of discrete-time two-player zero-sum games. Interna-
804 tional Journal of Systems Science 2023, 54, 167–179, doi:10.1080/00207721.2022.2111236.
805 53. Zhao, W.; Queralta, J.P.; Westerlund, T. Sim-to-Real Transfer in Deep Reinforcement Learning for Robotics: a Survey. In 2020
806 IEEE Symposium Series on Computational Intelligence (SSCI). 2020 IEEE Symposium Series on Computational Intelligence
807 (SSCI), Canberra, ACT, Australia, 2020/12/1 - 2020/12/4; IEEE, 2020; pp 737–744, ISBN 978-1-7281-2547-3.
808 54. Bengio, Y.; Louradour, J.; Collobert, R.; Weston, J. Curriculum learning. In Proceedings of the 26th Annual International Confer-
809 ence on Machine Learning - ICML '09. the 26th Annual International Conference, Montreal, Quebec, Canada, 2009/6/14 -
810 2009/6/18; Danyluk, A., Bottou, L., Littman, M., Eds.; ACM Press: New York, New York, USA, 2009; pp 1–8, ISBN
811 9781605585161.
812 55. Perelman, A.; Shima, T.; Rusnak, I. Cooperative Differential Games Strategies for Active Aircraft Protection from a Homing
813 Missile. J. Guid. Control Dyn. 2011, 34, 761–773, doi:10.2514/1.51611.
814 56. Wang, X.; Chen, Y.; Zhu, W. A Survey on Curriculum Learning. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 4555–4576,
815 doi:10.1109/TPAMI.2021.3069908.
816 57. Soviany, P.; Ionescu, R.T.; Rota, P.; Sebe, N. Curriculum Learning: A Survey. Int. J. Comput. Vis. 2022, 130, 1526–1565,
817 doi:10.1007/s11263-022-01611-x.
818 58. Zarchan, P. Tactical and strategic missile guidance, 6th ed.; American Institute of Aeronautics and Astronautics: Reston, Va.,
819 2012, ISBN 978-1-60086-894-8.


820 59. Fujimoto, S.; van Hoof, H.; Meger, D. Addressing Function Approximation Error in Actor-Critic Methods. In Proceedings of the
821 35th International Conference on Machine Learning; Dy, J., Krause, A., Eds.; PMLR, 2018; pp 1587–1596.
822 60. Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning
823 with a Stochastic Actor. In Proceedings of the 35th International Conference on Machine Learning; Dy, J., Krause, A., Eds.; PMLR,
824 2018; pp 1861–1870.
825 61. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms, 2017. Available online:
826 https://arxiv.org/abs/1707.06347v2.
827 62. Liu, F.; Dong, X.; Li, Q.; Ren, Z. Robust multi-agent differential games with application to cooperative guidance. Aerosp. Sci.
828 Technol. 2021, 111, 106568, doi:10.1016/j.ast.2021.106568.
829 63. Wei, X.; Yang, J. Optimal Strategies for Multiple Unmanned Aerial Vehicles in a Pursuit/Evasion Differential Game. J. Guid.
830 Control Dyn. 2018, 41, 1799–1806, doi:10.2514/1.G003480.
831 64. Shaferman, V.; Shima, T. Cooperative Optimal Guidance Laws for Imposing a Relative Intercept Angle. J. Guid. Control Dyn.
832 2015, 38, 1395–1408, doi:10.2514/1.G000568.
833 65. Ilahi, I.; Usama, M.; Qadir, J.; Janjua, M.U.; Al-Fuqaha, A.; Hoang, D.T.; Niyato, D. Challenges and Countermeasures for Ad-
834 versarial Attacks on Deep Reinforcement Learning. IEEE Trans. Artif. Intell. 2022, 3, 90–109, doi:10.1109/TAI.2021.3111139.
835 66. Qiu, S.; Liu, Q.; Zhou, S.; Wu, C. Review of Artificial Intelligence Adversarial Attack and Defense Technologies. Appl. Sci.
836 (Basel) 2019, 9, 909, doi:10.3390/app9050909.

