Article
Powered Landing Control of Reusable Rockets Based on
Softmax Double DDPG
Wenting Li 1,2 , Xiuhui Zhang 3 , Yunfeng Dong 3 , Yan Lin 1 and Hongjue Li 3,4, *
1 School of Automation Science and Electrical Engineering, Beihang University, Beijing 102206, China
2 Beijing Aerospace Automatic Control Institute, Beijing 100854, China
3 School of Astronautics, Beihang University, Beijing 102206, China
4 Shenzhen Institute of Beihang University, Shenzhen 518063, China
* Correspondence: lihongjue@buaa.edu.cn
Abstract: Multi-stage launch vehicles are currently the primary tool for humans to reach extraterres-
trial space. The technology of recovering and reusing rockets can effectively shorten rocket launch
cycles and reduce space launch costs. With the development of deep representation learning, rein-
forcement learning (RL) has become a robust learning framework capable of learning complex policies
in high-dimensional environments. In this paper, a deep reinforcement learning-based reusable rocket
landing control method is proposed. The mathematical process of reusable rocket landing is modelled
by considering the aerodynamic drag, thrust, gravitational force, and Earth’s rotation during the
landing process. A reward function is designed according to the rewards and penalties derived from
mission accomplishment, terminal constraints, and landing performance. Based on this, the Softmax
double deep deterministic policy gradient (SD3) deep RL method is applied to build a robust reusable
rocket landing control method. In the constructed simulation environment, the proposed method can
achieve convergent and robust control results, demonstrating its effectiveness.
1. Introduction
With the expanding demand for launching satellites and spacecraft, expendable launch vehicles are becoming one of the main factors restricting the exploration of extraterrestrial space. Expendable launch vehicles discard critical components such as engine and navigation equipment directly after completing their missions, raising the cost of space access [1]. To tackle the above problems, reusable launch vehicles have gained widespread attention regarding their economy and flexibility [2–5].

Research in reusable launch vehicles has made continuous progress in recent years. As an extension of parachute landing technology, the parachute recovery method was widely studied and applied in various rockets [6–9]. Although it is a simple, reliable, and economical method, the unpowered descent process makes it difficult to predict the final landing point of the rocket. As a comparison, the powered landing control method has gradually become the preferred option considering the safety, controllability, and rapidity of the landing process. Academic research on powered landing generally applies offline or online trajectory optimization followed by various tracking control methods for pinpoint landing. Convex optimization is one of the typical methods to transform the optimal landing problem into a convex optimization problem where an optimal landing trajectory can be obtained [10–12]. Açıkmeşe et al. [10] introduced a convexification of the control constraints that was proven to be lossless and used interior-point methods of convex optimization to obtain optimal solutions of the original non-convex optimal control problem. Blackmore [11] combined convex optimization by CVXGEN [13] with powered divert guidance to optimize the trajectory and eliminate error. While convex-optimization-based methods can achieve optimal solutions, they require the problem to be convex and often become computationally
inefficient as the complexity of the problem increases. Besides convex optimization, many
other numerical optimization methods can also be applied to solve the optimal landing
problem. Sagliano et al. [14] applied a multivariate pseudospectral interpolation scheme to
subspaces of a database of pre-computed trajectories to generate real-time capable entry
guidance solutions. Seelbinder et al. [15] introduced parametric sensitivity analysis of
non-linear programs for near-optimal control sequence generation. To take advantage
of both convex optimization and pseudospectral methods, Wang et al. [16] adopted a
pseudospectral-improved successive convexification algorithm to solve the trajectory opti-
mization problem. The above-mentioned methods greatly enriched the solution diversity of
the optimal landing problem, but they usually require extensive calculations and accurate
modelling of the environment, which are difficult to satisfy in practice.
Leveraging the continuous development and breakthroughs of deep learning tech-
nology, the idea of using deep neural networks (DNNs) in guidance, navigation, and
control has been widely explored in the literature [17–20]. Among all deep learning tech-
nologies, supervised learning has been extensively discussed. DNNs have been applied
to asteroid landing trajectory optimization [17] and first-stage recovery flight of reusable
launch vehicles [18]. Furfaro et al. [19,20] designed a set of convolutional neural networks
(CNN) and long short-term memory (LSTM) architectures to handle the problem of
autonomous lunar landing. Although supervised learning exhibits strong fitting abilities
for optimal landing controls, acquiring sufficient data required for the training process
is very time-consuming [17,19]. An optimization method (e.g., GPOPS [19]) is usually
employed before supervised learning to generate optimal state-control pairs as training
samples (146,949 samples in [17] and 2,970,000 samples in [19]). Depending on the com-
plexity of the optimization problem, a single sample can take anywhere from a few seconds
to several hours to compute, and generating such a large number of samples consumes
considerable time.
Deep reinforcement learning (DRL), on the other hand, uses trial-and-error interaction
with the environment to optimize the policy instead of learning from enormous labelled
data, thus making it useful in many fields such as chess [21], games [22,23], investment [24],
and autonomous driving [25,26]. DRL has also attracted extensive attention in the field of reusable launch
vehicles. Ciabatti et al. [27] combined a deep deterministic policy gradient (DDPG) [28]
algorithm with transfer learning to tackle the problem of planetary landing on unknown
extra-terrestrial environments. Cheng et al. [29] derived an interactive reinforcement learn-
ing (RL) algorithm for the Moon fuel-optimal landing problem. Furfaro et al. [30] presented
an adaptive guidance algorithm based on zero-effort-miss/zero-effort-velocity (ZEM/ZEV)
and deterministic policy gradient [31]. Gaudet et al. [32] applied the principles of the
deep Q network [33] into a six degrees-of-freedom Mars landing simulation environment,
showing strong robustness.
In this paper, we present a DRL-based control method for the final landing stage of a two-stage reusable rocket. In contrast with previously applied RL algorithms, we apply the Softmax double deep deterministic policy gradient (SD3) to keep the error between the value function estimated by the neural network and the true value function within a reasonable interval. This advantage is particularly useful when facing the coupling nature of the 3-DOF translational and 3-DOF rotational motion of the rocket, as the value function is difficult to estimate accurately in such scenarios. In addition, we design a novel reward function by introducing a state-dependent Gaussian distribution. The landing result is evaluated by considering the landing position error, landing velocity, and landing impact related to the final acceleration. To verify the effectiveness of our proposed method, we built a three-dimensional simulation environment considering gravity, thrust, and aerodynamic drag to test the trained landing control policy. Our contributions are listed as follows:
1. We introduce RL to obtain the optimal control commands for the reusable rocket landing problem. We apply the SD3 algorithm to simultaneously predict both the optimal thrust and its nozzle angle considering the three-axis translational and three-
g_m = -\frac{3\mu}{\|\mathbf{r}\|^2} J_2 \left( \frac{a_e}{\|\mathbf{r}\|} \right)^2 \sin\Phi

\mathbf{D} = \frac{1}{2} \rho \|\mathbf{v}\| S_{ref} C_D(Ma) \, \mathbf{v}    (2)
where ρ is the atmospheric density determined by the height, v is the velocity of the rocket
relative to the atmosphere, S_{ref} is the reference area of the rocket, and Ma is the Mach
number. C_D is the drag coefficient, which is a non-linear function of the velocity v. The detailed
calculation of the aerodynamic drag D is discussed in Appendix B.
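As an illustration, Equation (2) can be evaluated as in the following minimal Python sketch; the density ρ and the drag coefficient C_D(Ma) are assumed to be looked up externally (e.g., from the tables referenced in Appendix B), and the function name is ours, not the authors'.

```python
# A minimal sketch (not the authors' code) of the drag model in Equation (2).
# rho and c_d are passed in directly; in the paper they come from an
# altitude-density table and the non-linear C_D(Ma) curve of Appendix B.
import numpy as np

def aerodynamic_drag(v_rel, rho, c_d, s_ref):
    """D = 0.5 * rho * ||v|| * S_ref * C_D(Ma) * v, with v relative to the atmosphere."""
    v_rel = np.asarray(v_rel, dtype=float)
    return 0.5 * rho * np.linalg.norm(v_rel) * s_ref * c_d * v_rel
```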
The rotational movement of the rocket is governed by the attitude dynamics equation
expressed as
I \frac{d\omega_b}{dt} + \omega_b \times (I \omega_b) = M    (3)
where \omega_b = [\omega_x\ \omega_y\ \omega_z]^\top is the angular velocity of the rocket, M is the external torque on
the rocket, and I is the inertia matrix of the rocket. Since the torque caused by aerodynamic drag
is relatively small compared to the torque generated by the engine, it is neglected in the
following discussion. The shape of the rocket is simplified and defined as a cylinder with
radius r and height h. Therefore, the inertia matrix can be expressed as
I = \begin{bmatrix} \frac{1}{2} m r^2 & 0 & 0 \\ 0 & \frac{1}{12} m (3r^2 + h^2) & 0 \\ 0 & 0 & \frac{1}{12} m (3r^2 + h^2) \end{bmatrix}    (4)
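A minimal sketch of Equation (4) is given below, using the mass, radius, and height of the rocket listed later in the simulation parameters; the function name is ours.

```python
# A minimal sketch of the cylinder inertia matrix in Equation (4).
import numpy as np

def cylinder_inertia(m, r, h):
    """Inertia matrix of a uniform cylinder about its body axes."""
    i_axial = 0.5 * m * r**2                    # about the longitudinal axis
    i_lateral = m * (3.0 * r**2 + h**2) / 12.0  # about the two lateral axes
    return np.diag([i_axial, i_lateral, i_lateral])
```

For the rocket parameters used later (m = 35,000 kg, r = 3.7 m, h = 40 m), this gives I_xx ≈ 2.4 × 10^5 kg·m^2 and I_yy = I_zz ≈ 4.79 × 10^6 kg·m^2.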
R_t = \sum_{i=0}^{T} \gamma^i \, r(s_i, a_i)    (5)
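For reference, the finite-horizon return of Equation (5) can be computed from an episode's reward sequence as in this short sketch (γ = 0.99 matches the discount factor listed later in the hyperparameter table).

```python
# A minimal sketch of the discounted return in Equation (5).
def discounted_return(rewards, gamma=0.99):
    """R = sum_i gamma**i * r(s_i, a_i) over one episode's reward sequence."""
    return sum(gamma**i * r for i, r in enumerate(rewards))
```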
RL aims to learn a policy that maximizes the expected return from the starting dis-
tribution, J = \mathbb{E}_{r_i, s_i \sim E, a_i \sim \pi}[R_1]. Specifically, in the reusable rocket landing problem, the policy
π refers to an actor network that can output thrust control commands based on states
such as velocity and position, and the return refers to the reward given by the environment.
During the learning process, a rocket is repeatedly and randomly initialized to land in a fixed
landing zone under the control of policy π, and the policy is updated according to the
environment-based rewards. The action-value function describes the expected return after
taking an action a_t in state s_t and thereafter following policy π:

Q^{\pi}(s_t, a_t) = \mathbb{E}_{r_{i \ge t},\, s_{i > t} \sim E,\, a_{i > t} \sim \pi}\left[ R_t \mid s_t, a_t \right]
As the first method to extend DQN (deep Q network) to a continuous control space,
DDPG is one of the standard algorithms for training an off-policy model-free RL network.
DDPG learns both a policy network and a value function by decomposing them into an
actor network \pi(s \mid \theta^{\pi}), a target actor network \pi'(s \mid \theta^{\pi'}), a Q network Q(s, a \mid \theta^{Q}), and a target
Q network Q'(s, a \mid \theta^{Q'}). The parameterized actor network learns to determine the
action with the highest value estimate according to the critic network via the policy gradient.
The parameters of the target networks are periodically copied from the main networks
to ensure the stability and convergence of training while alleviating the overestimation of Q.
As an off-policy algorithm, DDPG takes advantage of a replay buffer D to store past experiences
and learns from them by randomly sampling from the replay buffer. The Q networks are
updated by minimizing the loss L:

L = \mathbb{E}_{(s, a, r, s') \sim D}\left[ \left( Q(s, a \mid \theta^{Q}) - \left( r + \gamma \max_{a'} Q'(s', a' \mid \theta^{Q'}) \right) \right)^{2} \right]    (8)
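In code, one critic update step corresponding to Equation (8) can be sketched as follows (PyTorch). Following standard DDPG practice, the target action a' is produced by the target actor rather than an explicit maximization, and the network, optimizer, and batch objects are hypothetical placeholders, not the authors' implementation.

```python
# A minimal sketch of a DDPG-style critic update for the loss in Equation (8).
import torch
import torch.nn.functional as F

def critic_update(critic, target_critic, target_actor, critic_optimizer,
                  batch, gamma=0.99):
    """One gradient step on the critic loss; `batch` is (s, a, r, s', done)."""
    s, a, r, s_next, done = batch  # tensors sampled from the replay buffer D

    with torch.no_grad():
        a_next = target_actor(s_next)                  # a' from the target actor
        q_next = target_critic(s_next, a_next)         # Q'(s', a' | theta^Q')
        q_target = r + gamma * (1.0 - done) * q_next   # bootstrapped TD target

    q_value = critic(s, a)                             # Q(s, a | theta^Q)
    loss = F.mse_loss(q_value, q_target)               # squared TD error

    critic_optimizer.zero_grad()
    loss.backward()
    critic_optimizer.step()
    return loss.item()
```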
Figure 1. The direction of the thrust vector T in the body coordinate system (x_b, y_b, z_b) is determined by δ_y and δ_z.
To simplify the outputs of the agent for better numerical stability, the control variables
are constrained to [−1, 1]. The actual actions taken by the rocket can then be defined as
[a_1\ a_2\ a_3], and their relations with the original control variables are

\|T\| = \frac{T_{max} - T_{min}}{2} a_1 + \frac{T_{max} + T_{min}}{2}, \qquad \delta_y = \delta_{max} a_2, \qquad \delta_z = \delta_{max} a_3    (14)
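A minimal sketch of the action mapping in Equation (14) follows. T_min and T_max match the simulation parameters given later; δ_max is not listed in this excerpt and the value below is a placeholder.

```python
# A minimal sketch of the action mapping in Equation (14).
import numpy as np

def denormalize_action(action, t_min=250_000.0, t_max=750_000.0,
                       delta_max=np.deg2rad(10.0)):   # delta_max is assumed
    """Map agent outputs a1, a2, a3 in [-1, 1] to thrust and nozzle angles."""
    a1, a2, a3 = np.clip(np.asarray(action, dtype=float), -1.0, 1.0)
    thrust = 0.5 * (t_max - t_min) * a1 + 0.5 * (t_max + t_min)
    return thrust, delta_max * a2, delta_max * a3
```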
Since the torque caused by aerodynamic drag is relatively small, only the torque
generated by thrust is considered in this paper. The shape of the rocket is defined as a
cylinder with radius r and height h, then M can be expressed as
M = \begin{bmatrix} 0 & \frac{1}{2} h \|T\| \sin\delta_y & \frac{1}{2} h \|T\| \cos\delta_y \sin\delta_z \end{bmatrix}^{\top}    (16)
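Equation (16) can be evaluated as in the following minimal sketch, where the moment arm is half the cylinder height h; the function name is ours.

```python
# A minimal sketch of the thrust torque in Equation (16).
import numpy as np

def thrust_torque(thrust_mag, delta_y, delta_z, h):
    """Torque produced by the gimballed engine in the body frame."""
    return np.array([
        0.0,
        0.5 * h * thrust_mag * np.sin(delta_y),
        0.5 * h * thrust_mag * np.cos(delta_y) * np.sin(delta_z),
    ])
```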
• Penalty for fuel consumption: The landing performance is evaluated by fuel consump-
tion. The penalty aims to minimize fuel consumption and is defined by
R_{fuel} = \lambda_{fuel} \frac{\|T\|}{I_{sp}}    (20)
• Reward for accurate landing: The ideal landing location is set to be r = [0, 0, 0]^\top.
In order to encourage the rocket to land as close to the ideal landing location as
possible, we construct the landing position reward function in the form of a Gaussian
distribution (a code sketch of these Gaussian-shaped terms follows this list),

R_i = \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left( -\frac{r_i^2}{2\sigma_i^2} \right), \quad i = x, y, z    (21)

where σ_i determines the scope of the reward given; the closer the rocket is to the
ideal landing location, the larger the reward it obtains. The total reward is:

R_{pos} = \lambda_{pos} \frac{1}{3} \sum_{i = x, y, z} R_i    (22)
• Reward for Euler angle: Similar to the reward for velocity direction, we want to
stabilize the attitude motion of the rocket during landing and prefer to keep the Euler
angles of the rocket as close as possible to the predefined reference attitude φ = −90°,
ψ = 0° and γ = 0°. The reward functions can be expressed as:

R_\phi = \frac{1}{\sqrt{2\pi}\,\sigma_\phi} \exp\left( -\frac{(\phi + 90^\circ)^2}{2\sigma_\phi^2} \right)    (24)

R_\psi = \frac{1}{\sqrt{2\pi}\,\sigma_\psi} \exp\left( -\frac{\psi^2}{2\sigma_\psi^2} \right)    (25)

R_\gamma = \frac{1}{\sqrt{2\pi}\,\sigma_\gamma} \exp\left( -\frac{\gamma^2}{2\sigma_\gamma^2} \right)    (26)

where σ_φ, σ_ψ and σ_γ limit the range of rewards given, and the total reward is:

R_{angle} = \lambda_{angle} \frac{1}{3} \left( R_\phi + R_\psi + R_\gamma \right)    (27)

where λ_{angle} is a hyperparameter to adjust the size of the reward.
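The Gaussian-shaped reward terms and the fuel penalty above can be summarized in the following minimal Python sketch; the σ scales and the λ weights are hyperparameters whose values are not given in this excerpt, so the defaults below are placeholders.

```python
# A minimal sketch of the reward terms in Equations (20)-(27). All sigma and
# lambda values are placeholders, not the paper's tuned hyperparameters.
import numpy as np

def gaussian_reward(error, sigma):
    """Peak reward at zero error, as in Equations (21) and (24)-(26)."""
    return np.exp(-error**2 / (2.0 * sigma**2)) / (np.sqrt(2.0 * np.pi) * sigma)

def fuel_penalty(thrust_mag, isp=345.0, lam_fuel=1.0):
    """R_fuel = lambda_fuel * ||T|| / I_sp (Equation (20))."""
    return lam_fuel * thrust_mag / isp

def position_reward(r_vec, sigmas=(1.0, 1.0, 1.0), lam_pos=1.0):
    """R_pos = lambda_pos * (1/3) * sum of per-axis Gaussian rewards (Equation (22))."""
    return lam_pos * np.mean([gaussian_reward(r, s) for r, s in zip(r_vec, sigmas)])

def attitude_reward(phi_deg, psi_deg, gamma_deg, sigmas=(5.0, 5.0, 5.0), lam_angle=1.0):
    """R_angle for the reference attitude phi = -90 deg, psi = 0, gamma = 0 (Equation (27))."""
    errors = (phi_deg + 90.0, psi_deg, gamma_deg)
    return lam_angle * np.mean([gaussian_reward(e, s) for e, s in zip(errors, sigmas)])
```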
Finally, the overall reward function is given in the form of a summation of the above rewards and penalties.

The critic network takes the concatenation of state s_t and action a_t as input and outputs the value Q(s_t, a_t). The network architecture is shown in Figure 2.
Figure 2. The network architecture: (a) actor-critic framework; (b) actor network; (c) critic network.
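As a reading aid, the following PyTorch sketch mirrors the structure just described: the actor maps the state to actions in [−1, 1] and the critic maps the concatenated state and action to a scalar value. The hidden-layer sizes and activations are placeholders; the paper's exact architecture is not reproduced here.

```python
# A minimal sketch of the actor and critic networks in Figure 2.
# Hidden sizes and activations are placeholders, not the authors' settings.
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),  # actions in [-1, 1]
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # scalar value Q(s_t, a_t)
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))
```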
3. Results
In this section, we will demonstrate the simulation used to verify our proposed method
and present the results.
Parameters Values
Launch azimuth A 190◦
Latitude of launch point B 41◦ N
Longitude of launch point λ 100◦ E
Mass of rocket m 35,000 kg
Radius of rocket r 3.7 m
Height of rocket h 40 m
Minimum thrust Tmin 250,000 N
Maximum thrust Tmax 750,000 N
Reference area of rocket Sre f 10.75 m2
Specific impulse Isp 345 s
Landing zone area 10 m × 10 m
Acceptable vertical landing velocity [−2.5, 0] m/s
Acceptable translational landing velocity [0, 1] m/s
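For convenience, the rocket and environment constants listed above can be collected into a single configuration object, as in the following sketch; the field names are ours, not the authors'.

```python
# The simulation constants from the table above, gathered into one config
# object (a sketch; field names are our own choice).
from dataclasses import dataclass

@dataclass
class RocketConfig:
    mass: float = 35_000.0        # kg
    radius: float = 3.7           # m
    height: float = 40.0          # m
    t_min: float = 250_000.0      # N, minimum thrust
    t_max: float = 750_000.0      # N, maximum thrust
    s_ref: float = 10.75          # m^2, reference area
    isp: float = 345.0            # s, specific impulse
    launch_azimuth_deg: float = 190.0
    launch_lat_deg: float = 41.0  # deg N
    launch_lon_deg: float = 100.0 # deg E
```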
Parameters Values
Size of replay buffer |B| 10^6
Variance of policy noise σ 0.05
Maximum noise c 0.05
Discount factor γ 0.99
Learning rate of actor networks 5 × 10^−5
Learning rate of critic networks 5 × 10^−5
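For context, SD3 replaces the hard maximization in the critic target with a softmax-weighted value estimate over actions sampled around the target policy. The following simplified PyTorch sketch illustrates the idea; it omits the importance-sampling correction and the twin-critic bookkeeping of the full algorithm, and the sample count, noise scale, and inverse temperature β are placeholders rather than values from this paper.

```python
# A simplified sketch of SD3's softmax value estimate for the critic target.
# It uses a single target critic and skips importance-sampling weights.
import torch

def softmax_q_value(target_critic, target_actor, next_state,
                    num_samples=50, noise_std=0.05, beta=1e-3):
    """Softmax-weighted Q estimate over actions sampled around the target policy."""
    with torch.no_grad():
        base_action = target_actor(next_state)                        # [B, act_dim]
        noise = noise_std * torch.randn(num_samples, *base_action.shape)
        actions = (base_action.unsqueeze(0) + noise).clamp(-1.0, 1.0)  # [K, B, act_dim]

        # Evaluate the target critic for every sampled action.
        states = next_state.unsqueeze(0).expand(num_samples, *next_state.shape)
        q = target_critic(states.reshape(-1, next_state.shape[-1]),
                          actions.reshape(-1, base_action.shape[-1]))
        q = q.reshape(num_samples, -1)                                 # [K, B]

        # Softmax weighting over the sampled actions for each state in the batch.
        weights = torch.softmax(beta * q, dim=0)
        return (weights * q).sum(dim=0, keepdim=True).t()              # [B, 1]
```

The critic target is then r + γ · softmax_q_value(...), which takes the place of the max term in Equation (8).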
Figure 3. The reward received during training (episode reward versus epoch for DDPG, TD3, and SD3).
We can see from the figure that the rewards received by all three algorithms start to rise
at around 3000 epochs, indicating a significant improvement in the agent's control ability.
All three rewards converge after 6000 episodes of training, meaning that all three algorithms
demonstrate sufficient control ability. Among the three algorithms, SD3 gained the highest
reward, reaching around 175.
To validate the effectiveness of the trained agents, we further select 100 initial simulation
cases randomly according to Tables 1 and 2 and conduct 100 landing tests. The landing
trajectories are shown in Figure 4. According to the figure, all three algorithms are capable
of guiding the rocket to approach the landing zone. However, when controlled by DDPG,
the rocket failed to reach the ground successfully. When controlled by TD3, the rocket
performed abnormal turning and lateral movement during the final landing stage. In
comparison, the rocket controlled by SD3 managed to reach the landing zone vertically,
smoothly, and accurately.
Aerospace 2023, 10, 590 11 of 16
The comparison of landing points is shown in Figure 5. The mean and standard
deviation of the three sets of data are shown in Table 4. Since the DDPG agent
failed to guide the rocket to land on the ground, its points in the graph are the vertical
projections of the final position of the rocket onto the ground surface. Although the rockets
controlled by the TD3 agent successfully landed on the ground, all of their landing points lie
outside the landing zone. The landing points controlled by SD3 achieved the best results
compared to DDPG and TD3, with no landing points located outside the landing zone.
The comparison of the landing points of the three methods is listed in Table 5. The results
validate the effectiveness of the proposed SD3-based method.
Figure 4. Comparison of the landing trajectories. The view on the right of each graph is the enlarged
view of the final landing stage.
Figure 5. Landing points of (a) DDPG, (b) TD3, and (c) SD3.