
aerospace

Article
Powered Landing Control of Reusable Rockets Based on
Softmax Double DDPG
Wenting Li 1,2, Xiuhui Zhang 3, Yunfeng Dong 3, Yan Lin 1 and Hongjue Li 3,4,*

1 School of Automation Science and Electrical Engineering, Beihang University, Beijing 102206, China
2 Beijing Aerospace Automatic Control Institute, Beijing 100854, China
3 School of Astronautics, Beihang University, Beijing 102206, China
4 Shenzhen Institute of Beihang University, Shenzhen 518063, China
* Correspondence: lihongjue@buaa.edu.cn

Abstract: Multi-stage launch vehicles are currently the primary tool for humans to reach extraterres-
trial space. The technology of recovering and reusing rockets can effectively shorten rocket launch
cycles and reduce space launch costs. With the development of deep representation learning, rein-
forcement learning (RL) has become a robust learning framework capable of learning complex policies
in high-dimensional environments. In this paper, a deep reinforcement learning-based reusable rocket
landing control method is proposed. The mathematical process of reusable rocket landing is modelled
by considering the aerodynamic drag, thrust, gravitational force, and Earth’s rotation during the
landing process. A reward function is designed according to the rewards and penalties derived from
mission accomplishment, terminal constraints, and landing performance. Based on this, the Softmax
double deep deterministic policy gradient (SD3) deep RL method is applied to build a robust reusable
rocket landing control method. In the constructed simulation environment, the proposed method can
achieve convergent and robust control results, proving the effectiveness of the proposed method.

Keywords: reusable rocket; powered landing; reinforcement learning; vertical recycling

1. Introduction
With the expanding demand for launching satellites and spacecraft, expendable launch
vehicles are becoming one of the main factors restricting the exploration of extraterrestrial
space. Expendable launch vehicles discard critical components such as the engine and naviga-
tion equipment directly after completing their missions, raising the cost of space access [1].
To tackle the above problems, reusable launch vehicles have gained widespread attention
regarding their economy and flexibility [2–5].
Research in reusable launch vehicles has made continuous progress in recent years.
As an extension of parachute landing technology, the parachute recovery method was
widely studied and applied in various rockets [6–9]. Although it is a simple, reliable, and
economical method, the unpowered descent process makes it difficult to predict the final
landing point of the rocket. As a comparison, the powered landing control method has
gradually become the preferred option considering the safety, controllability, and rapidity
of the landing process. Academic research on powered landing generally applies offline or
online trajectory optimization followed by various tracking control methods for pinpoint
landing. Convex optimization is one of the typical approaches: the optimal landing problem
is transformed into a convex optimization problem from which an optimal landing trajectory
can be obtained [10–12]. Açıkmeşe et al. [10] introduced a convexification of the control
constraints that was proven to be lossless and used interior-point methods of convex optimization
to obtain optimal solutions of the original non-convex optimal control problem. Blackmore
[11] combined convex optimization by CVXGEN [13] with powered divert guidance to opti-
mize the trajectory and eliminate error. While convex-optimization-based methods can achieve
optimal solutions, they require the problem to be convex and often become computationally
inefficient as the complexity of the problem increases. Besides convex optimization, many
other numerical optimization methods can also be applied to solve the optimal landing
problem. Sagliano et al. [14] applied a multivariate pseudospectral interpolation scheme to
subspaces of a database of pre-computed trajectories to generate real-time capable entry
guidance solutions. Seelbinder et al. [15] introduced parametric sensitivity analysis of
non-linear programs for near-optimal control sequence generation. To take advantage
of both convex optimization and pseudospectral methods, Wang et al. [16] adopted a
pseudospectral-improved successive convexification algorithm to solve the trajectory opti-
mization problem. The above-mentioned methods greatly enriched the solution diversity of
the optimal landing problem, but they usually require extensive calculations and accurate
modelling of the environment, which are difficult to satisfy in practice.
Leveraging the continuous development and breakthrough of deep learning tech-
nology, the idea of using deep neural networks (DNNs) in guidance, navigation, and
control has been widely explored in the literature [17–20]. Among all deep learning tech-
nologies, supervised learning has been extensively discussed. DNNs have been applied
to asteroid landing trajectory optimization [17] and first-stage recovery flight of reusable
launch vehicles [18]. Furfaro et al. [19,20] designed a set of convolutional neural networks
(CNN) and long-short-term memory (LSTM) architectures to handle the problem of the
autonomous lunar landing. Although supervised learning exhibits strong fitting abilities
for optimal landing controls, acquiring sufficient data required for the training process
is very time-consuming [17,19]. An optimization method (e.g., GPOPS [19]) is usually
employed before supervised learning to generate optimal state-control pairs as training
samples (146,949 samples in [17] and 2,970,000 samples in [19]). Depending on the com-
plexity of the optimization problem, a single sample can take anywhere from a few seconds
to several hours to compute, and generating such a large number of samples consumes
considerable time.
Deep reinforcement learning (DRL), on the other hand, uses trial-and-error interaction
with the environment to optimize the policy instead of learning from enormous labelled
data, thus making it useful in many fields such as chess [21], game [22,23], investment [24],
and autonomous driving [25,26]. DRL has also drawn extensive attention in reusable launch
vehicle research. Ciabatti et al. [27] combined a deep deterministic policy gradient (DDPG) [28]
algorithm with transfer learning to tackle the problem of planetary landing on unknown
extra-terrestrial environments. Cheng et al. [29] derived an interactive reinforcement learn-
ing (RL) algorithm for the Moon fuel-optimal landing problem. Furfaro et al. [30] presented
an adaptive guidance algorithm based on zero-effort-miss/zero-effort-velocity (ZEM/ZEV)
and deterministic policy gradient [31]. Gaudet et al. [32] applied the principles of the
deep Q network [33] into a six degrees-of-freedom Mars landing simulation environment,
showing strong robustness.
In this paper, we address the final landing stage of a two-stage reusable rocket, con-
trolled using DRL algorithms. In contrast with existing applied RL algorithms, we apply
the Softmax double deep deterministic policy gradient (SD3) to keep the error between
the value function estimated by the neural network and the true value function within a
reasonable interval. This advantage is particularly useful when facing the coupling nature
of the 3-DOF translational and 3-DOF rotational motion of the rocket, as the value function
is difficult to accurately estimate in such scenarios. In addition, we designed a novel reward
function by introducing a state-dependent Gaussian distribution. The landing result was
evaluated by considering the landing position error, landing velocity, and landing impact
related to the final acceleration. To verify the effectiveness of our proposed method, we built
a three-dimensional simulation environment considering gravity, thrust, and aerodynamic
drag to test the trained landing control policy. Our contributions are listed as follows:
1. We introduced RL to obtain the optimal control commands for the reusable rocket
landing problem. We apply the SD3 algorithm to simultaneously predict both the
optimal thrust and its nozzle angle considering the three-axis translational and three-
axis rotational dynamics of the rocket as well as environmental disturbances and
uncertainties.
2. We designed a novel reward function specific for the reusable rocket landing problem
during the training of the RL models. We aim to give sufficient guidance to the rocket
in terms of position, velocity, thrust, and attitude angle. We introduced Gaussian
distribution rewards for position and attitude angle to give higher rewards under the
condition of a higher probability of a successful landing.

2. Materials and Methods


2.1. Mathematical Modelling of the Reusable Rocket Landing
This paper focuses on the power landing stage of the reusable rocket landing process,
where the position and attitude of the rocket are controlled by adjusting the thrust mag-
nitude and nozzle angle. Four coordinate systems are used in this paper: the geocentric
equatorial coordinate system, the launch point coordinate system, the body coordinate
system of the rocket, and the velocity coordinate system of the rocket. Their definitions and
relationships are derived in Appendix A.
The translational movement of the rocket is governed by the following translational
dynamics equations, expressed in the launch point coordinate system:

\[
\begin{cases}
\dot{r} = v \\
\dot{v} = \frac{T + D}{m} + g - a_e - a_k \\
\dot{m} = -\frac{\| T \|}{I_{sp}}
\end{cases}
\tag{1}
\]

where r = [x y z]^T ∈ R³ is the position vector, v = [v_x v_y v_z]^T ∈ R³ is the velocity vector,
m is the mass of the rocket, T is the thrust vector, D is the aerodynamic drag vector, g
is the acceleration of gravity, and I_sp is the specific impulse of the rocket. We define ω_e
as the angular velocity of the Earth's rotation; then a_e = ω_e × (ω_e × r) is the convected
acceleration and a_k = 2ω_e × δr/δt is the Coriolis acceleration.
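For illustration, the right-hand side of Equation (1) can be written as a short Python function. This is a sketch under our own naming assumptions; in particular, the Earth-rotation vector omega_e is assumed to be supplied already expressed in the launch point frame.

```python
import numpy as np

def translational_derivatives(r, v, m, thrust, drag, g, omega_e, isp):
    """Right-hand side of Equation (1); all vectors are 3-element arrays in the launch point frame."""
    a_e = np.cross(omega_e, np.cross(omega_e, r))  # convected (centrifugal) acceleration
    a_k = 2.0 * np.cross(omega_e, v)               # Coriolis acceleration
    r_dot = v
    v_dot = (thrust + drag) / m + g - a_e - a_k
    m_dot = -np.linalg.norm(thrust) / isp          # mass flow written exactly as in Equation (1)
    return r_dot, v_dot, m_dot
```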
The acceleration of gravity g can be decomposed into two components considering
the oblateness of the Earth, i.e., the vertical component g_r pointing from the Earth's centre
to the rocket's centre of mass, and the horizontal component g_m pointing from the rocket's
mass centre to North:

\[
\begin{cases}
g_r = -\frac{\mu}{\| r \|^2} \left[ 1 + \frac{3}{2} J_2 \left( \frac{a_e}{\| r \|} \right)^2 \left( 1 - 5 \sin^2 \Phi \right) \right] \\
g_m = -\frac{3\mu}{\| r \|^2} J_2 \left( \frac{a_e}{\| r \|} \right)^2 \sin \Phi
\end{cases}
\]

where μ = 3.986004 × 10^14 m³/s² is the geocentric gravitational constant, J_2 = 1.08263 × 10^−3 is
the oblateness of the Earth, a_e is the Earth's equatorial radius, and Φ is the latitude of the rocket.
The aerodynamic drag D is described as

\[
D = \frac{1}{2} \rho \| v \| S_{ref} C_D(Ma)\, v
\tag{2}
\]

where ρ is the atmospheric density determined by the height, v is the velocity of the rocket
relative to the atmosphere, S_ref is the reference area of the rocket, and Ma is the Mach
number. C_D is the drag coefficient, which is a non-linear function of the velocity v. The detailed
calculation of the aerodynamic drag D is discussed in Appendix B.
The rotational movement of the rocket is governed by the attitude dynamics equation
expressed as

\[
I \frac{d\omega_b}{dt} + \omega_b \times (I \omega_b) = M
\tag{3}
\]
where ω_b = [ω_x ω_y ω_z]^T is the angular velocity of the rocket, M is the external torque on
the rocket, and I is the inertia matrix of the rocket. Since the torque caused by aerodynamic drag
is relatively small compared to the torque generated by the engine, it is neglected in the
following discussion. The shape of the rocket is simplified and defined as a cylinder with
radius r and height h. Therefore, the inertia matrix can be expressed as

\[
I = \begin{bmatrix}
\frac{1}{2} m r^2 & 0 & 0 \\
0 & \frac{1}{12} m (3r^2 + h^2) & 0 \\
0 & 0 & \frac{1}{12} m (3r^2 + h^2)
\end{bmatrix}.
\tag{4}
\]
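A corresponding sketch of the rotational model in Equations (3) and (4), again with illustrative names; the angular acceleration is obtained by solving Equation (3) for dω_b/dt.

```python
import numpy as np

def cylinder_inertia(m, r, h):
    """Inertia matrix of Equation (4) for a uniform cylinder of mass m, radius r, and height h."""
    i_axial = 0.5 * m * r ** 2
    i_transverse = m * (3.0 * r ** 2 + h ** 2) / 12.0
    return np.diag([i_axial, i_transverse, i_transverse])

def angular_acceleration(omega_b, torque, inertia):
    """Solve Equation (3) for the body-frame angular acceleration d(omega_b)/dt."""
    return np.linalg.solve(inertia, torque - np.cross(omega_b, inertia @ omega_b))
```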

2.2. SD3 Reinforcement Learning Algorithm


The RL problem can be formulated by a Markov decision process (MDP). We consider
a standard RL setup consisting of an agent interacting with an environment in discrete
time steps. In the powered landing problem, the agent is considered as the rocket itself.
At each time step t, the agent receives an observation st , takes an action at and receives a
scalar reward r_t(s_t, a_t). An agent's behaviour is defined by a policy π(s_t), which outputs
the action a_t based on the state s_t. The environment changes from one state to another
according to the transition probability p(s_{t+1}|s_t, a_t). The return R_t is defined as the sum of
discounted future rewards with a discounting factor γ ∈ [0, 1]:

\[
R_t = \sum_{i=t}^{T} \gamma^{i-t} r(s_i, a_i)
\tag{5}
\]

RL aims to learn a policy that maximizes the expected return from the starting dis-
tribution, J = E_{r_i, s_i∼E, a_i∼π}[R_1]. Specifically, in the reusable rocket landing problem, the policy
π refers to an actor network that can output thrust control commands based on states
such as velocity and position, and the return refers to the reward given by the environment.
During the learning process, a rocket is randomly initialized repeatedly to land in a fixed
landing zone under the control of policy π, and the policy is updated according to the
environment-based rewards. The action-value function describes the expected return after
taking an action a_t in state s_t and thereafter following policy π:

\[
Q^{\pi}(s_t, a_t) = \mathbb{E}_{r_{i \ge t},\, s_{i > t} \sim E,\, a_{i > t} \sim \pi}\left[ R_t \right]
\tag{6}
\]

According to the Bellman equation, the action-value function can be expressed as

\[
Q^{\pi}(s_t, a_t) = \mathbb{E}_{r_t,\, s_{t+1} \sim E}\left[ r_t(s_t, a_t) + \gamma \max_{a_{t+1}} Q^{\pi}(s_{t+1}, a_{t+1}) \right]
\tag{7}
\]

As the first method to extend DQN (deep Q network) to a continuous control space,
DDPG is one of the standard algorithms for training an off-policy, model-free RL network.
DDPG learns both a policy network and a value function by decomposing them into an
actor network π(s|θ^π), a target actor network π′(s|θ^{π′}), a Q network Q(s, a|θ^Q), and a target
Q network Q′(s, a|θ^{Q′}). The parameterized actor network learns to determine the action
with the highest value estimate according to the critic network via the policy gradient.
The parameters of the target networks are periodically copied from the online networks
to ensure the stability and convergence of training while alleviating the overestimation of Q.
As an off-policy algorithm, DDPG takes advantage of a replay buffer D to store past experiences
and learns from them by randomly sampling from the replay buffer. The Q network is
updated by minimizing the loss L:

\[
L = \mathbb{E}_{(s,a,r,s') \sim D}\left[ \left( Q(s, a|\theta^Q) - \left( r + \gamma \max_{a'} Q'(s', a'|\theta^{Q'}) \right) \right)^2 \right]
\tag{8}
\]
The actor network is updated by the policy gradient:

\[
\nabla_{\theta^{\pi}} J \approx \mathbb{E}_{(s,a,r,s') \sim D}\left[ \nabla_a Q(s, a|\theta^Q) \big|_{a = \pi(s|\theta^{\pi})}\, \nabla_{\theta^{\pi}} \pi(s|\theta^{\pi}) \right]
\tag{9}
\]
TD3 (Twin-delayed deep deterministic policy gradient) is developed based on DDPG
with three main improvements: clipped double Q learning, delayed policy updates and
target policy smoothing.
• Clipped double Q
As an extension of DQN in continuous control, DDPG inherits the over-estimation
problem: the estimated Q value function is larger than the actual Q value function.
To alleviate this problem, TD3 uses two Q value function networks at the same time,
while in the update, the smaller value of the two networks is selected to calculate the
updating target for the value function

\[
y(r, s', d) = r + \gamma (1 - d) \min_{i=1,2} Q_{\phi_{i,\mathrm{targ}}}(s', a'(s'))
\]

• Delayed policy updates


Since an unstable value network can lead to a blind update of the policy network, the
policy network should be updated at a lower frequency than the value network. In this way,
the error of the target networks is reduced, avoiding policy updates based on high-error
states, which can cause divergent behaviour.
• Target policy smoothing
Due to the deterministic policy, the update target of the Q network is more susceptible
to the estimation error. TD3 introduces a regularization strategy for deep value
learning in calculating the target Q, adding a small amount of random noise to the
target policy and averaging over mini-batches.

\[
y = r + \gamma Q_{\theta'}(s', \pi_{\phi'}(s') + \epsilon), \quad \epsilon \sim \mathrm{clip}(\mathcal{N}(0, \sigma), -c, c)
\tag{10}
\]

This update style also leads to improvements in stochastic domains with failure
cases, making the estimation of Q more accurate and robust.
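For concreteness, the three modifications above can be combined into a short PyTorch-style sketch of the TD3 target computation. This is an illustrative sketch rather than the authors' implementation; actor_target, q1_target and q2_target are assumed network modules, and the (1 − d) termination mask follows the convention of Algorithm 1 below.

```python
import torch

def td3_target(reward, next_state, done, actor_target, q1_target, q2_target,
               gamma, sigma, noise_clip):
    """Clipped double-Q target with target policy smoothing (Equation (10))."""
    with torch.no_grad():
        a_next = actor_target(next_state)
        # Target policy smoothing: add clipped Gaussian noise to the target action.
        noise = (torch.randn_like(a_next) * sigma).clamp(-noise_clip, noise_clip)
        a_next = (a_next + noise).clamp(-1.0, 1.0)   # actions are normalized to [-1, 1]
        # Clipped double Q: take the smaller of the two target critics.
        q_min = torch.min(q1_target(next_state, a_next), q2_target(next_state, a_next))
        return reward + gamma * (1.0 - done) * q_min
```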
In this paper, we use the SD3 algorithm to alleviate the underestimation generated
by directly applying the double Q-learning algorithm. A key component in TD3 is the
clipped double Q-learning algorithm, which significantly reduces the overestimation but
also leads to a large underestimation bias. To handle this problem, SD3 investigates the use
of a Softmax operator in updating value functions in actor–critic methods for continuous
control. It proposes to estimate the target value for the critic Qi by

\[
y_i = r + \gamma T_{SD3}(s')
\tag{11}
\]

where

\[
T_{SD3}(s') = \mathrm{softmax}_{\beta}\left( \hat{Q}_i(s', \cdot) \right)
\tag{12}
\]

\[
\hat{Q}_i(s', a') = \min\left( Q_i(s', a'; \theta_i^{-}),\; Q_{-i}(s', a'; \theta_{-i}^{-}) \right)
\tag{13}
\]
By applying the Boltzmann Softmax operator for value function estimation, SD3 helps
to smooth the optimization landscape and improve the underestimation bias of TD3. The
SD3-based rocket recycling algorithm is shown in Algorithm 1.

Algorithm 1 SD3-based rocket recycling.

1:  Initialize critic networks Q_1, Q_2 and actor networks π_1, π_2 with random parameters θ_1, θ_2, φ_1, φ_2
2:  Initialize target networks θ_1^- ← θ_1, θ_2^- ← θ_2, φ_1^- ← φ_1, φ_2^- ← φ_2
3:  Initialize replay buffer B
4:  for episode t = 1 to T do
5:      Observe a rocket state s
6:      Select a thruster action a with exploration noise ε
7:      Execute action a, observe reward r, new state s′ and done flag d
8:      Store transition tuple (s, a, r, s′, d) in B
9:      while it is time for an update do
10:         Sample a mini-batch of N transitions (s, a, r, s′, d) from B
11:         Sample K noises ε ∼ N(0, σ̄)
12:         â′ ← π_i(s′; φ_i^-) + clip(ε, −c, c)
13:         Q̂(s′, â′) ← min_{j=1,2} Q_j(s′, â′; θ_j^-)
14:         softmax_β(Q̂_i(s′, ·)) ← E_{â′∼p}[exp(β Q̂(s′, â′)) Q̂(s′, â′) / p(â′)] / E_{â′∼p}[exp(β Q̂(s′, â′)) / p(â′)]
15:         y_i ← r + γ(1 − d) softmax_β(Q̂_i(s′, ·))
16:         Update the critic θ_i according to the Bellman loss: (1/N) Σ_s (Q_i(s, a; θ_i) − y_i)²
17:         Update the actor φ_i by the policy gradient: (1/N) Σ_s ∇_{φ_i} π(s; φ_i) ∇_a Q_i(s, a; θ_i)|_{a=π(s;φ_i)}
18:         Update target networks: θ_i^- ← τθ_i + (1 − τ)θ_i^-, φ_i^- ← τφ_i + (1 − τ)φ_i^-
19:     end while
20: end for
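The softmax target in lines 11–15 of Algorithm 1 can be estimated by importance-weighted sampling, as in the following PyTorch-style sketch. This is an illustrative reconstruction rather than the authors' code: the target critics are assumed to accept batched state–action tensors with an extra leading sample dimension, and the clipped-Gaussian proposal density is approximated by the unclipped Gaussian density.

```python
import torch

def sd3_softmax_target(reward, next_state, done, actor_target, q1_target, q2_target,
                       beta, gamma, sigma, noise_clip, num_samples=50):
    """Softmax value target of Algorithm 1 (lines 11-15) for one of the two critics."""
    with torch.no_grad():
        mu = actor_target(next_state)                                  # [B, act_dim]
        # Lines 11-12: K perturbed candidate actions a' = pi(s') + clip(eps, -c, c).
        eps = (torch.randn(num_samples, *mu.shape) * sigma).clamp(-noise_clip, noise_clip)
        a_hat = mu.unsqueeze(0) + eps                                  # [K, B, act_dim]
        s_rep = next_state.unsqueeze(0).expand(num_samples, *next_state.shape)
        # Line 13: clipped double Q on every candidate action.
        q_hat = torch.min(q1_target(s_rep, a_hat),
                          q2_target(s_rep, a_hat)).squeeze(-1)         # [K, B]
        # Line 14: self-normalized importance-weighted softmax estimator, using the
        # Gaussian density of the noise as the proposal density p(a'); computed in
        # log-space so the exp(beta * Q) terms stay numerically stable.
        log_p = torch.distributions.Normal(0.0, sigma).log_prob(eps).sum(dim=-1)
        weights = torch.softmax(beta * q_hat - log_p, dim=0)           # [K, B]
        softmax_q = (weights * q_hat).sum(dim=0)                       # [B]
        # Line 15: discounted target.
        return reward + gamma * (1.0 - done) * softmax_q
```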

2.3. Reinforcement Learning for Rocket Landing


In this section, we set up RL for reusable rocket landing, including action space,
observation space, reward function, and network structure.

2.3.1. Action Space


The rocket is equipped with an engine with a deflectable nozzle, as is shown in
Figure 1. The angle between T and the plane Ob xb zb is defined as δy , and the angle between
the projection of T on plane Ob xb zb and the opposite direction of Ob xb is defined as δz .
Therefore, the control variables of the rocket are || T ||, δy , and δz .

Figure 1. The direction of the thrust vector T in the body coordinate system is determined by δ_y
and δ_z.

To simplify the outputs of the agent for better numerical stability, the control variables
are constrained to [−1, 1]. The actual actions taken by the rocket can then be defined as
[a_1 a_2 a_3], and their relations with the original control variables are

\[
\begin{cases}
\| T \| = \frac{T_{max} - T_{min}}{2} a_1 + \frac{T_{max} + T_{min}}{2} \\
\delta_y = \delta_{max} a_2 \\
\delta_z = \delta_{max} a_3
\end{cases}
\tag{14}
\]
The thrust vector in the body coordinate system, T_b, can be expressed as

\[
T_b = \begin{bmatrix} \| T \| \cos\delta_y \cos\delta_z & \| T \| \sin\delta_y & \| T \| \cos\delta_y \sin\delta_z \end{bmatrix}^{\top}
\tag{15}
\]

Since the torque caused by aerodynamic drag is relatively small, only the torque
generated by thrust is considered in this paper. The shape of the rocket is defined as a
cylinder with radius r and height h; then M can be expressed as

\[
M = \begin{bmatrix} 0 & \frac{1}{2} h \| T \| \sin\delta_y & \frac{1}{2} h \| T \| \cos\delta_y \sin\delta_z \end{bmatrix}^{\top}
\tag{16}
\]
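Equations (14)–(16) amount to a simple mapping from the normalized action vector to the physical control quantities; a Python sketch with illustrative names is given below.

```python
import numpy as np

def action_to_controls(a, t_min, t_max, delta_max, h):
    """Map normalized actions [a1, a2, a3] in [-1, 1] to the thrust magnitude, nozzle
    deflection angles, body-frame thrust vector, and thrust torque (Equations (14)-(16))."""
    a1, a2, a3 = a
    thrust = 0.5 * (t_max - t_min) * a1 + 0.5 * (t_max + t_min)      # Equation (14)
    delta_y = delta_max * a2
    delta_z = delta_max * a3
    t_body = thrust * np.array([np.cos(delta_y) * np.cos(delta_z),   # Equation (15)
                                np.sin(delta_y),
                                np.cos(delta_y) * np.sin(delta_z)])
    torque = np.array([0.0,                                          # Equation (16)
                       0.5 * h * thrust * np.sin(delta_y),
                       0.5 * h * thrust * np.cos(delta_y) * np.sin(delta_z)])
    return thrust, delta_y, delta_z, t_body, torque
```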

2.3.2. Observation Space


To make sure the agent receives enough information about the current states to make
decisions during the reusable rocket landing process, we carefully design the information
the agent can obtain from the environment, including normalized position information,
velocity information, attitude angle information, and control information. Note that all the
actions are obtained from the previous time step. Besides traditional velocity information,
we introduce extra information for observation, i.e., the angle θv between the rocket’s
velocity and a reference direction, to better stabilize the landing control process. The
reference direction is defined to point from the rocket’s centre of mass to the ideal landing
position. All angles except the roll angle are normalized by sine functions, whereas the roll
angle requires both sine and cosine functions to cover [0◦ , 360◦ ]. The observation space at
time t can be expressed as

\[
s_t = \left[ \frac{r_x}{\| r \|}\;\; \frac{r_y}{\| r \|}\;\; \frac{r_z}{\| r \|}\;\; \frac{v_x}{\| v \|}\;\; \frac{v_y}{\| v \|}\;\; \frac{v_z}{\| v \|}\;\; \sin\theta_v\;\; \sin(\phi + 90^{\circ})\;\; \sin\psi\;\; \cos\psi\;\; \sin\gamma\;\; a_1\;\; a_2\;\; a_3 \right]_t
\tag{17}
\]
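A sketch of how the observation of Equation (17) could be assembled is shown below, following the equation exactly as printed (angles in radians, with the previous actions appended); the function name and argument layout are our own.

```python
import numpy as np

def build_observation(r, v, theta_v, phi, psi, gamma, prev_action):
    """Assemble the 14-element observation of Equation (17); angles are in radians."""
    attitude_terms = [np.sin(theta_v),
                      np.sin(phi + np.pi / 2.0),   # pitch measured about the -90 deg reference
                      np.sin(psi), np.cos(psi),
                      np.sin(gamma)]
    return np.concatenate([r / np.linalg.norm(r),
                           v / np.linalg.norm(v),
                           attitude_terms,
                           prev_action])
```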

2.3.3. Reward Function


The goal of the agent is to maximize the received reward, and a well-designed reward
function can guide the agent to learn more robustly. We design a reward function including
a successful landing reward, a failed landing penalty, a location-based reward, a velocity-
based reward, and an angle-based reward.
• Reward for successful landing: The agent receives a reward when the rocket suc-
cessfully lands at the target location. The conditions for a successful landing are denoted
C_1, which includes |x| ≤ x_lim, 0 ≤ |y| ≤ y_lim, |z| ≤ z_lim, ‖v_v‖ < v_vlim,
‖v_h‖ < v_hlim, |φ| ≤ φ_lim, |ψ| ≤ ψ_lim, and |γ| ≤ γ_lim. The reward can be defined
as follows:

\[
R_{success} = \begin{cases} \lambda_{success}, & C_1 \text{ satisfied} \\ 0, & C_1 \text{ not satisfied} \end{cases}
\tag{18}
\]

where λ_success is a positive constant hyperparameter, x_lim = z_lim = 10 m are the
length and width of the landing zone, y_lim = 0.5 m is the upper bound of the landing
height, v_vlim = 2 m/s and v_hlim = 1 m/s are the upper bounds of the vertical and
horizontal velocity, respectively, and φ_lim = ψ_lim = γ_lim = 10° are the upper bounds of the
rocket's landing Euler angles. Although there is no strict need to limit ψ, it is limited in the
simulation.
• Penalty for failed landing: The agent will receive a penalty if the landing fails, includ-
ing an excessive attitude angle or deviation from the landing zone, and the simulation
will be terminated. A failed landing is the exact opposite of a successful landing. The
penalty can be expressed as:

\[
R_{fail} = \begin{cases} \lambda_{fail}, & C_1 \text{ not satisfied} \\ 0, & C_1 \text{ satisfied} \end{cases}
\tag{19}
\]
• Penalty for fuel consumption: The landing performance is evaluated by fuel consump-
tion. This penalty aims to minimize fuel consumption and is defined by

\[
R_{fuel} = \lambda_{fuel} \frac{\| T \|}{I_{sp}}
\tag{20}
\]

• Reward for accurate landing: The ideal landing location is set to be r = [0, 0, 0]^T.
In order to encourage the rocket to land as close to the ideal landing location as
possible, we construct the landing position reward function in the form of a Gaussian
distribution,

\[
R_i = \frac{1}{\sqrt{2\pi} \sigma_i} \exp\left( -\frac{r_i^2}{2\sigma_i^2} \right), \quad i = x, y, z
\tag{21}
\]

where σ_i determines the scope of the reward given; the closer the rocket is to the
ideal landing location, the bigger the reward it will obtain. The total reward is:

\[
R_{pos} = \lambda_{pos} \frac{1}{3} \sum_{i=x,y,z} R_i
\tag{22}
\]

where λ_pos is a hyperparameter to adjust the size of the reward.


• Reward for velocity direction: We construct a reward for velocity direction to stabilize
the translational flight of the rocket. The reference direction is defined to point from
the rocket’s centre of mass to the ideal location, and the angle θv between the current
velocity and the reference direction is used as a guidance for learning. The smaller θv
is, the bigger reward the rocket will obtain. The reward can be described as:

\[
R_{vel} = \lambda_{vel} \cos\theta_v
\tag{23}
\]

• Reward for Euler angle: Similar to the reward for velocity direction, we want to
stabilize the attitude motion of the rocket during landing and prefer to keep the Euler
angles of the rocket as close as possible to the predefined reference attitude φ = −90°,
ψ = 0° and γ = 0°. The reward functions can be expressed as:

\[
R_{\phi} = \frac{1}{\sqrt{2\pi} \sigma_{\phi}} \exp\left( -\frac{(\phi + 90^{\circ})^2}{2\sigma_{\phi}^2} \right)
\tag{24}
\]

\[
R_{\psi} = \frac{1}{\sqrt{2\pi} \sigma_{\psi}} \exp\left( -\frac{\psi^2}{2\sigma_{\psi}^2} \right)
\tag{25}
\]

\[
R_{\gamma} = \frac{1}{\sqrt{2\pi} \sigma_{\gamma}} \exp\left( -\frac{\gamma^2}{2\sigma_{\gamma}^2} \right)
\tag{26}
\]

where σ_φ, σ_ψ and σ_γ limit the range of the rewards given, and the total reward is:

\[
R_{angle} = \lambda_{angle} \frac{1}{3} \left( R_{\phi} + R_{\psi} + R_{\gamma} \right)
\tag{27}
\]

where λ_angle is a hyperparameter to adjust the size of the reward.
Finally, the overall reward function is given in the form of a summation:

\[
R = R_{success} + R_{fail} + R_{fuel} + R_{pos} + R_{vel} + R_{angle}
\tag{28}
\]
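The complete reward of Equation (28) can be sketched as follows. The hyperparameter dictionary keys are our own, and the sign convention that λ_fail and λ_fuel are negative (consistent with their role as penalties and with the negative R_fuel values in Table 5) is an assumption.

```python
import numpy as np

def gaussian_reward(x, sigma):
    """Gaussian-shaped reward term used in Equations (21) and (24)-(26)."""
    return np.exp(-x ** 2 / (2.0 * sigma ** 2)) / (np.sqrt(2.0 * np.pi) * sigma)

def step_reward(r, phi, psi, gamma, theta_v, thrust, isp, landed, failed, hp):
    """Sum of the reward terms in Equation (28); `hp` is a dict of lambda/sigma hyperparameters."""
    reward = 0.0
    if landed:                                       # Equation (18)
        reward += hp["lambda_success"]
    if failed:                                       # Equation (19); lambda_fail assumed negative
        reward += hp["lambda_fail"]
    reward += hp["lambda_fuel"] * thrust / isp       # Equation (20); lambda_fuel assumed negative
    reward += hp["lambda_pos"] / 3.0 * sum(          # Equations (21)-(22)
        gaussian_reward(r[i], hp["sigma_pos"][i]) for i in range(3))
    reward += hp["lambda_vel"] * np.cos(theta_v)     # Equation (23)
    reward += hp["lambda_angle"] / 3.0 * (           # Equations (24)-(27)
        gaussian_reward(phi + np.pi / 2.0, hp["sigma_phi"])
        + gaussian_reward(psi, hp["sigma_psi"])
        + gaussian_reward(gamma, hp["sigma_gamma"]))
    return reward
```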

2.3.4. Network Architecture


We apply the widely used actor–critic framework to construct the RL algorithm. The
agent contains two types of networks: the actor network and the critic network. The actor
network takes state st as the input and outputs the action at . The critic network takes
the concatenation of state s_t and action a_t, and outputs the value Q(s_t, a_t). The network
architecture is shown in Figure 2.

[Figure 2 consists of three panels: (a) the actor–critic framework, in which the agent's actor and
critic networks interact with the environment through states, actions, and rewards; (b) the actor
network, which maps the state through three fully connected layers of 256 neurons with Tanh
activation functions to the action; (c) the critic network, which concatenates the state and action
and maps them through three fully connected layers of 256 neurons with ReLU activation
functions to the value.]

Figure 2. Network framework and architecture.
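A PyTorch sketch consistent with Figure 2 is given below; the output layers (a Tanh-bounded action head for the actor and a scalar linear head for the critic) are assumptions added so that the networks match the action and value dimensions described in the text.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Three fully connected layers of 256 neurons with Tanh activations (Figure 2b)."""
    def __init__(self, obs_dim=14, act_dim=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.Tanh(),
            nn.Linear(256, 256), nn.Tanh(),
            nn.Linear(256, 256), nn.Tanh(),
            nn.Linear(256, act_dim), nn.Tanh())   # final Tanh keeps actions within [-1, 1]

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    """Concatenated state and action through three FC layers of 256 neurons with ReLU activations (Figure 2c)."""
    def __init__(self, obs_dim=14, act_dim=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1))                    # scalar Q(s, a)

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))
```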

3. Results
In this section, we will demonstrate the simulation used to verify our proposed method
and present the results.

3.1. Environmental Settings


The values of the simulation parameters are listed in Table 1. The overall training
process is cut into episodes. In each episode, the position r x , ry , rz , velocity v x , vy , vz , and
attitude angle φ, ψ, γ of the rocket are initialized randomly according to Table 2. The
hyperparameters of the SD3 RL are shown in Table 3.

Table 1. Parameters of the simulation environment.

Parameters Values
Launch azimuth A 190◦
Latitude of launch point B 41◦ N
Longitude of launch point λ 100◦ E
Mass of rocket m 35,000 kg
Radius of rocket r 3.7 m
Height of rocket h 40 m
Minimum thrust T_min 250,000 N
Maximum thrust T_max 750,000 N
Reference area of rocket S_ref 10.75 m²
Specific impulse Isp 345 s
Landing zone area 10 m × 10 m
Acceptable vertical landing velocity [−2.5, 0] m/s
Acceptable translational landing velocity [0, 1] m/s

Table 2. Value ranges of the initial status of the rocket.

Initial States of the Rocket Value Ranges


X-axis component of the initial position r x [−500, 500] m
Y-axis component of the initial position ry 2500 m
Z-axis component of the initial position rz [−500, 500] m
X-axis component of the initial velocity v x [−5, 5] m/s
Y-axis component of the initial velocity vy −230 m/s
Z-axis component of the initial velocity vz [−5, 5] m/s
Initial pitch angle φ [−95, −85]◦
Initial yaw angle ψ [−5, 5]◦
Initial roll angle γ 0◦

Table 3. Hyperparameters of the SD3 RL algorithm.

Parameters Values
Size of replay buffer |B| 10^6
Variance of policy noise σ 0.05
Maximum noise c 0.05
Discount factor γ 0.99
Learning rate of actor networks 5 × 10^−5
Learning rate of critic networks 5 × 10^−5

3.2. Training Results


We apply DDPG, TD3, and SD3 separately to train our agents and compare their
landing control performance. Their reward curves are shown in Figure 3.


[Figure 3 plots the episode reward against the training epoch for the DDPG, TD3, and SD3 agents.]

Figure 3. The reward received during training.

We can see from the figure that rewards received by all three algorithms start to rise
at around 3000 epochs, indicating a significant improvement in the agent’s control ability.
All three rewards converge after 6000 episodes of training, meaning all three algorithms
demonstrate a sufficient control ability. Among the three algorithms, SD3 gained the highest
reward, reaching around 175.
To validate the effectiveness of the trained agent, we further select 100 initial simulation
cases randomly according to Tables 1 and 2 to conduct 100 landing tests. The landing
trajectories are shown in Figure 4. According to the figure, all three algorithms are capable
of guiding the rocket to approach the landing zone. However, when controlled by DDPG,
the rocket failed to reach the ground successfully. When controlled by TD3, the rocket
performs abnormal turning and lateral movement during the final landing stage. In
comparison, the rocket controlled by SD3 managed to reach the landing zone vertically,
smoothly and accurately.

The comparison of landing points is shown in Figure 5. The mean and standard
deviation of the three sets of data are shown in Table 4. The DDPG agent
failed to guide the rocket to land on the ground, so the points in the graph are the vertical
projections of the final position of the rocket onto the ground surface. Although the rockets
controlled by the TD3 agent successfully landed on the ground, all of their landing points are
outside the landing zone. The landing points controlled by SD3 achieved the best results
compared to DDPG and TD3, with no landing points located outside the landing zone.
The average reward function components of the three methods are listed in Table 5. The results
validate the effectiveness of the proposed SD3-based method.

[Figure 4 shows the 100 landing trajectories in three panels: (a) trajectories controlled by the
DDPG agent, (b) trajectories controlled by the TD3 agent, (c) trajectories controlled by the SD3
agent. Each panel includes an enlarged view of the final landing stage.]

Figure 4. Comparison of the landing trajectories. The view on the right of each graph is the enlarged
view of the final landing stage.

[Figure 5 shows the landing points in three panels: (a) landing points of the DDPG agent,
(b) landing points of the TD3 agent, (c) landing points of the SD3 agent.]

Figure 5. Comparison of landing points. The grey zone is the desired 10 m × 10 m landing zone. The
red points lie outside of the landing zone, the blue points are within the landing zone.

Table 4. Comparison of the landing points.

Landing Point Coordinates    DDPG                     TD3                      SD3
Average                      (5.534, −16.955) m       (−5.795, −9.316) m       (1.074, −5.133) m
Standard deviation           (11.563, 17.219) m       (8.746, 12.064) m        (4.173, 2.987) m
Best                         (−1.322, −1.549) m       (−2.316, −0.248) m       (−0.217, −0.534) m
Worst                        (−13.712, −46.027) m     (−23.674, −25.946) m     (−2.387, −9.976) m

Table 5. Comparison of the reward function components for the different algorithms.

Reward Function Components    DDPG       TD3        SD3
R_fuel                        −0.147     −0.142     −0.118
R_pos                          0.318      0.348      0.368
R_vel                          0.441      0.454      0.469
R_angle                        0.131      0.135      0.138

Figures 6–8 show the detailed landing status and control commands of the 26th
trajectory controlled by the three algorithms, respectively.

[Figure 6 plots the landing velocity components v_x, v_y, v_z (m/s) against time (s) in three panels:
(a) DDPG controlled landing velocity, (b) TD3 controlled landing velocity, (c) SD3 controlled
landing velocity.]

Figure 6. Comparison of the landing velocities.

[Figure 7 plots the roll, pitch, and yaw Euler angles (degrees) against time (s) in three panels:
(a) DDPG controlled landing Euler angle, (b) TD3 controlled landing Euler angle, (c) SD3
controlled landing Euler angle.]

Figure 7. Comparison of the rocket Euler angle.

[Figure 8 plots the thrust throttle (%) and the nozzle deflection angles δ_y and δ_z (degrees)
against time (s) in three panels: (a) DDPG controlled landing command, (b) TD3 controlled
landing command, (c) SD3 controlled landing command.]

Figure 8. Comparison of the landing control commands.

4. Discussion
In RL, an optimal policy is linked to the optimal value function. When the reward
function is fixed, the obtained landing policy is directly related to the estimation of the
value function.
The DDPG method uses a Bellman target whose maximization step tends to overestimate the
value function. Under this circumstance, the agent is more likely to explore areas with low
rewards, making the training process unstable.
The TD3 method uses two value networks with the same structure but different learned
parameters to estimate the value function. It takes the smaller value as the estimation result.

This operation can avoid overestimation of the value function, but introduces the problem
of underestimation. Therefore, although the training process of the TD3 algorithm is
more stable compared to DDPG, the convergence speed of the algorithm is slower and the
algorithm converges easily to areas with lower rewards, making the agents land poorly.
In comparison, the SD3 method applies a Softmax operator to the value function estimate,
keeping the error between the estimated and the true value function within a controllable range.
A better estimation of the value function leads to a better landing performance.

5. Conclusions
In this paper, an SD3-based RL method with a newly designed reward function is pro-
posed to solve the reusable rocket landing control problem. Benefiting from the accurate
estimation of the value function, the proposed method outperforms the widely used DDPG
and TD3 algorithms and has been proven to be effective through numerical simulations
conducted on 100 randomly selected cases.

Author Contributions: Conceptualization, Y.L. and H.L.; methodology, H.L.; software, W.L.; val-
idation, W.L., X.Z. and H.L.; formal analysis, X.Z.; investigation, W.L. and X.Z.; resources, Y.D.;
data curation, X.Z.; writing—original draft preparation, W.L.; writing—review and editing, X.Z.;
visualization, H.L.; supervision, Y.D. and Y.L.; project administration, H.L.; funding acquisition, H.L.
All authors have read and agreed to the published version of the manuscript.
Funding: This research was funded by Guangdong Basic and Applied Basic Research Foundation
grant number 2022A1515110837.
Data Availability Statement: The data presented in this study are available on request from the
corresponding author. The data are not publicly available due to privacy.
Conflicts of Interest: The authors declare no conflicts of interest.

Abbreviations
The following abbreviations are used in this manuscript:

DDPG Deep deterministic policy gradient


TD3 Twin-delayed deep deterministic policy gradient
SD3 Softmax double deep deterministic policy gradient

Appendix A
Four coordinate systems are used in this work, namely the geocentric equatorial
coordinate system, the launch point coordinate system, the body coordinate system of the
rocket and the velocity coordinate system of the rocket. The relationships between them
are described below.
The geocentric equatorial coordinate system Oe xe ye ze : The origin Oe is located at the
centre of the Earth, the x-axis points to the intersection of the prime meridian and the
equatorial plane, the z-axis points to the direction of the Earth’s North pole, and the y-axis
direction satisfies the requirements of the right-handed coordinate system.
The launch point coordinate system Oo xo yo zo : The origin Oo is at the launch point of
the rocket, and the Earth is considered an ellipsoid, with the y-axis coinciding with the
principal normal of the ellipsoidal plane at the origin of the coordinates and the x-axis
pointing in the launch direction. In this paper, since the study is about reusable rocket
landing, the landing target point is set as the origin of the launch point coordinate system
for the convenience of describing the study. The angle between the y-axis and the equatorial
plane is noted as latitude B, and the longitude where the launch point is located is noted
as λ. In the horizontal plane, the angle between the x-axis and direct North of the local
meridian tangent is the launch azimuth, noted as A.
The body coordinate system of the rocket Ob xb yb zb : The origin Ob is at the rocket's
centre of mass, and the x-axis is along the longitudinal axis of the rocket, pointing in the
direction of the tail of the rocket. The z-axis is perpendicular to the main symmetry plane
of the rocket, and it points to the right when viewed along the direction of motion. It is
used to describe the rocket’s attitude.
The velocity coordinate system of the rocket Ov xv yv zv : The origin Ov is at the rocket’s
centre of mass, the x-axis is along the direction of the rocket’s velocity, the z-axis is perpen-
dicular to the main plane of symmetry of the rocket, and the z-axis points to the right when
viewed along the direction of motion. It is used for the calculation of aerodynamic drag.


Figure A1. The relationships between the coordinate systems. (a) The rotation from the geocentric
equatorial system Oe xe ye ze to the launch point coordinate system Oo xo yo zo . (b) The rotation from
the launch point system Oo xo yo zo to the body coordinate system Ob xb yb zb . (c) The rotation from the
body coordinate system Ob xb yb zb to the velocity coordinate system Ov xv yv zv .

The relationship of the coordinate systems can be described using rotation matrices.
The geocentric equatorial coordinate system is rotated by −(90◦ − λ) around the z-axis,
then by B around the x-axis, and finally by −(90◦ + A) around the y-axis to coincide with
the launch point coordinate system. The rotation matrix Ceo can be written as

\[
C_{eo} = L_y(-(90^{\circ} + A))\, L_x(B)\, L_z(-(90^{\circ} - \lambda))
\tag{A1}
\]


The rotation from the launch point coordinate system Oo xo yo zo to the body coordinate
system Ob xb yb zb is defined by three angles: the pitch angle φ, the yaw angle ψ and the roll
angle γ. We define the angle between the projection of the Ob xb in Oo xo yo planes and the
Oo xo as φ, the angle between the Ob xb and Oo xo yo planes as ψ, and the rocket’s angle of
rotation around Ob xb as γ. The launch point coordinate system is rotated by φ around the
z-axis, then by ψ around the y-axis, and finally by γ around the x-axis to coincide with the
body coordinate system. The rotation matrix Cob can be written as

\[
C_{ob} = L_x(\gamma)\, L_y(\psi)\, L_z(\phi)
\tag{A2}
\]


The rotation from the body coordinate system Ob xb yb zb to the velocity coordinate
system Ov xv yv zv is defined by two angles: the angle of attack α and the angle of sideslip
β. We define the angle between the projection of Ov xv in the Ob xb yb plane and the Ob xb as
α, and the angle between the Ov xv and Ob xb yb planes as β. The body coordinate system
is rotated by α around the z-axis and by β around the y-axis to coincide with the velocity
coordinate system. The rotation matrix Cbv can be written as

\[
C_{bv} = L_y(\beta)\, L_z(\alpha)
\tag{A3}
\]
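For reference, the elementary rotations and the composite matrices of Equations (A1)–(A3) can be written compactly as below; the sign convention of the elementary frame-rotation matrices L_x, L_y, L_z is an assumption and may need to be adapted to the convention actually used.

```python
import numpy as np

def L_x(a):
    """Elementary frame rotation about the x-axis by angle a (rad)."""
    c, s = np.cos(a), np.sin(a)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, s], [0.0, -s, c]])

def L_y(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0.0, -s], [0.0, 1.0, 0.0], [s, 0.0, c]])

def L_z(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, s, 0.0], [-s, c, 0.0], [0.0, 0.0, 1.0]])

def c_eo(A, B, lam):
    """Rotation matrix of Equation (A1): geocentric equatorial to launch point frame."""
    return L_y(-(np.pi / 2 + A)) @ L_x(B) @ L_z(-(np.pi / 2 - lam))

def c_ob(phi, psi, gamma):
    """Rotation matrix of Equation (A2): launch point to body frame."""
    return L_x(gamma) @ L_y(psi) @ L_z(phi)

def c_bv(alpha, beta):
    """Rotation matrix of Equation (A3): body to velocity frame."""
    return L_y(beta) @ L_z(alpha)
```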

Appendix B
In this paper, we decompose aerodynamic drag D in the direction of the x-axis, y-axis
and z-axis of the body coordinate system as

\[
D = X_1 + Y_1 + Z_1
\tag{A4}
\]
where X_1, Y_1 and Z_1 are in the direction of the x-axis, y-axis, and z-axis of the body
coordinate system, respectively, which can be described as

\[
\begin{cases}
X_1 = C_x \frac{1}{2} \rho v^2 S_{ref} = C_x q S_{ref} \\
Y_1 = C_y \frac{1}{2} \rho v^2 S_{ref} = C_y q S_{ref} \\
Z_1 = C_z \frac{1}{2} \rho v^2 S_{ref} = C_z q S_{ref}
\end{cases}
\tag{A5}
\]
where ρ is the atmospheric density, v is the rocket's velocity relative to the atmosphere, S_ref
is the reference area of the rocket, q = ½ρv² is the dynamic pressure, and C_x, C_y and C_z are the
axial force coefficient, normal force coefficient, and lateral force coefficient, respectively.
We define the angle between O_b x_b and O_v x_v as the total angle of attack η, which can be
expressed as

\[
\cos\eta = \cos\alpha \cos\beta
\tag{A6}
\]
Then the aerodynamic coefficients C_x, C_y and C_z can be expressed as

\[
\begin{cases}
C_x = -C_A - C_A^{\eta^2} \eta^2 \\
C_y = C_N^{\eta}\, \eta\, \frac{\cos\beta \sin\alpha}{\sin\eta} \\
C_z = -C_N^{\eta}\, \eta\, \frac{\sin\beta}{\sin\eta}
\end{cases}
\tag{A7}
\]

When η = 0,

\[
\begin{cases}
C_x = -C_A - C_A^{\eta^2} \eta^2 \\
C_y = 0 \\
C_z = 0
\end{cases}
\tag{A8}
\]

where C_A, C_A^{η²} and C_N^{η} are aerodynamic parameters.
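A sketch of the body-frame aerodynamic force of Equations (A4)–(A8) is given below; the aerodynamic parameters are passed in as plain numbers here, although in practice they depend on the Mach number and would be interpolated from aerodynamic tables.

```python
import numpy as np

def aerodynamic_force_body(rho, v, s_ref, alpha, beta, c_a, c_a_eta2, c_n_eta):
    """Body-frame aerodynamic force of Equations (A5)-(A8); the coefficients
    c_a, c_a_eta2 and c_n_eta correspond to C_A, C_A^{eta^2} and C_N^{eta}."""
    q = 0.5 * rho * v ** 2                               # dynamic pressure
    eta = np.arccos(np.cos(alpha) * np.cos(beta))        # total angle of attack, Equation (A6)
    c_x = -c_a - c_a_eta2 * eta ** 2
    if np.isclose(eta, 0.0):                             # Equation (A8)
        c_y, c_z = 0.0, 0.0
    else:                                                # Equation (A7)
        c_y = c_n_eta * eta * np.cos(beta) * np.sin(alpha) / np.sin(eta)
        c_z = -c_n_eta * eta * np.sin(beta) / np.sin(eta)
    return q * s_ref * np.array([c_x, c_y, c_z])
```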

References
1. Cui, N.; Wu, R.; Wei, C.; Xu, D.; Zhang, L. Development and Key Technologies of Vertical Takeoff Vertical Landing Reusable
Launch Vehicle. Astronaut. Syst. Eng. Technol. 2018, 2, 27–42.
2. Jo, B.U.; Ahn, J. Optimal staging of reusable launch vehicles for minimum life cycle cost. Aerosp. Sci. Technol. 2022, 127, 107703.
[CrossRef]
3. Jones, H. The recent large reduction in space launch cost. In Proceedings of the 48th International Conference on Environmental
Systems, Albuquerque, NM, USA, 8–12 July 2018.
4. Sivan, K.; Pandian, S. An overview of reusable launch vehicle technology demonstrator. Curr. Sci. 2018, 114, 38–47. [CrossRef]
5. de Mirand, A.P.; Bahu, J.M.; Louaas, E. Ariane Next, a vision for a reusable cost efficient European rocket. In Proceedings of the
8th European Conference for Aeronautics and Space Sciences (EUCASS), Madrid, Spain, 1–4 July 2019.
6. Meyerson, R.; Taylor, A. A status report on the development of the Kistler Aerospace K-1 Reusable Launch Vehicle. In Proceedings
of the 16th AIAA Aerodynamic Decelerator Systems Technology Conference and Seminar, Boston, MA, USA, 21–24 May 2001;
p. 2069.
7. Gardinier, D.; Taylor, A. Design and testing of the K-1 reusable launch vehicle landing system airbags. In Proceedings of the 15th
Aerodynamic Decelerator Systems Technology Conference, Toulouse, France, 9–11 June 1999; p. 1757.
8. King, R.; Hengel, J.; Wolf, D. Ares I First Stage Booster Deceleration System-An Overview. In Proceedings of the 20th AIAA
Aerodynamic Decelerator Systems Technology Conference and Seminar, Seattle, WA, USA, 4–7 May 2009; p. 2984.
9. Davis, S. Ares IX Flight Test: The Future Begins Here. In Proceedings of the AIAA SPACE 2008 Conference & Exposition, San
Diego, CA, USA, 9–11 September 2008; p. 7806.
10. Açıkmeşe, B.; Carson, J.M.; Blackmore, L. Lossless convexification of nonconvex control bound and pointing constraints of the
soft landing optimal control problem. IEEE Trans. Control Syst. Technol. 2013, 21, 2104–2113. [CrossRef]
11. Blackmore, L. Autonomous precision landing of space rockets. In Frontiers of Engineering: Reports on Leading-Edge Engineering
from the 2016 Symposium; The Bridge: Washington, DC, USA, 2016; Volume 46, pp. 15–20.
12. Blackmore, L.; Açikmeşe, B.; Scharf, D.P. Minimum-landing-error powered-descent guidance for Mars landing using convex
optimization. J. Guid. Control Dyn. 2010, 33, 1161–1171. [CrossRef]

13. Mattingley, J.; Boyd, S. CVXGEN: A code generator for embedded convex optimization. Optim. Eng. 2012, 13, 1–27. [CrossRef]
14. Sagliano, M.; Mooij, E.; Theil, S. Onboard trajectory generation for entry vehicles via adaptive multivariate pseudospectral
interpolation. J. Guid. Control Dyn. 2016, 40, 466–476. [CrossRef]
15. Seelbinder, D.; Büskens, C. Real-time atmospheric entry trajectory computation using parametric sensitivities. In Proceedings of
the 6th International Conference on Astrodynamics Tools and Techniques, Darmstadt, Germany, 14–17 March 2016.
16. Wang, J.; Cui, N.; Wei, C. Optimal rocket landing guidance using convex optimization and model predictive control. J. Guid.
Control Dyn. 2019, 42, 1078–1092. [CrossRef]
17. Cheng, L.; Wang, Z.; Jiang, F.; Li, J. Fast generation of optimal asteroid landing trajectories using deep neural networks. IEEE
Trans. Aerosp. Electron. Syst. 2019, 56, 2642–2655. [CrossRef]
18. Song, Y.; Miao, X.; Cheng, L.; Gong, S. The feasibility criterion of fuel-optimal planetary landing using neural networks. Aerosp.
Sci. Technol. 2021, 116, 106860. [CrossRef]
19. Furfaro, R.; Bloise, I.; Orlandelli, M.; Di Lizia, P.; Topputo, F.; Linares, R. Deep learning for autonomous lunar landing. Adv.
Astronaut. Sci. 2018, 167, 3285–3306.
20. Furfaro, R.; Bloise, I.; Orlandelli, M.; Di Lizia, P.; Topputo, F.; Linares, R. A recurrent deep architecture for quasi-optimal feedback
guidance in planetary landing. In Proceedings of the IAA SciTech Forum on Space Flight Mechanics and Space Structures and
Materials, Moscow, Russia, 13–15 November 2018; pp. 1–24.
21. Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L.; Lai, M.; Bolton, A.; et al.
Mastering the game of go without human knowledge. Nature 2017, 550, 354–359. [CrossRef] [PubMed]
22. Ye, D.; Liu, Z.; Sun, M.; Shi, B.; Zhao, P.; Wu, H.; Yu, H.; Yang, S.; Wu, X.; Guo, Q.; et al. Mastering complex control in moba
games with deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA,
7–12 February 2020; Volume 34, pp. 6672–6679.
23. Vinyals, O.; Babuschkin, I.; Czarnecki, W.M.; Mathieu, M.; Dudzik, A.; Chung, J.; Choi, D.H.; Powell, R.; Ewalds, T.; Georgiev, P.;
et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 2019, 575, 350–354. [CrossRef] [PubMed]
24. Soleymani, F.; Paquet, E. Financial portfolio optimization with online deep reinforcement learning and restricted stacked
autoencoder—DeepBreath. Expert Syst. Appl. 2020, 156, 113456. [CrossRef]
25. Palanisamy, P. Multi-agent connected autonomous driving using deep reinforcement learning. In Proceedings of the 2020
International Joint Conference on Neural Networks (IJCNN), Glasgow, UK, 19–24 July 2020; pp. 1–7.
26. Yu, C.; Wang, X.; Xu, X.; Zhang, M.; Ge, H.; Ren, J.; Sun, L.; Chen, B.; Tan, G. Distributed multiagent coordinated learning for
autonomous driving in highways based on dynamic coordination graphs. IEEE Trans. Intell. Transp. Syst. 2019, 21, 735–748.
[CrossRef]
27. Ciabatti, G.; Daftry, S.; Capobianco, R. Autonomous planetary landing via deep reinforcement learning and transfer learning. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021;
pp. 2031–2038.
28. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep
reinforcement learning. arXiv 2015, arXiv:1509.02971.
29. Cheng, L.; Wang, Z.; Jiang, F. Real-time control for fuel-optimal Moon landing based on an interactive deep reinforcement
learning algorithm. Astrodynamics 2019, 3, 375–386. [CrossRef]
30. Furfaro, R.; Scorsoglio, A.; Linares, R.; Massari, M. Adaptive generalized ZEM-ZEV feedback guidance for planetary landing via
a deep reinforcement learning approach. Acta Astronaut. 2020, 171, 156–171. [CrossRef]
31. Silver, D.; Lever, G.; Heess, N.; Degris, T.; Wierstra, D.; Riedmiller, M. Deterministic policy gradient algorithms. In Proceedings of
the International Conference on Machine Learning, Beijing, China, 21–26 June 2014; pp. 387–395.
32. Gaudet, B.; Linares, R.; Furfaro, R. Deep reinforcement learning for six degree-of-freedom planetary landing. Adv. Space Res.
2020, 65, 1723–1741. [CrossRef]
33. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing atari with deep
reinforcement learning. arXiv 2013, arXiv:1312.5602.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.
