Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides an unmanned ship pose control method based on the reinforcement learning PPO2 (Proximal Policy Optimization 2) algorithm, in which an integrated controller replaces the traditional inner- and outer-loop controllers and the PPO2 algorithm controls the pose and heading of the unmanned ship. Through reinforcement learning, the unmanned ship can be controlled effectively in a complex environment without depending on a specific model.
In order to solve the technical problems, the invention adopts the following technical scheme.
The invention discloses an unmanned ship pose control method based on reinforcement learning PPO2 algorithm, which comprises the following steps:
s1, modeling an unmanned ship environment:
the method comprises the following steps: designing a model of the unmanned ship; establishing the rules of the unmanned ship's operating environment; generating the start point and end point of the unmanned ship; converting the controller output into two PWM waves and converting the two PWM waves into two motor thrusts; designing a layered reward function to complete the run from the start point to the end point; and finally obtaining the real motor rotation speed through interaction between the simulation environment and the actual unmanned ship, and feeding the converted motor rotation speed into the neural network as the environment input;
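The PWM-wave-to-thrust conversion mentioned in step s1 can be sketched as follows; the linear mapping and the 20 N thrust limit are illustrative assumptions, not values specified by the invention (a real mapping would come from motor calibration):

```python
def pwm_to_thrust(duty, max_thrust=20.0):
    """Map a PWM duty cycle in [-1, 1] to motor thrust in newtons.

    Illustrative linear model; max_thrust is an assumed calibration value.
    """
    duty = max(-1.0, min(1.0, duty))  # saturate the PWM command
    return duty * max_thrust

left = pwm_to_thrust(0.5)    # left motor thrust
right = pwm_to_thrust(-0.25)  # right motor thrust (reverse)
```

Two such calls, one per motor, would produce the two thrust channels described above.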
s2, setting an action space and a state space according to the condition of the unmanned ship;
s3, setting a reward function: setting the target weights of the reward and designing the reward function according to the required control target of the unmanned ship, so as to control the unmanned ship;
s4, designing a deep neural network architecture:
the deep neural network architecture comprises a state value-function estimator network and a policy network; a complete Actor-Critic algorithm has two neural network structures, the Actor and the Critic;
s5, training a controller based on a PPO2 algorithm:
the pose controller of the unmanned ship is trained with the PPO2 algorithm. The total number of training episodes N is set; in each episode the unmanned ship exchanges information with the environment, i.e. the motion of the unmanned ship in the environment and the changes of its pose and position are simulated, and the interaction data are stored in an experience pool in time order regardless of the tracking result. When the experience pool is full, all data are taken out and the policy network parameters are iterated according to the PPO2 algorithm, until the set number of training episodes has been completed. The pose control result of the unmanned ship is then observed, and the learning step size, observation space, action space, training strategy and trained neural network are saved for the next run of the unmanned ship.
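The interact-store-update cycle of step s5 can be sketched as a Python skeleton; the `env`/`agent` interfaces and the pool size are hypothetical placeholders, not interfaces defined by the invention:

```python
def train(env, agent, n_episodes=100, pool_size=256):
    """Skeleton of the s5 training loop: interact with the environment,
    store transitions in the experience pool in time order, and run a
    PPO2 parameter iteration whenever the pool is full."""
    pool = []
    for episode in range(n_episodes):
        state = env.reset()
        done = False
        while not done:
            action = agent.act(state)
            next_state, reward, done = env.step(action)
            pool.append((state, action, reward, next_state, done))
            state = next_state
            if len(pool) >= pool_size:
                agent.update(pool)  # PPO2 parameter iteration on the batch
                pool.clear()        # on-policy data is discarded after use
```

Clearing the pool after each update reflects the on-policy nature of PPO2: samples generated by an older policy are not reused.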
Further, in step S3, since training aims to make the unmanned ship move toward the target point, the smaller the distance between the unmanned ship and the target point, the higher the reward obtained. In order to make the unmanned ship track the target smoothly, the speed of the unmanned ship is also made part of the reward design. The reward function used in the reinforcement learning algorithm for the unmanned ship target tracking problem is designed as follows:
r = -angle_normalize(x) - 0.1r² - 0.001(f₁+f₂)² - (u-0.5)² - 0.0001aᵤ²
The reward function takes the angle and speed of the unmanned ship as control targets; the normalization function wraps the input radian value into the range [-π, π], and a weight is set for the angular velocity term. This reward solves the problem of ineffective exploration of the unmanned ship under sparse rewards.
Further, in step S4, the Actor network has three layers. The input layer has 2 nodes, chosen according to the requirements of the controller, namely the heading angle ψ and the speed v; the hidden layer has 64 nodes; and the output layer has 2 nodes, namely the left motor control rate u_l(t) and the right motor control rate u_r(t). After u(t) is obtained, it is converted into a quantized rotation-speed value, from which the motor rotation speed is obtained. The hidden layer of the Critic network is the same as that of the Actor network; its input layer has 4 nodes, namely the heading angle ψ, the speed, the left motor control rate u_l(t) and the right motor control rate u_r(t). The heading angle and the speed are normalized by dividing by 45° and v_max respectively before being fed into the network, and the single node of the output layer is the value-function estimate V(t), used to evaluate the quality of the action. The weights of the Actor and Critic networks stop updating when training reaches the maximum number of updates or the error falls below the set threshold: the maximum number of updates of the Actor network is 200 with an error threshold of 0.005, and the maximum number of updates of the Critic network is 100 with an error threshold of 0.05.
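The 2-64-2 actor forward pass described above can be sketched as follows; the random weights and the tanh activations are illustrative assumptions (the patent does not name an activation function), and v_max is a placeholder:

```python
import math
import random

random.seed(0)

# Illustrative random weights for the 2-64-2 actor; a trained network
# would supply real values.
W1 = [[random.uniform(-0.1, 0.1) for _ in range(2)] for _ in range(64)]
W2 = [[random.uniform(-0.1, 0.1) for _ in range(64)] for _ in range(2)]

def actor_forward(psi_deg, v, v_max=1.0):
    """Normalize the inputs (heading / 45 deg, speed / v_max) and run
    the 2-64-2 MLP; outputs are the left/right motor control rates."""
    x = [psi_deg / 45.0, v / v_max]
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    out = [math.tanh(sum(w * hi for w, hi in zip(row, h))) for row in W2]
    u_l, u_r = out  # both bounded in [-1, 1] by tanh
    return u_l, u_r
```

The same 45° and v_max normalization would be applied to the first two inputs of the 4-input critic.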
Further, in step S5, wind-speed disturbance is added to the environment, and an integral compensator is introduced to ensure the stability of the system under disturbance. In addition, a maximum-entropy correction algorithm is added to the basic PPO2 algorithm, so that the underestimation caused by the maximum entropy is compensated while policy exploration is preserved, improving the learning efficiency of the algorithm.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The reinforcement learning algorithm adopted by the invention has a strong self-learning capability and can adapt to most complex environments after simple training, thereby realizing autonomous perception and better pose control of the unmanned ship.
2. The invention controls the unmanned ship through a reward-and-punishment function tailored to the characteristics of the unmanned ship, which increases the training speed under sparse rewards and enables the unmanned ship to approach the target better.
3. A delay link is added to the simulation environment of the unmanned ship, so that the real environment of the unmanned ship is simulated as closely as possible, preparing reinforcement learning for application to an actual unmanned ship.
4. A maximum-entropy correction algorithm is added to the basic PPO2 algorithm; the correction term compensates the underestimation caused by the maximum entropy while preserving policy exploration, thereby improving the learning efficiency of the algorithm.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings.
Step S1: establishing an unmanned ship environment model;
the symbol definition of the unmanned ship model is shown in the table:
the following vectors are noted in unmanned boat motion control:
η₁ = [x, y, z]ᵀ ∈ R³, η₂ = [φ, θ, ψ]ᵀ ∈ R³
v₁ = [u, v, w]ᵀ ∈ R³, v₂ = [p, q, r]ᵀ ∈ R³
τ₁ = [X, Y, Z]ᵀ ∈ R³, τ₂ = [K, M, N]ᵀ ∈ R³
where η is a position vector and a direction vector of the unmanned ship in the inertial coordinate system, v is a linear velocity vector and an angular velocity vector of the unmanned ship in the body coordinate system, and τ is a force vector and a moment vector of the unmanned ship in the body coordinate system.
The mathematical model of unmanned ship motion is:

η̇ = J(η)·v
M·v̇ + C(v)·v + D(v)·v + g(η) = τ

where J(η) is the coordinate system transformation matrix, M is the inertia matrix, C(v) is the Coriolis centripetal matrix, D(v) is the damping matrix, and g(η) is the restoring force.
The six-degree-of-freedom model of the unmanned surface vehicle is simplified, the movement of the unmanned surface vehicle in three directions of a vertical plane is ignored, and only the movement of the unmanned surface vehicle in three directions of a horizontal plane is considered.
The scalar form of the three-degree-of-freedom model of the unmanned boat is as follows:

m₁₁·u̇ = m₂₂·v·r − d₁₁·u + τᵤ
m₂₂·v̇ = −m₁₁·u·r − d₂₂·v
m₃₃·ṙ = (m₁₁ − m₂₂)·u·v − d₃₃·r + τᵣ

where m₁₁, m₂₂, m₃₃ are the diagonal elements of the rigid-body inertia matrix, d₁₁, d₂₂, d₃₃ are the diagonal elements of the damping matrix, and τᵤ, τᵣ are the surge force and yaw moment.
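Assuming the scalar model takes the standard surge-sway-yaw form, one explicit Euler integration step can be sketched as follows; the mass and damping values are placeholders, not identified parameters of the invention:

```python
def step_3dof(state, tau_u, tau_r, dt=0.05,
              m11=25.8, m22=33.8, m33=2.76,
              d11=12.0, d22=17.0, d33=0.5):
    """One Euler step of the 3-DOF surge (u), sway (v), yaw-rate (r)
    model; tau_u is the surge force and tau_r the yaw moment.
    All numeric parameters are illustrative placeholders."""
    u, v, r = state
    du = (m22 * v * r - d11 * u + tau_u) / m11
    dv = (-m11 * u * r - d22 * v) / m22
    dr = ((m11 - m22) * u * v - d33 * r + tau_r) / m33
    return (u + dt * du, v + dt * dv, r + dt * dr)
```

Repeated calls to `step_3dof` would simulate the motion of the unmanned ship in the training environment.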
S2: determining the action space and the observation space, set according to the condition of the unmanned ship;
S2.1: speed control of the unmanned ship. The action space is set to [-20, 20]; the state is the forward speed and acceleration of the unmanned ship on the water surface, with ranges [-1, 1] and [-0.1, 0.1] respectively.
S2.2: angle control of the unmanned ship. To control the swing angle, the swing moment of the unmanned ship must be controlled; the action space is [-2, 2]. The state is the angle, angular velocity and angular acceleration of the unmanned ship, with ranges [-1, 1], [-1, 1] and [-1.1, 1.1] respectively.
S2.3: simultaneous speed and angle control of the unmanned ship. The action space is set to [-2, 2]; the observed state is the running angle and forward speed of the unmanned ship, and the state space is: [-1, 1], [-1, 1], [-1.1, 1.1], [-1, 1], [-0.1, 0.1].
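Enforcing the action space of S2.3 amounts to saturating the raw policy output; a minimal sketch:

```python
ACTION_LOW, ACTION_HIGH = -2.0, 2.0  # action space of S2.3

def clip_action(a):
    """Saturate a raw policy output to the [-2, 2] action space."""
    return max(ACTION_LOW, min(ACTION_HIGH, a))
```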
S3: designing the pose reward function with comprehensive reference to the unmanned ship model;
The reward targets are obtained as follows:
In the speed control of the unmanned ship, the goal is a surface speed of 0.5 m/s, with the acceleration of the unmanned ship going to 0 as the speed approaches 0.5 m/s. To drive the unmanned ship toward this target, the reward function is set as:
r = -(u-0.5)² - 0.0001aᵤ²
In the angle control of the unmanned ship, the angle is driven to a specified value while the angular velocity and the angular acceleration are driven to 0; the reward function is set as:
r = -angle_normalize(x) - 0.1r² - 0.001N²
where the angle_normalize() function wraps the input radian value into the range [-π, π].
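A common implementation of such a wrapping function, offered here as a sketch (the modular form is an assumption about how angle_normalize is realized):

```python
import math

def angle_normalize(x):
    """Wrap an angle in radians into [-pi, pi), as used by the reward
    terms above."""
    return ((x + math.pi) % (2 * math.pi)) - math.pi
```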
In the simultaneous control of the angle and speed of the unmanned boat, the forward force and the swing moment of the unmanned boat are controlled at the same time, and the reward function is set as:
r = -angle_normalize(x) - 0.1r² - 0.001N² - (u-0.5)² - 0.0001aᵤ²
Meanwhile, in order to achieve a better training effect, a layered reward function is set and a boundary value is set for the motion environment of the unmanned ship: when the unmanned ship runs out of the boundary, the environment is reset. The layered reward function is as follows:
where et is the target boundary value and iet is the target reward.
Where mp is the boundary clipping value and bp is the penalty term.
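The layered reward can be sketched as follows; the concrete values of et, iet, mp and bp are illustrative, since the text leaves them unspecified here:

```python
def layered_reward(dist_to_goal, base_reward,
                   et=0.5, iet=10.0, mp=5.0, bp=-100.0):
    """Layered reward sketch: a bonus iet inside the target boundary et,
    and a penalty bp (with environment reset) beyond the arena boundary
    mp. All four constants are assumed values."""
    if dist_to_goal <= et:
        return base_reward + iet  # reached the target region
    if dist_to_goal >= mp:
        return bp                 # out of bounds: penalize, env resets
    return base_reward            # ordinary shaped reward otherwise
```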
S4: design deep neural network architecture
The deep neural network architecture includes a state value-function estimator network and a policy network. The algorithm has two neural network structures, the Actor and the Critic. The Actor network has three layers: the input layer has 2 nodes, chosen according to the requirements of the controller, namely the heading angle ψ and the speed v; the hidden layer has 64 nodes; and the output layer has 2 nodes, namely the left motor control rate u_l(t) and the right motor control rate u_r(t). After u(t) is obtained, it is converted into a quantized rotation-speed value, from which the motor rotation speed is obtained. The hidden layer of the Critic network is the same as that of the Actor network; its input layer has 4 nodes, namely the heading angle ψ, the speed, the left motor control rate u_l(t) and the right motor control rate u_r(t). The heading angle and the speed are divided by 45° and v_max respectively for normalization before being fed into the neural network, and the single node of the output layer is the value-function estimate V(t).
The action-selection procedure of the algorithm is shown in fig. 4. At each step of each episode the algorithm first selects an action; the strategy adopted in fig. 4 is called the action strategy and is denoted by β. However, β is not the optimal strategy: it is only used during training to generate actions for the environment so as to collect the desired data set, which is then used to train the strategy μ and obtain the optimal strategy. To balance the relation between exploration and exploitation, random noise N_t is introduced into the action selection, in the concrete form:
aₜ = μ(sₜ | θ^μ) + Nₜ
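A sketch of this action selection; Gaussian noise is assumed for N_t, and the noise scale sigma and the [-2, 2] clipping range are illustrative:

```python
import random

def select_action(mu, state, sigma=0.1, low=-2.0, high=2.0):
    """Action strategy beta: the deterministic policy output mu(state)
    plus Gaussian exploration noise N_t, clipped to the action space.
    sigma is an assumed noise scale."""
    a = mu(state) + random.gauss(0.0, sigma)
    return max(low, min(high, a))
```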
s5: controller training based on PPO2 algorithm;
The pose controller of the unmanned ship is trained with the PPO2 algorithm. The total number of training episodes N is set, and in each episode the unmanned ship exchanges information with the environment, i.e. the motion of the unmanned ship in the environment and the changes of its pose and position are simulated; the interaction data are stored in an experience pool in time order regardless of the tracking result. When the experience pool is full, all data are taken out and parameter iteration is performed on the policy network according to the PPO2 algorithm, until the set number of training episodes has been completed.
Three networks are implemented in the PPO2 algorithm: one critic network and two actor networks (a new actor and an old actor). The input of the actor network is the angle and speed of the unmanned ship; the output is a mean and a variance, from which a normal distribution is built, and the action is sampled from that distribution. The input of the critic network is the same as that of the actor, and its output is an advantage value used as the criterion for evaluating the quality of the action.
After the algorithm collects a batch of data, an estimated value function is obtained with the critic network; then, from this estimate and the reward stored at each moment of the batch, the value function at each moment in the batch is computed backwards with a certain discount rate γ, according to the formula:

V(sₜ) = rₜ + γ·V(sₜ₊₁)
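Assuming the value targets are formed by discounting backwards from the critic's bootstrap estimate of the final state, the computation can be sketched as:

```python
def discounted_returns(rewards, v_last, gamma=0.99):
    """Per-step value targets for a collected batch: bootstrap from the
    critic's estimate v_last of the state after the batch, then discount
    backwards through the stored rewards."""
    returns = []
    g = v_last
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()  # restore time order
    return returns
```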
When learning from this collected batch, the old actor network is used. The parameters of the new actor network that generated the batch are first copied into the old actor network, and then learning of the new actor network and the critic network begins. The speed and angle states stored in the batch are fed into the critic network, which outputs the estimated value function; the target value function is then computed, and finally the advantage function (TD error) is calculated. The TD error of the batch is then used to optimize the parameters of the new actor network N times, with the loss shown in the following formula:

L(θ) = -Êₜ[ min( rₜ(θ)·Âₜ, clip(rₜ(θ), 1-ε, 1+ε)·Âₜ ) ]

where rₜ(θ) is the probability ratio between the new and old actor networks and Âₜ is the advantage estimate.
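The per-sample clipped surrogate of PPO2 can be sketched as follows; eps = 0.2 is the conventional clipping range, assumed here since the text does not state it:

```python
def ppo2_clip_loss(ratio, advantage, eps=0.2):
    """PPO2 clipped surrogate for one sample: ratio is
    pi_new(a|s) / pi_old(a|s); the loss is the negated clipped
    objective, to be minimized over the batch."""
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1 + eps), 1 - eps) * advantage
    return -min(unclipped, clipped)
```

Averaging this loss over the batch and taking gradient steps with respect to the new actor's parameters implements the N optimization passes described above.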
after the training process is finished, the pose control result of the unmanned ship is observed, and the learning step length, the observation space, the action space, the training strategy and the trained neural network are saved and used as the next call of the unmanned ship.
Meanwhile, in order to make the training effect closer to reality with fewer training runs, a delay link is added to the unmanned ship's operating environment; the added delay is preprocessed in the neural network to obtain a processed state, and the unmanned ship is then trained with the PPO2 algorithm. To solve the value-function underestimation caused by the algorithm, a maximum-entropy correction is added to the PPO2 algorithm: an estimate of the state-action value function is built from the state value function and the policy function, and a new objective function is constructed from this state-action value function through the Bellman optimality equation. The new objective increases the expected return and the convergence speed of the algorithm, and contains one more correction term than the original objective. This correction term compensates the underestimation caused by the maximum entropy while preserving policy exploration, improving the learning efficiency of the algorithm.