Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides an unmanned ship pose control method based on the reinforcement learning PPO2 (Proximal Policy Optimization 2) algorithm, in which an integrated controller replaces the traditional inner- and outer-loop controllers and the PPO2 algorithm controls the pose and heading of the unmanned ship. Through reinforcement learning, the unmanned ship can be controlled effectively in a complex environment without depending on a specific model.
In order to solve the technical problems, the invention adopts the following technical scheme.
The invention discloses an unmanned ship pose control method based on reinforcement learning PPO2 algorithm, which comprises the following steps:
s1, modeling an unmanned ship environment:
the method comprises the following steps: designing a model of the unmanned ship; establishing the rules of the unmanned ship's operating environment; generating the start point and end point of the unmanned ship; converting the controller output into two PWM waves and converting the two PWM waves into two motor thrusts; designing a layered reward function to complete the run from the start point to the end point; and finally obtaining the real motor rotation speed through interaction between the simulation environment and the actual unmanned ship, and feeding the converted motor rotation speed into the neural network as the environment input;
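The PWM-wave-to-thrust conversion mentioned in step s1 can be sketched as follows; the linear mapping and the 20 N thrust limit are illustrative assumptions, not values specified by the invention (a real mapping would come from motor calibration):

```python
def pwm_to_thrust(duty, max_thrust=20.0):
    """Map a PWM duty cycle in [-1, 1] to motor thrust in newtons.

    Illustrative linear model; max_thrust is an assumed calibration value.
    """
    duty = max(-1.0, min(1.0, duty))  # saturate the PWM command
    return duty * max_thrust

left = pwm_to_thrust(0.5)    # left motor thrust
right = pwm_to_thrust(-0.25)  # right motor thrust (reverse)
```

Two such calls, one per motor, would produce the two thrust channels described above.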
s2, setting an action space and a state space according to the condition of the unmanned ship;
s3, setting a reward function: setting the target weights of the reward and designing the reward function according to the required control target of the unmanned ship, so as to control the unmanned ship;
s4, designing a deep neural network architecture:
the deep neural network architecture comprises a state value-function estimator network and a policy network; a complete Actor-Critic algorithm has two neural network structures, the Actor and the Critic;
s5, training a controller based on a PPO2 algorithm:
the pose controller of the unmanned ship is trained with the PPO2 algorithm. The total number of training episodes N is set; in each episode the unmanned ship exchanges information with the environment, i.e. the motion of the unmanned ship in the environment and the changes of its pose and position are simulated, and the interaction data are stored in an experience pool in time order regardless of the tracking result. When the experience pool is full, all data are taken out and the policy network parameters are iterated according to the PPO2 algorithm, until the set number of training episodes has been completed. The pose control result of the unmanned ship is then observed, and the learning step size, observation space, action space, training strategy and trained neural network are saved for the next run of the unmanned ship.
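The interact-store-update cycle of step s5 can be sketched as a Python skeleton; the `env`/`agent` interfaces and the pool size are hypothetical placeholders, not interfaces defined by the invention:

```python
def train(env, agent, n_episodes=100, pool_size=256):
    """Skeleton of the s5 training loop: interact with the environment,
    store transitions in the experience pool in time order, and run a
    PPO2 parameter iteration whenever the pool is full."""
    pool = []
    for episode in range(n_episodes):
        state = env.reset()
        done = False
        while not done:
            action = agent.act(state)
            next_state, reward, done = env.step(action)
            pool.append((state, action, reward, next_state, done))
            state = next_state
            if len(pool) >= pool_size:
                agent.update(pool)  # PPO2 parameter iteration on the batch
                pool.clear()        # on-policy data is discarded after use
```

Clearing the pool after each update reflects the on-policy nature of PPO2: samples generated by an older policy are not reused.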
Further, in step S3, since training aims to make the unmanned ship move toward the target point, the smaller the distance between the unmanned ship and the target point, the higher the reward obtained. In order to make the unmanned ship track the target smoothly, the speed of the unmanned ship is also made part of the reward design. The reward function used in the reinforcement learning algorithm for the unmanned ship target tracking problem is designed as follows:
r = -angle_normalize(x) - 0.1r² - 0.001(f₁+f₂)² - (u-0.5)² - 0.0001aᵤ²
The reward function takes the angle and speed of the unmanned ship as control targets; the normalization function wraps the input radian value into the range [-π, π], and a weight is set for the angular velocity term. This reward solves the problem of ineffective exploration of the unmanned ship under sparse rewards.
Further, in step S4, the Actor network has three layers. The input layer has 2 nodes, chosen according to the requirements of the controller, namely the heading angle ψ and the speed v; the hidden layer has 64 nodes; and the output layer has 2 nodes, namely the left motor control rate u_l(t) and the right motor control rate u_r(t). After u(t) is obtained, it is converted into a quantized rotation-speed value, from which the motor rotation speed is obtained. The hidden layer of the Critic network is the same as that of the Actor network; its input layer has 4 nodes, namely the heading angle ψ, the speed, the left motor control rate u_l(t) and the right motor control rate u_r(t). The heading angle and the speed are normalized by dividing by 45° and v_max respectively before being fed into the network, and the single node of the output layer is the value-function estimate V(t), used to evaluate the quality of the action. The weights of the Actor and Critic networks stop updating when training reaches the maximum number of updates or the error falls below the set threshold: the maximum number of updates of the Actor network is 200 with an error threshold of 0.005, and the maximum number of updates of the Critic network is 100 with an error threshold of 0.05.
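The 2-64-2 actor forward pass described above can be sketched as follows; the random weights and the tanh activations are illustrative assumptions (the patent does not name an activation function), and v_max is a placeholder:

```python
import math
import random

random.seed(0)

# Illustrative random weights for the 2-64-2 actor; a trained network
# would supply real values.
W1 = [[random.uniform(-0.1, 0.1) for _ in range(2)] for _ in range(64)]
W2 = [[random.uniform(-0.1, 0.1) for _ in range(64)] for _ in range(2)]

def actor_forward(psi_deg, v, v_max=1.0):
    """Normalize the inputs (heading / 45 deg, speed / v_max) and run
    the 2-64-2 MLP; outputs are the left/right motor control rates."""
    x = [psi_deg / 45.0, v / v_max]
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    out = [math.tanh(sum(w * hi for w, hi in zip(row, h))) for row in W2]
    u_l, u_r = out  # both bounded in [-1, 1] by tanh
    return u_l, u_r
```

The same 45° and v_max normalization would be applied to the first two inputs of the 4-input critic.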
Further, in step S5, wind-speed disturbance is added to the environment, and an integral compensator is introduced to ensure the stability of the system under disturbance. In addition, a maximum-entropy correction algorithm is added to the basic PPO2 algorithm, so that the underestimation caused by the maximum entropy is compensated while policy exploration is preserved, improving the learning efficiency of the algorithm.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The reinforcement learning algorithm adopted by the invention has a strong self-learning capability and can adapt to most complex environments after simple training, thereby realizing autonomous perception and better pose control of the unmanned ship.
2. The invention controls the unmanned ship through a reward-and-punishment function tailored to the characteristics of the unmanned ship, which increases the training speed under sparse rewards and enables the unmanned ship to approach the target better.
3. A delay link is added to the simulation environment of the unmanned ship, so that the real environment of the unmanned ship is simulated as closely as possible, preparing reinforcement learning for application to an actual unmanned ship.
4. A maximum-entropy correction algorithm is added to the basic PPO2 algorithm; the correction term compensates the underestimation caused by the maximum entropy while preserving policy exploration, thereby improving the learning efficiency of the algorithm.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings.
Step S1: establishing an unmanned ship environment model;
the symbol definition of the unmanned ship model is shown in the table:
the following vectors are noted in unmanned boat motion control:
η₁ = [x, y, z]ᵀ ∈ R³, η₂ = [φ, θ, ψ]ᵀ ∈ R³
v₁ = [u, v, w]ᵀ ∈ R³, v₂ = [p, q, r]ᵀ ∈ R³
τ₁ = [X, Y, Z]ᵀ ∈ R³, τ₂ = [K, M, N]ᵀ ∈ R³
where η is a position vector and a direction vector of the unmanned ship in the inertial coordinate system, v is a linear velocity vector and an angular velocity vector of the unmanned ship in the body coordinate system, and τ is a force vector and a moment vector of the unmanned ship in the body coordinate system.
The mathematical model of unmanned ship motion is:

η̇ = J(η)·v
M·v̇ + C(v)·v + D(v)·v + g(η) = τ

where J(η) is the coordinate system transformation matrix, M is the inertia matrix, C(v) is the Coriolis centripetal matrix, D(v) is the damping matrix, and g(η) is the restoring force.
The six-degree-of-freedom model of the unmanned surface vehicle is simplified, the movement of the unmanned surface vehicle in three directions of a vertical plane is ignored, and only the movement of the unmanned surface vehicle in three directions of a horizontal plane is considered.
The scalar form of the three-degree-of-freedom model of the unmanned boat is as follows:

m₁₁·u̇ = m₂₂·v·r − d₁₁·u + τᵤ
m₂₂·v̇ = −m₁₁·u·r − d₂₂·v
m₃₃·ṙ = (m₁₁ − m₂₂)·u·v − d₃₃·r + τᵣ

where m₁₁, m₂₂, m₃₃ are the diagonal elements of the rigid-body inertia matrix, d₁₁, d₂₂, d₃₃ are the diagonal elements of the damping matrix, and τᵤ, τᵣ are the surge force and yaw moment.
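Assuming the scalar model takes the standard surge-sway-yaw form, one explicit Euler integration step can be sketched as follows; the mass and damping values are placeholders, not identified parameters of the invention:

```python
def step_3dof(state, tau_u, tau_r, dt=0.05,
              m11=25.8, m22=33.8, m33=2.76,
              d11=12.0, d22=17.0, d33=0.5):
    """One Euler step of the 3-DOF surge (u), sway (v), yaw-rate (r)
    model; tau_u is the surge force and tau_r the yaw moment.
    All numeric parameters are illustrative placeholders."""
    u, v, r = state
    du = (m22 * v * r - d11 * u + tau_u) / m11
    dv = (-m11 * u * r - d22 * v) / m22
    dr = ((m11 - m22) * u * v - d33 * r + tau_r) / m33
    return (u + dt * du, v + dt * dv, r + dt * dr)
```

Repeated calls to `step_3dof` would simulate the motion of the unmanned ship in the training environment.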
S2: determining the action space and the observation space, set according to the condition of the unmanned ship;
S2.1: speed control of the unmanned ship. The action space is set to [-20, 20]; the state is the forward speed and acceleration of the unmanned ship on the water surface, with ranges [-1, 1] and [-0.1, 0.1] respectively.
S2.2: angle control of the unmanned ship. To control the swing angle, the swing moment of the unmanned ship must be controlled; the action space is [-2, 2]. The state is the angle, angular velocity and angular acceleration of the unmanned ship, with ranges [-1, 1], [-1, 1] and [-1.1, 1.1] respectively.
S2.3: simultaneous speed and angle control of the unmanned ship. The action space is set to [-2, 2]; the observed state is the running angle and forward speed of the unmanned ship, and the state space is: [-1, 1], [-1, 1], [-1.1, 1.1], [-1, 1], [-0.1, 0.1].
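Enforcing the action space of S2.3 amounts to saturating the raw policy output; a minimal sketch:

```python
ACTION_LOW, ACTION_HIGH = -2.0, 2.0  # action space of S2.3

def clip_action(a):
    """Saturate a raw policy output to the [-2, 2] action space."""
    return max(ACTION_LOW, min(ACTION_HIGH, a))
```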
S3: designing the pose reward function with comprehensive reference to the unmanned ship model;
The reward targets are obtained as follows:
In the speed control of the unmanned ship, the goal is a surface speed of 0.5 m/s, with the acceleration of the unmanned ship going to 0 as the speed approaches 0.5 m/s. To drive the unmanned ship toward this target, the reward function is set as:
r = -(u-0.5)² - 0.0001aᵤ²
In the angle control of the unmanned ship, the angle is driven to a specified value while the angular velocity and the angular acceleration are driven to 0; the reward function is set as:
r = -angle_normalize(x) - 0.1r² - 0.001N²
where the angle_normalize() function wraps the input radian value into the range [-π, π].
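A common implementation of such a wrapping function, offered here as a sketch (the modular form is an assumption about how angle_normalize is realized):

```python
import math

def angle_normalize(x):
    """Wrap an angle in radians into [-pi, pi), as used by the reward
    terms above."""
    return ((x + math.pi) % (2 * math.pi)) - math.pi
```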
In the simultaneous control of the angle and speed of the unmanned boat, the forward force and the swing moment of the unmanned boat are controlled at the same time, and the reward function is set as:
r = -angle_normalize(x) - 0.1r² - 0.001N² - (u-0.5)² - 0.0001aᵤ²
Meanwhile, in order to achieve a better training effect, a layered reward function is set and a boundary value is set for the motion environment of the unmanned ship: when the unmanned ship runs out of the boundary, the environment is reset. The layered reward function is as follows:
where et is the target boundary value and iet is the target reward.
Where mp is the boundary clipping value and bp is the penalty term.
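The layered reward can be sketched as follows; the concrete values of et, iet, mp and bp are illustrative, since the text leaves them unspecified here:

```python
def layered_reward(dist_to_goal, base_reward,
                   et=0.5, iet=10.0, mp=5.0, bp=-100.0):
    """Layered reward sketch: a bonus iet inside the target boundary et,
    and a penalty bp (with environment reset) beyond the arena boundary
    mp. All four constants are assumed values."""
    if dist_to_goal <= et:
        return base_reward + iet  # reached the target region
    if dist_to_goal >= mp:
        return bp                 # out of bounds: penalize, env resets
    return base_reward            # ordinary shaped reward otherwise
```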
S4: design deep neural network architecture
The deep neural network architecture includes a state value-function estimator network and a policy network. The algorithm has two neural network structures, the Actor and the Critic. The Actor network has three layers: the input layer has 2 nodes, chosen according to the requirements of the controller, namely the heading angle ψ and the speed v; the hidden layer has 64 nodes; and the output layer has 2 nodes, namely the left motor control rate u_l(t) and the right motor control rate u_r(t). After u(t) is obtained, it is converted into a quantized rotation-speed value, from which the motor rotation speed is obtained. The hidden layer of the Critic network is the same as that of the Actor network; its input layer has 4 nodes, namely the heading angle ψ, the speed, the left motor control rate u_l(t) and the right motor control rate u_r(t). The heading angle and the speed are divided by 45° and v_max respectively for normalization before being fed into the neural network, and the single node of the output layer is the value-function estimate V(t).
The action-selection procedure of the algorithm is shown in fig. 4. At each step of each episode the algorithm first selects an action; the strategy adopted in fig. 4 is called the action strategy and is denoted by β. However, β is not the optimal strategy: it is only used during training to generate actions for the environment so as to collect the desired data set, which is then used to train the strategy μ and obtain the optimal strategy. To balance the relation between exploration and exploitation, random noise N_t is introduced into the action selection, in the concrete form:
aₜ = μ(sₜ | θ^μ) + Nₜ
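A sketch of this action selection; Gaussian noise is assumed for N_t, and the noise scale sigma and the [-2, 2] clipping range are illustrative:

```python
import random

def select_action(mu, state, sigma=0.1, low=-2.0, high=2.0):
    """Action strategy beta: the deterministic policy output mu(state)
    plus Gaussian exploration noise N_t, clipped to the action space.
    sigma is an assumed noise scale."""
    a = mu(state) + random.gauss(0.0, sigma)
    return max(low, min(high, a))
```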
s5: controller training based on PPO2 algorithm;
The pose controller of the unmanned ship is trained with the PPO2 algorithm. The total number of training episodes N is set, and in each episode the unmanned ship exchanges information with the environment, i.e. the motion of the unmanned ship in the environment and the changes of its pose and position are simulated; the interaction data are stored in an experience pool in time order regardless of the tracking result. When the experience pool is full, all data are taken out and parameter iteration is performed on the policy network according to the PPO2 algorithm, until the set number of training episodes has been completed.
Three networks are implemented in the PPO2 algorithm: one critic network and two actor networks (a new actor and an old actor). The input of the actor network is the angle and speed of the unmanned ship; the output is a mean and a variance, from which a normal distribution is built, and the action is sampled from that distribution. The input of the critic network is the same as that of the actor, and its output is an advantage value used as the criterion for evaluating the quality of the action.
After the algorithm collects a batch of data, an estimated value function is obtained with the critic network; then, from this estimate and the reward stored at each moment of the batch, the value function at each moment in the batch is computed backwards with a certain discount rate γ, according to the formula:

V(sₜ) = rₜ + γ·V(sₜ₊₁)
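Assuming the value targets are formed by discounting backwards from the critic's bootstrap estimate of the final state, the computation can be sketched as:

```python
def discounted_returns(rewards, v_last, gamma=0.99):
    """Per-step value targets for a collected batch: bootstrap from the
    critic's estimate v_last of the state after the batch, then discount
    backwards through the stored rewards."""
    returns = []
    g = v_last
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()  # restore time order
    return returns
```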
When learning from this collected batch, the old actor network is used. The parameters of the new actor network that generated the batch are first copied into the old actor network, and then learning of the new actor network and the critic network begins. The speed and angle states stored in the batch are fed into the critic network, which outputs the estimated value function; the target value function is then computed, and finally the advantage function (TD error) is calculated. The TD error of the batch is then used to optimize the parameters of the new actor network N times, with the loss shown in the following formula:

L(θ) = -Êₜ[ min( rₜ(θ)·Âₜ, clip(rₜ(θ), 1-ε, 1+ε)·Âₜ ) ]

where rₜ(θ) is the probability ratio between the new and old actor networks and Âₜ is the advantage estimate.
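The per-sample clipped surrogate of PPO2 can be sketched as follows; eps = 0.2 is the conventional clipping range, assumed here since the text does not state it:

```python
def ppo2_clip_loss(ratio, advantage, eps=0.2):
    """PPO2 clipped surrogate for one sample: ratio is
    pi_new(a|s) / pi_old(a|s); the loss is the negated clipped
    objective, to be minimized over the batch."""
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1 + eps), 1 - eps) * advantage
    return -min(unclipped, clipped)
```

Averaging this loss over the batch and taking gradient steps with respect to the new actor's parameters implements the N optimization passes described above.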
after the training process is finished, the pose control result of the unmanned ship is observed, and the learning step length, the observation space, the action space, the training strategy and the trained neural network are saved and used as the next call of the unmanned ship.
Meanwhile, in order to make the training effect closer to reality with fewer training runs, a delay link is added to the unmanned ship's operating environment; the added delay is preprocessed in the neural network to obtain a processed state, and the unmanned ship is then trained with the PPO2 algorithm. To solve the value-function underestimation caused by the algorithm, a maximum-entropy correction is added to the PPO2 algorithm: an estimate of the state-action value function is built from the state value function and the policy function, and a new objective function is constructed from this state-action value function through the Bellman optimality equation. The new objective increases the expected return and the convergence speed of the algorithm, and contains one more correction term than the original objective. This correction term compensates the underestimation caused by the maximum entropy while preserving policy exploration, improving the learning efficiency of the algorithm.