
CN114077258B - Unmanned ship pose control method based on reinforcement learning PPO2 algorithm - Google Patents

Unmanned ship pose control method based on reinforcement learning PPO2 algorithm

Info

Publication number
CN114077258B
CN114077258B (application CN202111410180.7A)
Authority
CN
China
Prior art keywords
unmanned ship
algorithm
ppo2
unmanned
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111410180.7A
Other languages
Chinese (zh)
Other versions
CN114077258A (en)
Inventor
薛文涛
吴帅
李顺
叶辉
杨晓飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Xiaobo Intelligent Technology Co ltd
Original Assignee
Jiangsu University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu University of Science and Technology filed Critical Jiangsu University of Science and Technology
Priority to CN202111410180.7A priority Critical patent/CN114077258B/en
Publication of CN114077258A publication Critical patent/CN114077258A/en
Application granted granted Critical
Publication of CN114077258B publication Critical patent/CN114077258B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/08Control of attitude, i.e. control of roll, pitch, or yaw
    • G05D1/0875Control of attitude, i.e. control of roll, pitch, or yaw specially adapted to water vehicles
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T90/00Enabling technologies or technologies with a potential or indirect contribution to GHG emissions mitigation

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Feedback Control In General (AREA)

Abstract


The invention discloses an unmanned ship pose control method based on the reinforcement learning PPO2 algorithm, comprising: modeling the unmanned ship environment; setting the action and state spaces according to the unmanned ship's condition; setting reward target weights and designing a reward function based on the desired control targets; designing a deep neural network comprising a state value function estimator network and a policy network; and training the unmanned ship pose controller with the PPO2 algorithm, iterating the policy network parameters until the set number of training cycles is complete. The pose control result is then observed, and the learning step size, observation space, action space, training policy, and trained neural network are saved for the unmanned ship's next call. The invention uses the PPO2 algorithm to control the attitude and heading of the unmanned ship; through reinforcement learning, and without depending on a specific model, it can effectively control the unmanned ship system in complex environments.

Description

Unmanned ship pose control method based on reinforcement learning PPO2 algorithm
Technical Field
The invention belongs to the technical field of unmanned ship control, and relates to an unmanned ship pose control method based on reinforcement learning PPO2 algorithm.
Background
The unmanned ship is a surface craft capable of autonomous navigation, autonomous obstacle avoidance, and autonomous water-surface operation, with the advantages of small size, high speed, stealth, and no risk of casualties. It is well suited to surface tasks in dangerous sea areas with high casualty risk, or to simple surface tasks requiring little human participation, and has accordingly been widely and effectively applied in ocean monitoring, ocean survey, maritime search and rescue, unmanned freight, and other fields.
Although unmanned ships have seen significant research and development, surface unmanned ships remain difficult to control: the system is complex and nonlinear, has multiple controlled variables that are mutually coupled, and is underactuated. Attitude and position control is an important part of research on unmanned surface vehicles; the main problem is accurately controlling the vehicle's attitude and position in a complex water-surface environment under external disturbances and waves.
Reinforcement learning is an important branch of machine learning, developed from disciplines such as control science and computer science. It is the process by which an agent, through interactive trial and error in an environment, learns to select appropriate actions so as to maximize the accumulated return. Reinforcement learning can be regarded as reward-and-punishment learning, and is an effective method for solving sequential decision problems.
In the prior art, controlling a surface unmanned ship requires an accurate model of the ship; the multiple controlled variables and the mutual coupling between them make the ship even harder to control. This motivates designing the unmanned ship's controller, in particular its attitude-motion controller, with reinforcement learning.
The Chinese patent with publication No. CN112540614A discloses a deep reinforcement learning framework for unmanned-craft track control in a large-hysteresis system, through which a large-hysteresis non-Markov system such as an unmanned craft can also achieve a good training effect with deep reinforcement learning. Its defects are that no unmanned-ship kinematics model is included in the environment, that few of the unmanned ship's attitude controllers are controlled by reinforcement learning, and that long training times are required.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides an unmanned ship pose control method based on the reinforcement learning PPO2 (Proximal Policy Optimization 2) algorithm, in which an integrated controller replaces the traditional inner- and outer-loop controllers, and the PPO2 algorithm is used to control the pose and heading of the unmanned ship; through reinforcement learning, the unmanned ship system in a complex environment can be effectively controlled without depending on a specific model.
In order to solve the technical problems, the invention adopts the following technical scheme.
The invention discloses an unmanned ship pose control method based on reinforcement learning PPO2 algorithm, which comprises the following steps:
s1, modeling an unmanned ship environment:
the modeling comprises: designing a model of the unmanned ship; establishing rules for the unmanned ship's operating environment; generating a start point and an end point for the unmanned ship; converting the unmanned ship's input into two PWM waves, and converting the two PWM waves into two motor thrusts; designing a layered reward function to complete the run from the start point to the end point; and finally, through interaction between the simulation environment and the actual unmanned ship, obtaining the real motor speed, converting it, and feeding it into the neural network as an environment input;
s2, setting the action space and the state space according to the condition of the unmanned ship;
s3, setting a reward function: setting the reward target weights, and setting a reward function based on the required control targets of the unmanned ship, so as to control the unmanned ship;
s4, designing a deep neural network architecture:
the deep neural network structure comprises a state value function estimator network structure and a policy network structure; as a complete Actor-Critic neural network algorithm, the method has two neural network structures, Actor and Critic;
s5, training a controller based on a PPO2 algorithm:
a PPO2 algorithm is used to train the pose controller of the unmanned ship, with the total number of training periods set to N; in each period, the unmanned ship exchanges information with the environment, i.e., the motion process of the unmanned ship and its pose and position changes are simulated in the environment; regardless of the tracking result, the interaction data are stored in an experience pool in time order; when the experience pool is full, all data are taken out and the policy network parameters are iterated according to the PPO2 algorithm, until the set number of training periods has been completed; the pose control result of the unmanned ship is then observed, and the learning step size, observation space, action space, training policy, and trained neural network are saved for the unmanned ship's next call.
Further, in step S3, since training aims to make the unmanned ship move toward the target point, the smaller the distance between the unmanned ship and the target point, the higher the reward obtained. In order for the unmanned ship to track the target smoothly during target tracking, the ship's speed is also used as part of the reward function design. The reward function used in the reinforcement learning algorithm for the unmanned ship target tracking problem is designed as follows:
r = -angle_normalize(x) - 0.1r² - 0.001(f1+f2)² - (u-0.5)² - 0.0001a_u²
The reward function takes the angle and the speed of the unmanned ship as control targets; the normalization function converts the input radian value into the range [-π, π], and the angle and speed weights are set at the same time. This reward solves the problem of ineffective exploration of the unmanned ship under sparse rewards.
Further, in step S4, the Actor network comprises three layers, each with a number of nodes. The 2 input-layer nodes are designed according to the controller's requirements: the heading angle ψ and the speed v. The hidden layer has 64 nodes, and the 2 output-layer nodes are the left-motor control rate u_l(t) and the right-motor control rate u_r(t); after u(t) is obtained, it is converted into a quantized speed value and then into the motor speed. The Critic network's hidden layer is the same as the Actor's, and its 4 input-layer nodes are the heading angle ψ, the speed, and the control rates u_l(t) and u_r(t); the heading angle and speed must be normalized by dividing by 45° and v_max, respectively, before being input to the network. The single output-layer node is the value-function estimate V(t), used to evaluate the quality of the action. When the training of the Actor and Critic networks reaches the maximum number of updates, or the error falls below the set value, the weight updates stop. The maximum number of updates of the Actor network is set to 200 and its error threshold to 0.005; the maximum number of updates of the Critic network is set to 100 and its error threshold to 0.05.
Further, in step S5: wind speed interference is added to the environment, and an integral compensator is introduced to ensure the stability of the system under the interference condition. In addition, a maximum entropy correction algorithm is added to the basic PPO2 algorithm, which compensates for the underestimation caused by the maximum entropy while preserving the policy's explorability, improving the learning efficiency of the algorithm.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The reinforcement learning algorithm adopted by the invention has strong self-learning capability and can adapt to most complex environments after simple training, realizing autonomous perception and better pose control of the unmanned ship.
2. The invention controls the unmanned ship through a reward-and-punishment function tailored to the unmanned ship's characteristics, which improves the training speed under sparse rewards and lets the ship approach the target better.
3. The invention adds a delay element to the simulation environment of the unmanned ship, simulating the ship's real environment as closely as possible and preparing reinforcement learning for application to an actual unmanned ship.
4. A maximum entropy correction algorithm is added to the basic PPO2 algorithm; the correction term compensates for the underestimation caused by the maximum entropy while preserving policy explorability, improving the learning efficiency of the algorithm.
Drawings
FIG. 1 is a schematic diagram of a reinforcement learning PPO2 algorithm employed in the present invention.
Fig. 2 is a method flow diagram of one embodiment of the present invention.
FIG. 3 is a reinforcement learning Markov decision flow of one embodiment method of the present invention.
Fig. 4 is a schematic illustration of the selection of actions of an unmanned boat according to an embodiment of the present invention.
Fig. 5 is a reinforcement learning PPO2 unmanned boat controller of an embodiment of the present invention.
FIG. 6 is an unmanned boat training loss curve for one embodiment of the present invention.
FIG. 7 is a plot of the reward function achieved by an unmanned boat according to one embodiment of the invention.
FIG. 8 is a plot of unmanned boat heading angle change (in degrees) for one embodiment of the invention.
FIG. 9 is a plot of unmanned boat speed change (in meters per second) for one embodiment of the present invention.
Fig. 10 is a plot of unmanned boat position change (in meters) for one embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings.
Step S1: establishing an unmanned ship environment model;
the symbol definition of the unmanned ship model is shown in the table:
the following vectors are noted in unmanned boat motion control:
η 1 =[x,y,z] T ∈R 3 ,
v 1 =[u,v,w] T ∈R 3 ,v 2 =[p,q,r] T ∈R 3
τ 1 =[X,Y,Z] T ∈R 32 =[K,M,N] T ∈R 3
where η is a position vector and a direction vector of the unmanned ship in the inertial coordinate system, v is a linear velocity vector and an angular velocity vector of the unmanned ship in the body coordinate system, and τ is a force vector and a moment vector of the unmanned ship in the body coordinate system.
The mathematical model of unmanned ship motion is:

η̇ = J(η)v
Mv̇ + C(v)v + D(v)v + g(η) = τ
where J (η) is the coordinate system transformation matrix, C (v) is the Coriolis centripetal matrix, D (v) is the damping matrix, and g (η) is the restoring force.
The six-degree-of-freedom model of the unmanned surface vehicle is simplified, the movement of the unmanned surface vehicle in three directions of a vertical plane is ignored, and only the movement of the unmanned surface vehicle in three directions of a horizontal plane is considered.
The scalar form of the three-degree-of-freedom model of the unmanned boat is as follows:

m₁₁u̇ = m₂₂vr - d₁₁u + τ_u
m₂₂v̇ = -m₁₁ur - d₂₂v
m₃₃ṙ = (m₁₁ - m₂₂)uv - d₃₃r + τ_r
wherein m₁₁, m₂₂, m₃₃ are the diagonal elements of the rigid-body inertia matrix, and d₁₁, d₂₂, d₃₃ are the diagonal elements of the damping matrix.
S2: determining an action space and an observation space, and setting the action space and a state space according to the condition of the unmanned ship;
s2.1: the speed control of the unmanned ship sets the action space of the unmanned ship as [ -20,20], the state space is the forward speed and acceleration of the unmanned ship on the water surface, and the space is respectively: [ -1,1],[ -0.1,0.1].
S2.2: the angle control of the unmanned ship, in order to control the swing angle of the unmanned ship, the swing moment of the unmanned ship needs to be controlled, and the space size is [ -2,2]; the state is the angle, angular velocity and angular acceleration of unmanned ship, and its size in space is respectively: [ -1,1],[ -1,1],[ -1.1,1.1].
S2.3: the speed and the angle of the unmanned ship are controlled simultaneously, the action space of the unmanned ship is set to be [ -2,2], the observed state is the running angle of the unmanned ship and the advancing speed of the unmanned ship, and the state space is set as follows: [ -1,1],[ -1,1],[ -1.1,1.1],[ -1,1],[ -0.1,0.1].
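The bounds in S2.1–S2.3 can be gathered into a small configuration sketch. This is illustrative only — the names `SPACES` and `clip_action`, and the grouping into modes, are assumptions, not from the patent — but it shows how an action would be clipped to the stated bound before being applied:

```python
# Illustrative sketch of the action/state bounds from S2.1-S2.3.
# The dictionary layout and names are assumptions for demonstration.
SPACES = {
    "speed_only": {"action": (-20.0, 20.0),
                   "state": [(-1.0, 1.0), (-0.1, 0.1)]},  # u, a_u
    "angle_only": {"action": (-2.0, 2.0),
                   "state": [(-1.0, 1.0), (-1.0, 1.0), (-1.1, 1.1)]},
    "combined":   {"action": (-2.0, 2.0),
                   "state": [(-1.0, 1.0), (-1.0, 1.0), (-1.1, 1.1),
                             (-1.0, 1.0), (-0.1, 0.1)]},
}

def clip_action(a, mode="combined"):
    # Clip a raw action to the bound of the chosen control mode.
    lo, hi = SPACES[mode]["action"]
    return max(lo, min(hi, a))
```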
S3: design a pose reward function with comprehensive reference to the unmanned ship model;
The reward targets are obtained as follows:
In the speed control of the unmanned ship, the goal is for the surface speed to reach 0.5 m/s, with the acceleration going to 0 as the speed approaches 0.5 m/s. For the unmanned ship to reach this target, the reward function is set as:
r = -(u-0.5)² - 0.0001a_u²
in the angle control of the unmanned ship, controlling the angle of the unmanned ship to reach a specified angle, then controlling the angular speed of the unmanned ship to be 0, controlling the angular acceleration of the unmanned ship to be 0, and setting a reward function as follows:
r = -angle_normalize(x) - 0.1r² - 0.001N²
wherein the angle_normalize() function converts the input radian value into the range [-π, π].
In simultaneous control of the angle and speed of the unmanned boat, the boat is controlled through its forward force and yaw moment at the same time, and the reward function is set as follows:
r = -angle_normalize(x) - 0.1r² - 0.001N² - (u-0.5)² - 0.0001a_u²
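The reward terms above can be written out directly; a minimal sketch, with `angle_normalize` implementing the [-π, π] wrapping described in the text, and the weights copied from the printed formula (the linear angle term is reproduced exactly as printed):

```python
import math

def angle_normalize(x):
    # Wrap a radian value into the range [-pi, pi].
    return ((x + math.pi) % (2.0 * math.pi)) - math.pi

def combined_reward(x, r, N, u, a_u):
    # Combined angle + speed reward as printed in the description:
    # x = angle, r = yaw rate, N = yaw moment, u = speed, a_u = acceleration.
    return (-angle_normalize(x) - 0.1 * r**2 - 0.001 * N**2
            - (u - 0.5)**2 - 0.0001 * a_u**2)
```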
Meanwhile, in order to achieve a better training effect, a layered reward function is set: a boundary value is set for the unmanned ship's motion environment, and when the ship runs out of the boundary, the environment is reset. The layered reward function is:

where et is the target boundary value, iet is the target reward, mp is the boundary clipping value, and bp is the penalty term.
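Because the piecewise formulas themselves are not reproduced above, the following is only a plausible sketch of the described behavior — a bonus inside the target region, a penalty and environment reset outside the boundary. The numeric values of et, iet, mp, and bp are assumptions; the text names the symbols without giving numbers.

```python
def layered_reward(pos, dist_to_target, base_r,
                   et=0.1, iet=10.0, mp=5.0, bp=-100.0):
    # et: target boundary value, iet: target reward,
    # mp: boundary clipping value, bp: penalty term (values assumed).
    done = False
    r = base_r
    if dist_to_target < et:   # inside the target region: bonus, episode ends
        r += iet
        done = True
    if abs(pos) > mp:         # ran out of the boundary: penalty, reset env
        r += bp
        done = True
    return r, done
```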
S4: design deep neural network architecture
The deep neural network architecture includes a state value function estimator (Critic) network and a policy (Actor) network. The algorithm has two neural network structures, Actor and Critic. The Actor network has three layers. Its 2 input-layer nodes, designed according to the controller's requirements, are the heading angle ψ and the speed v; the hidden layer has 64 nodes; and the 2 output-layer nodes are the left-motor control rate u_l(t) and the right-motor control rate u_r(t). After u(t) is obtained, it is converted into a quantized speed value and then into the motor speed. The Critic network's hidden layer is the same as the Actor's; its 4 input-layer nodes are the heading angle ψ, the speed, and the control rates u_l(t) and u_r(t), where the heading angle and speed are normalized by dividing by 45° and v_max, respectively, before entering the network. The single output-layer node is the value-function estimate V(t).
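The input normalization described for the Critic (heading divided by 45°, speed divided by v_max) can be sketched as below; the function name and the v_max value are illustrative assumptions:

```python
V_MAX = 1.0  # assumed maximum speed used for normalization

def critic_input(psi_deg, v, u_l, u_r, v_max=V_MAX):
    # Build the 4-node Critic input: normalized heading, normalized speed,
    # and the two motor control rates.
    return [psi_deg / 45.0, v / v_max, u_l, u_r]
```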
The action selection procedure of the algorithm is shown in Fig. 4. At each step of each round, the algorithm first selects an action; the strategy adopted in Fig. 4, called the action (behavior) policy, is denoted β. However, β is not the optimal policy: it is only used during training to generate actions in the environment so as to obtain the desired data set, which is then used to train the policy μ and obtain the optimal policy. To balance the relation between exploration and exploitation, random noise N_t is introduced into action selection, in the form:
a_t = μ(s_t | θ_μ) + N_t
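The behavior policy β of a_t = μ(s_t|θ_μ) + N_t can be sketched with Gaussian exploration noise; the text only names N_t, so the noise distribution and its standard deviation here are assumptions:

```python
import random

def select_action(mu, state, sigma=0.1, rng=random):
    # Behavior policy beta: the deterministic policy mu plus
    # exploration noise N_t (assumed Gaussian with std sigma).
    return mu(state) + rng.gauss(0.0, sigma)
```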
s5: controller training based on PPO2 algorithm;
A PPO2 algorithm is used to train the pose controller of the unmanned ship, with the total number of training periods set to N; in each period, the unmanned ship exchanges information with the environment, i.e., the motion process of the unmanned ship and its pose and position changes are simulated in the environment; regardless of the tracking result, the interaction data are stored in an experience pool in time order; when the experience pool is full, all data are taken out and the policy network parameters are iterated according to the PPO2 algorithm, until the set number of training periods has been completed.
Three networks are implemented in the PPO2 algorithm: an evaluator (Critic) network and two actor networks (a new actor network and an old actor network). The input of the actor network is the angle and the speed of the unmanned ship; its output is a mean and a variance, from which a normal distribution is formed, and the action is sampled from this distribution. The input of the Critic network is the same as that of the actor, and its output is an advantage value used as the standard for evaluating the quality of the action.
After the algorithm collects a batch of data, the critic network produces an estimated value function; then, from this estimate and the reward stored at each moment of the batch, the value function at each moment is computed at a certain discount rate, according to the following formula:
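The discounting formula itself is not reproduced above, but the described computation — bootstrapping from the critic's end-of-batch estimate and discounting backwards through the stored rewards — is commonly implemented as follows (names and the discount rate are illustrative):

```python
def discounted_returns(rewards, v_last, gamma=0.99):
    # Walk the batch backwards: each target value is the stored reward
    # plus the discounted value of the next step, seeded by the critic's
    # estimate v_last at the end of the batch.
    returns = []
    v = v_last
    for r in reversed(rewards):
        v = r + gamma * v
        returns.append(v)
    returns.reverse()
    return returns
```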
in learning this batch data collected, the old actor network is used. Parameters in the new actor network from which this batch data was obtained are copied to the old actor network, and then learning of the new actor network and evaluation network is started. Firstly, inputting the speed and angle state of the unmanned ship stored by the batch data into an evaluation network, outputting an estimated value function by the evaluation network, then calculating a target value function, finally calculating a dominance function (TD error), and then optimizing parameters of a new actor network N times by using the TD error of the batch data, wherein loss is shown as the following formula:
after the training process is finished, the pose control result of the unmanned ship is observed, and the learning step length, the observation space, the action space, the training strategy and the trained neural network are saved and used as the next call of the unmanned ship.
Meanwhile, to bring the training effect closer to reality and reduce the number of training runs, a delay element is added to the unmanned ship's operating environment; the added delay is preprocessed in the neural network to obtain the processed state, and the unmanned ship is then trained with the PPO2 algorithm. To address the value-function underestimation caused by the algorithm, a maximum entropy correction algorithm is added to the PPO2 algorithm: an estimate of the state-action value function is constructed from the state value function and the policy function, and a new objective function is built from this state-action value function through the Bellman optimality equation. The new objective function increases the algorithm's expected return and convergence speed. Compared with the original objective function, the maximum entropy optimization adds one correction term, which compensates for the underestimation caused by the maximum entropy while preserving policy explorability, improving the learning efficiency of the algorithm.
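The delay element added to the operating environment can be modeled as a short FIFO buffer through which actions pass before taking effect; the delay length of two control periods used here is an assumption for illustration:

```python
from collections import deque

class DelayLink:
    # Sketch of the delay element: an action issued now takes effect
    # only after `steps` control periods (length assumed).
    def __init__(self, steps=2, initial=0.0):
        self.buf = deque([initial] * steps)

    def __call__(self, action):
        self.buf.append(action)    # newest action enters the queue
        return self.buf.popleft()  # oldest queued action is applied
```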

Claims (7)

1. The unmanned ship pose control method based on reinforcement learning PPO2 algorithm is characterized by comprising the following steps:
s1, modeling an unmanned ship environment:
the modeling comprises: designing a model of the unmanned ship; establishing rules for the unmanned ship's operating environment; generating a start point and an end point for the unmanned ship; converting the unmanned ship's input into two PWM waves, and converting the two PWM waves into two motor thrusts; designing a layered reward function to complete the run from the start point to the end point; and finally, through interaction between the simulation environment and the actual unmanned ship, obtaining the real motor speed, converting it, and feeding it into the neural network as an environment input;
s2, setting an action space and a state space, and setting the action space and the state space according to the condition of the unmanned ship;
s3, setting a reward function: setting target weight of rewards, and setting a rewarding function based on a control target of a required unmanned ship so as to control the unmanned ship;
s4, designing a deep neural network architecture:
the deep neural network structure comprises a state value function estimator network structure and a policy network structure; as a complete Actor-Critic neural network algorithm, the method has two neural network structures, Actor and Critic;
s5, training a controller based on a PPO2 algorithm:
a PPO2 algorithm is used to train the pose controller of the unmanned ship, with the total number of training periods set to N; in each period, the unmanned ship exchanges information with the environment, i.e., the motion process of the unmanned ship and its pose and position changes are simulated in the environment; regardless of the tracking result, the interaction data are stored in an experience pool in time order; when the experience pool is full, all data are taken out and the policy network parameters are iterated according to the PPO2 algorithm, until the set number of training periods has been completed; the pose control result of the unmanned ship is then observed, and the learning step size, observation space, action space, training policy, and trained neural network are saved for the unmanned ship's next call.
2. The method for controlling the pose of the unmanned ship based on the reinforcement learning PPO2 algorithm according to claim 1, wherein in step S3, since training aims to make the unmanned ship move toward the target point, the smaller the distance between the unmanned ship and the target point, the higher the reward obtained; and, in order for the unmanned ship to track the target smoothly during target tracking, the ship's speed is also used as part of the reward function design; the reward function used in the reinforcement learning algorithm for the unmanned ship target tracking problem is designed as follows:
r = -angle_normalize(x) - 0.1r² - 0.001(f1+f2)² - (u-0.5)² - 0.0001a_u²
the reward function takes the angle and the speed of the unmanned ship as control targets; the normalization function converts the input radian value into the range [-π, π], and the angle and speed weights are set at the same time; this reward solves the problem of ineffective exploration of the unmanned ship under sparse rewards.
3. The unmanned ship pose control method based on reinforcement learning PPO2 algorithm according to claim 1, wherein in step S4, the Actor network comprises three layers, each with a number of nodes; the 2 input-layer nodes are designed according to the requirements of the controller and are respectively the heading angle ψ and the speed v; the hidden layer has 64 nodes, and the 2 output-layer nodes are respectively the left-motor control rate u_l(t) and the right-motor control rate u_r(t); after u(t) is obtained, it is converted to obtain a quantized speed value and then the motor speed; the hidden layers of the Critic network and the Actor network are the same, and the 4 input-layer nodes are respectively the heading angle ψ, the speed, the left-motor control rate u_l(t) and the right-motor control rate u_r(t), where the heading angle and speed need to be divided by 45° and v_max, respectively, and normalized before being input to the neural network; the single output-layer node is the value-function estimate V(t), used to evaluate the quality of the action; when the training of the Actor network and the Critic network reaches the maximum number of updates or the error is smaller than the set value, the weight updates stop.
4. The unmanned ship pose control method based on reinforcement learning PPO2 algorithm according to claim 1 or 3, wherein the maximum number of updates of the Actor network is set to 200, and the error threshold is set to 0.005.
5. The unmanned ship pose control method based on reinforcement learning PPO2 algorithm according to claim 1 or 3, wherein the maximum number of updates of the Critic network is set to 100, and the error threshold is set to 0.05.
6. The unmanned ship pose control method based on reinforcement learning PPO2 algorithm according to claim 1, wherein in step S5: wind speed interference is added in the environment, and the stability of the system under the interference condition is ensured by introducing an integral compensator.
7. The unmanned ship pose control method based on reinforcement learning PPO2 algorithm according to claim 1, wherein in step S5: a maximum-entropy correction is added to the basic PPO2 algorithm, which preserves the explorability of the strategy while compensating for the underestimation caused by the maximum entropy, thereby improving the learning efficiency of the algorithm.
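The basic PPO2 update that claim 7 builds on is the clipped-surrogate objective, with the maximum-entropy term added as a bonus. The sketch below shows only this standard objective; the clip range and entropy coefficient are assumed hyperparameters, and the patent's specific underestimation compensation is not reproduced here.

```python
import numpy as np

def ppo2_loss(log_prob_new, log_prob_old, advantage, entropy,
              clip_eps=0.2, ent_coef=0.01):
    """Clipped PPO2 surrogate loss with an entropy bonus.

    The probability ratio r = pi_new / pi_old is clipped to
    [1 - eps, 1 + eps] so a single update cannot move the policy too far;
    the entropy bonus keeps the strategy exploratory, in the spirit of the
    maximum-entropy correction of claim 7.
    """
    ratio = np.exp(log_prob_new - log_prob_old)
    surr1 = ratio * advantage
    surr2 = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    # Maximize the clipped surrogate plus entropy -> minimize its negative.
    return -(np.minimum(surr1, surr2).mean() + ent_coef * entropy.mean())

# Toy batch: with identical old and new policies the ratio is exactly 1,
# so the loss reduces to -(mean advantage + ent_coef * mean entropy).
adv = np.array([1.0, -0.5, 2.0])
lp = np.log(np.array([0.3, 0.4, 0.3]))
H = np.array([1.0, 1.0, 1.0])
loss = ppo2_loss(lp, lp, adv, H)
```

With this batch the loss equals -(2.5/3 + 0.01); in training, the gradient of this loss with respect to the policy parameters would drive the Actor update.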
CN202111410180.7A 2021-11-22 2021-11-22 Unmanned ship pose control method based on reinforcement learning PPO2 algorithm Active CN114077258B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111410180.7A CN114077258B (en) 2021-11-22 2021-11-22 Unmanned ship pose control method based on reinforcement learning PPO2 algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111410180.7A CN114077258B (en) 2021-11-22 2021-11-22 Unmanned ship pose control method based on reinforcement learning PPO2 algorithm

Publications (2)

Publication Number Publication Date
CN114077258A CN114077258A (en) 2022-02-22
CN114077258B true CN114077258B (en) 2023-11-21

Family

ID=80284244

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111410180.7A Active CN114077258B (en) 2021-11-22 2021-11-22 Unmanned ship pose control method based on reinforcement learning PPO2 algorithm

Country Status (1)

Country Link
CN (1) CN114077258B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114879671B (en) * 2022-05-04 2024-10-15 哈尔滨工程大学 Unmanned ship track tracking control method based on reinforcement learning MPC
CN115016496B (en) * 2022-06-30 2024-11-22 重庆大学 Path tracking method of unmanned surface vehicle based on deep reinforcement learning
CN115097847B (en) * 2022-07-20 2025-12-19 广州工业智能研究院 Unmanned ship obstacle avoidance method and system
CN115294674B (en) * 2022-10-09 2022-12-20 南京信息工程大学 A method for monitoring and evaluating the navigation status of unmanned boats
CN115453914B (en) * 2022-10-19 2023-05-16 哈尔滨理工大学 Unmanned ship recovery distributed decision simulation system considering sea wave interference
CN115903820B (en) * 2022-11-29 2025-05-30 上海大学 Multi-unmanned-ship escape game control method
CN116400700B (en) * 2023-04-13 2025-09-26 华中科技大学 A construction method and application of an unmanned boat swarm capture control model
CN118276591A (en) * 2024-05-30 2024-07-02 吉林大学 A tracking control method for underwater autonomous vehicles facing maneuvering targets
CN118915553A (en) * 2024-07-31 2024-11-08 中国科学院沈阳自动化研究所 Intelligent auxiliary driving method for deep sea manned submersible vehicle based on imitation learning
CN119414865B (en) * 2024-10-24 2025-11-28 广州大学 Underwater robot attitude control method and device based on PPO algorithm

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA3060900A1 (en) * 2018-11-05 2020-05-05 Royal Bank Of Canada System and method for deep reinforcement learning
CN111580544A (en) * 2020-03-25 2020-08-25 北京航空航天大学 Unmanned aerial vehicle target tracking control method based on reinforcement learning PPO algorithm
CN111880535A (en) * 2020-07-23 2020-11-03 上海交通大学 A hybrid sensing autonomous obstacle avoidance method and system for unmanned boats based on reinforcement learning
JP2021034050A (en) * 2019-08-21 2021-03-01 Harbin Engineering University AUV action plan and operation control method based on reinforcement learning
CN112631283A (en) * 2020-12-08 2021-04-09 江苏科技大学 Control system and control method for water-air amphibious unmanned aircraft
CN112947431A (en) * 2021-02-03 2021-06-11 海之韵(苏州)科技有限公司 Unmanned ship path tracking method based on reinforcement learning
CN113093727A (en) * 2021-03-08 2021-07-09 哈尔滨工业大学(深圳) Robot map-free navigation method based on deep security reinforcement learning
CN113110504A (en) * 2021-05-12 2021-07-13 南京云智控产业技术研究院有限公司 Unmanned ship path tracking method based on reinforcement learning and line-of-sight method
CN113176776A (en) * 2021-03-03 2021-07-27 上海大学 Unmanned ship weather self-adaptive obstacle avoidance method based on deep reinforcement learning

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA3060900A1 (en) * 2018-11-05 2020-05-05 Royal Bank Of Canada System and method for deep reinforcement learning
JP2021034050A (en) * 2019-08-21 2021-03-01 Harbin Engineering University AUV action plan and operation control method based on reinforcement learning
CN111580544A (en) * 2020-03-25 2020-08-25 北京航空航天大学 Unmanned aerial vehicle target tracking control method based on reinforcement learning PPO algorithm
CN111880535A (en) * 2020-07-23 2020-11-03 上海交通大学 A hybrid sensing autonomous obstacle avoidance method and system for unmanned boats based on reinforcement learning
CN112631283A (en) * 2020-12-08 2021-04-09 江苏科技大学 Control system and control method for water-air amphibious unmanned aircraft
CN112947431A (en) * 2021-02-03 2021-06-11 海之韵(苏州)科技有限公司 Unmanned ship path tracking method based on reinforcement learning
CN113176776A (en) * 2021-03-03 2021-07-27 上海大学 Unmanned ship weather self-adaptive obstacle avoidance method based on deep reinforcement learning
CN113093727A (en) * 2021-03-08 2021-07-09 哈尔滨工业大学(深圳) Robot map-free navigation method based on deep security reinforcement learning
CN113110504A (en) * 2021-05-12 2021-07-13 南京云智控产业技术研究院有限公司 Unmanned ship path tracking method based on reinforcement learning and line-of-sight method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
A composite control method for surface unmanned vessel navigation and mission payload; Han Peiyu; Han Wei; Liang Xu; Unmanned Systems Technology (Issue 03); 50-55 *
A quadrotor attitude control algorithm based on reinforcement learning; Jia Zhenyu; Liu Zilong; Journal of Chinese Computer Systems; Vol. 42 (Issue 10); 2074-2078 *
A novel variable-structure global fast terminal sliding-mode control method for surface unmanned vessel heading; Zhang Chen; Xue Wentao; Hou Xiaoyan; Journal of Jiangsu University of Science and Technology (Natural Science Edition) (Issue 03); 383-387, 413 *

Also Published As

Publication number Publication date
CN114077258A (en) 2022-02-22

Similar Documents

Publication Publication Date Title
CN114077258B (en) Unmanned ship pose control method based on reinforcement learning PPO2 algorithm
CN108803321B (en) Autonomous underwater vehicle track tracking control method based on deep reinforcement learning
CN112286218B (en) High angle of attack rock suppression method for aircraft based on deep deterministic policy gradient
CN111340868B (en) Unmanned underwater vehicle autonomous decision control method based on visual depth estimation
CN112132263B (en) Multi-agent autonomous navigation method based on reinforcement learning
CN115256401B (en) Variable impedance control method for shaft hole assembly of space manipulator based on reinforcement learning
CN115793455B (en) Trajectory tracking control method of unmanned boat based on Actor-Critic-Advantage network
CN115755949B (en) A multi-UAV formation cluster control method based on multi-agent deep reinforcement learning
CN116038691A (en) A Continuum Manipulator Motion Control Method Based on Deep Reinforcement Learning
CN116697829B (en) Rocket landing guidance method and system based on deep reinforcement learning
CN109782600A (en) A method of autonomous mobile robot navigation system is established by virtual environment
CN117908565A (en) Unmanned aerial vehicle safety path planning method based on maximum entropy multi-agent reinforcement learning
CN108181914A (en) A kind of neutral buoyancy robot pose and track Auto-disturbance-rejection Control
CN114396949A (en) Mobile robot no-priori map navigation decision-making method based on DDPG
CN114115262B (en) Multi-AUV actuator saturated collaborative formation control system and method based on azimuth information
CN116430891A (en) A Deep Reinforcement Learning Method for Multi-Agent Path Planning Environment
CN114943182A (en) Robot cable shape control method and device based on graph neural network
Pan et al. Learning for depth control of a robotic penguin: A data-driven model predictive control approach
CN115933712A (en) Bionic fish leader-follower formation control method based on deep reinforcement learning
CN118672287A (en) Hybrid self-adaptive underwater robot track tracking control method
CN115808931B (en) Motion control method, device, system, equipment and storage medium for underwater robot
CN117406762A (en) A UAV remote control algorithm based on segmented reinforcement learning
Niu et al. Deep reinforcement learning from human preferences for ROV path tracking
CN120122721B (en) Unmanned aerial vehicle and unmanned ship formation control method and device based on graph rolling network and deep reinforcement learning
Sola et al. Evaluation of a deep-reinforcement-learning-based controller for the control of an autonomous underwater vehicle

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20240229

Address after: Room A529, Building 2, No. 1999 Diamond Road, Weitang Town, Xiangcheng District, Suzhou City, Jiangsu Province, 215100

Patentee after: Suzhou Xiaobo Intelligent Technology Co.,Ltd.

Country or region after: China

Address before: 212100 NO.666, Changhui Road, Dantu District, Zhenjiang City, Jiangsu Province

Patentee before: JIANGSU University OF SCIENCE AND TECHNOLOGY

Country or region before: China
