CN118493388B - A deep reinforcement learning robotic grasping method for sparse rewards - Google Patents
Info
- Publication number
- CN118493388B (application CN202410677163.7A)
- Authority
- CN
- China
- Prior art keywords
- action
- experience
- network
- target
- state
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B25—HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
- B25J—MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
- B25J9/00—Programme-controlled manipulators
- B25J9/16—Programme controls
- B25J9/1628—Programme controls characterised by the control loop
- B25J9/163—Programme controls characterised by the control loop learning, adaptive, model based, rule based expert control
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B25—HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
- B25J—MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
- B25J9/00—Programme-controlled manipulators
- B25J9/16—Programme controls
- B25J9/1612—Programme controls characterised by the hand, wrist, grip control
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B25—HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
- B25J—MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
- B25J9/00—Programme-controlled manipulators
- B25J9/16—Programme controls
- B25J9/1656—Programme controls characterised by programming, planning systems for manipulators
- B25J9/1664—Programme controls characterised by programming, planning systems for manipulators characterised by motion, path, trajectory planning
Landscapes
- Engineering & Computer Science (AREA)
- Robotics (AREA)
- Mechanical Engineering (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Orthopedic Medicine & Surgery (AREA)
- Manipulator (AREA)
Abstract
The invention discloses a sparse-reward-oriented deep reinforcement learning robotic arm grasping method. First, the characteristics of the robotic arm grasping task are analyzed, the task is modeled as a Markov decision problem, and a binary sparse reward is designed, which reduces the complexity and cost of reward function design. Second, with the DDPG algorithm as the main deep reinforcement learning training framework, an Actor-Critic network structure is built to handle the continuous state-action space. Then, a hindsight experience replay mechanism is designed: the G-HGG algorithm assists target generation, a pre-trained action network performs action screening, and exploration noise and an energy function are added to process the accumulated experience pool, which improves experience utilization, training efficiency, and grasp success rate. Finally, the robotic arm model and scene information are built, and training is optimized with interaction data to achieve target grasping by the robotic arm.
Description
Technical Field
The invention relates to the field of robotic arm grasping, and in particular to a sparse-reward-oriented deep reinforcement learning robotic arm grasping method.
Background
Robotic arm grasping is one of the fundamental directions in robotics and plays a key role in industrial automation: it can improve production efficiency, reduce labor costs, and enable automated production. Existing grasping control methods fall mainly into two categories, classical analytical methods and data-driven methods. Classical analytical methods are the traditional approach to target grasping; they require building geometric, contact, and rigid-body models and performing kinematic, dynamic, and mechanical analysis and computation. Their limitation is that the mathematical and physical modeling of some scenes is very complex and the modeling process is difficult to complete. In recent years, data-driven methods represented by deep reinforcement learning have developed rapidly, and researchers at home and abroad have devoted themselves to the study of data-driven robotic arm grasping methods.
Deep reinforcement learning combines the strong feature-extraction capability of deep learning with the excellent decision-making capability of reinforcement learning; it can cope with large-scale complex environments, generalizes well, and supports end-to-end learning. Actor-Critic algorithms such as DDPG, SAC, and PPO handle continuous state-action spaces well; in grasping tasks in particular, they allow the robotic arm to execute grasping actions accurately in continuous space. In deep reinforcement learning based grasp learning, the agent optimizes its decisions through rewards. In more complex scenes the agent rarely obtains positive rewards, and sparse reward signals cause slow network convergence or even training failure; there remains a large space for exploring solutions to the problems caused by sparse rewards. Therefore, a sparse-reward-oriented deep reinforcement learning robotic arm grasping method is urgently needed.
Disclosure of Invention
In view of the above technical shortcomings, the invention aims to provide a sparse-reward-oriented deep reinforcement learning robotic arm grasping method. It uses the DDPG algorithm to explore a deterministic policy and can handle the continuous state-action space of the robotic arm; random exploration noise is added to the actions to enhance the arm's exploration capability. The method simplifies the learning problem, reduces design cost, and is highly interpretable; it improves experience utilization and grasp success rate, and accelerates network convergence and training.
To solve the above technical problems, the invention adopts the following technical solution:
The invention provides a sparse-reward-oriented deep reinforcement learning robotic arm grasping method, comprising the following steps:
Step 1, analyze the characteristics of the robotic arm grasping task, model it as a Markov decision problem, and design a binary sparse reward function;
Step 2, based on the deep reinforcement learning DDPG algorithm, construct the main network and target network with an Actor-Critic structure;
Step 3, design a hindsight experience replay mechanism: generate auxiliary targets with the G-HGG algorithm, perform action screening with a pre-trained action network, add exploration noise and an energy function, and accumulate an experience replay pool;
Step 4, build the robotic arm model and scene information, estimate the value function and update the policy with interaction data to obtain a deterministic policy, and complete the target grasping task.
Preferably, step 1 specifically includes the following:
A four-tuple (S, A, r, γ) is defined according to the characteristics of the robotic arm grasping task, as follows:
S1-1: S is the state space, representing the set of information observed by the agent, specifically the positions and velocities of all joints of the robotic arm, the position of the end effector, the target position, and the position and orientation of the object;
S1-2: A is the action space, representing the set of actions executed by the agent, specifically the three-dimensional coordinate increments of the end effector and the opening and closing of the gripper jaws;
S1-3: r is the reward function, which evaluates the effect of the agent's actions and guides the agent to learn to reach the desired target; the following binary sparse reward function is adopted (a sketch of its assumed form is given after this list),
where g is the auxiliary target generated by the algorithm, d is the Euclidean distance from the target object to the auxiliary target position, and d_th is the minimum distance threshold;
S1-4: γ is the discount factor, which determines how much the agent values future rewards.
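The reward formula referenced in S1-3 is not reproduced in the text above. A minimal LaTeX sketch of the standard binary sparse reward consistent with the definitions of d and d_th, assuming the common 0/-1 convention (a 1/0 convention is equally possible), is:

```latex
r(s_t, a_t, g) =
\begin{cases}
 0,  & d \le d_{th} \\
-1, & d > d_{th}
\end{cases}
```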
Preferably, the step 3 specifically includes the following:
S3-1: Create an experience replay pool by adding an energy function to the standard experience pool. Each entry is a seven-tuple (s_t||g, a_t, r_t, s_{t+1}||g, done, p, E_traj), and the main network and target network are trained from batches sampled from the pool, where s_t||g is the joint representation of the agent's state s_t at time t and the target g, a_t is the action taken by the agent at time t, r_t is the reward obtained after taking action a_t, s_{t+1} is the state the agent transitions to at time t+1 after taking action a_t in state s_t, done is a Boolean task flag indicating whether the current episode's task is completed, p is the replay priority of the current trajectory, and E_traj is the total energy of the experience trajectory, used to determine replay priority;
S3-2: Create a graph model G = (V, E) for the environment information, where V is the set of nodes in the graph and E is the set of edges, each edge connecting two candidate nodes p_1, p_2 with a weight ω;
S3-3: Precompute the shortest graph distance between every pair of nodes (p_1, p_2) ∈ P²;
S3-4: Load a set of pre-trained policies and define L targets corresponding to the K pre-trained policies;
S3-5: Construct an auxiliary target set containing M auxiliary targets based on the G-HGG algorithm;
S3-6: Replace the original target pair (s_0, g) with an auxiliary target generated in step S3-5;
S3-7: Adjust the state and target to (s', g') according to the target o_l at time t, determine a candidate action under the corresponding pre-trained policy, and store that action in the candidate action set;
S3-8: Repeat step S3-7 for all L targets;
S3-9: Repeat step S3-8 for all K pre-trained policies;
S3-10: Input the observed state s_t of the robotic arm at time t into the Actor network, output an action according to the deterministic policy π(s; θ), add Gaussian noise N_t with mean 0 and variance σ² to the action to obtain the action at time t, and store this action in the candidate action set;
S3-11: Select the optimal action from the candidate action set according to formula (2), execute it, and obtain the state s_{t+1} at time t+1;
S3-12: Repeat steps S3-9, S3-10 and S3-11 for T time steps;
S3-13: Define the potential energy function E_p(s_t) = m·g·z_t, where m is the mass of the object, g ≈ 9.81 m/s² is the gravitational acceleration, and z_t is the z-axis coordinate of the object at time t; define the kinetic energy function as formula (3), where (x_t, y_t, z_t) and (x_{t-1}, y_{t-1}, z_{t-1}) are the three-dimensional coordinates of the target object at times t and t-1, respectively, and Δt is the sampling interval of the robotic arm; define the state transition energy function as formula (4), in which the transition energy is clipped to a preset upper bound and E(s_t) is the state energy function, defined as E(s_t) = E_p(s_t) + E_k(s_t); compute the experience trajectory energy E_traj from the energy function, defined as formula (5);
S3-14: Compute the replay priority of the experience trajectory according to formula (6); one consistent form of formulas (3) to (6) is sketched after this list;
S3-15: Compute the reward r_t at time t, store the resulting experience tuple in the experience replay pool, and sample a set of auxiliary targets g' from the auxiliary target set;
S3-16: Compute the auxiliary reward r_t' at time t and store the auxiliary experience in the experience replay pool;
S3-17: Repeat step S3-16 for each auxiliary target g';
S3-18: Repeat steps S3-15 and S3-17 for T time steps.
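Formulas (3) to (6) are referenced in S3-13 and S3-14 but are not reproduced in the text above. The following LaTeX sketch gives one consistent set of forms matching the stated definitions (kinetic energy from finite differences of the object coordinates, a transition energy clipped to a preset upper bound, trajectory energy as a sum over the episode, and replay priority proportional to trajectory energy); the symbols E_trans, E_trans^max and p_i are assumed notation, and the exact expressions in the patent may differ.

```latex
E_k(s_t) = \frac{m}{2}\left[\left(\frac{x_t - x_{t-1}}{\Delta t}\right)^2
          + \left(\frac{y_t - y_{t-1}}{\Delta t}\right)^2
          + \left(\frac{z_t - z_{t-1}}{\Delta t}\right)^2\right] \qquad (3)

E_{trans}(s_t) = \operatorname{clip}\!\bigl(E(s_t) - E(s_{t-1}),\, 0,\, E_{trans}^{max}\bigr),
\qquad E(s_t) = E_p(s_t) + E_k(s_t) \qquad (4)

E_{traj} = \sum_{t=1}^{T} E_{trans}(s_t) \qquad (5)

p_i = \frac{E_{traj}^{(i)}}{\sum_{j} E_{traj}^{(j)}} \qquad (6)
```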
Preferably, the step 4 specifically includes the following:
S4-1: Set the environment coordinate system O-XYZ, set the initial state of the robotic arm, and determine the working space of the robotic arm;
S4-2: Set the scene information, such as obstacles and target objects;
S4-3: Determine the observed state information and action information of the robotic arm, and determine the arm's update time interval;
S4-4: Interact with the environment based on the DDPG algorithm and accumulate the experience replay pool using the experience-pool processing method of step 3;
S4-5: Sample a minibatch of size batchsize from the experience replay pool;
S4-6: Sort the samples in the minibatch by p; the larger the p value, the higher the replay priority (a code sketch of this priority-ordered sampling is given after this list);
S4-7: Feed the experiences into the networks for training in order of replay priority: input the state-action pair (s, a) into the Critic network and compute its Q value Q(s, a; μ), where μ denotes the Critic network parameters, updated by minimizing a loss function;
S4-8: Input the state s into the Actor network to obtain the output action a based on the deterministic policy π(s; θ), where θ denotes the Actor network parameters, updated by gradient descent;
S4-9: Soft-update the target network parameters from the main network parameters:
θ′=τ·θ+(1-τ)·θ′ (7)
μ′=τ·μ+(1-τ)·μ′ (8)
where θ' and μ' are the parameters of the Target Actor network and the Target Critic network, respectively, and τ ∈ (0, 1) is the soft-update coefficient;
S4-10: Update the experience replay pool by discarding the oldest experience trajectory in the pool and adding a new set of experiences;
S4-11: Repeat the training loop until the set number of episodes is reached, then finish training.
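As a concrete illustration of S3-1, S4-5, S4-6 and S4-10 (referenced above), the following Python sketch shows an experience replay pool that stores the seven-tuple and returns a minibatch ordered by the priority p. Class, field and parameter names, and the capacity and batch-size values, are illustrative assumptions, not taken from the patent.

```python
import random
from collections import deque, namedtuple

# Seven-tuple layout from S3-1: (s_t||g, a_t, r_t, s_{t+1}||g, done, p, E_traj)
Transition = namedtuple("Transition", [
    "state_goal", "action", "reward", "next_state_goal",
    "done", "priority", "traj_energy"])

class EnergyPrioritizedReplayPool:
    def __init__(self, capacity=100_000):
        # Oldest experiences are discarded automatically once capacity is exceeded (S4-10).
        self.pool = deque(maxlen=capacity)

    def store(self, transition: Transition):
        self.pool.append(transition)

    def sample(self, batch_size=128):
        # S4-5: draw a minibatch; S4-6: order it by p, larger p replayed first.
        minibatch = random.sample(list(self.pool), min(batch_size, len(self.pool)))
        return sorted(minibatch, key=lambda tr: tr.priority, reverse=True)
```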
The invention has the beneficial effects that:
1) The DDPG algorithm is used to explore a deterministic policy, so the continuous state-action space of the robotic arm can be handled; random exploration noise is added to the actions, enhancing the arm's exploration capability;
2) The reward signal takes a binary sparse form, which reduces the complexity of reward function design, simplifies the learning problem, lowers design cost, and is highly interpretable;
3) Based on hindsight experience replay, the G-HGG algorithm generates auxiliary targets, a pre-trained action network performs action screening, and an energy function is added to process the experience pool, which improves experience utilization, grasp success rate, network convergence speed, and training efficiency.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a framework diagram of the sparse-reward-oriented deep reinforcement learning robotic arm grasping method;
FIG. 2 is a scene information diagram of the MuJoCo simulation environment constructed in the embodiment;
FIG. 3 is a schematic diagram of the environment information graph model provided in the embodiment;
FIG. 4 is a schematic diagram of the shortest paths between nodes provided in the embodiment;
FIG. 5 is a graph of the robotic arm grasping training results in the embodiment.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to FIG. 1, a sparse-reward-oriented deep reinforcement learning robotic arm grasping method includes the following steps:
Step 1, analyze the characteristics of the robotic arm grasping task, model it as a Markov decision problem, and design a binary sparse reward function;
Step 2, based on the deep reinforcement learning DDPG algorithm, construct the main network and target network with an Actor-Critic structure;
Step 3, design a hindsight experience replay mechanism: generate auxiliary targets with the G-HGG algorithm, perform action screening with a pre-trained action network, add exploration noise and an energy function, and accumulate an experience replay pool;
Step 4, build the robotic arm model and scene information, estimate the value function and update the policy with interaction data to obtain a deterministic policy, and complete the target grasping task.
The step 1 specifically comprises the following contents:
A four-tuple (S, A, r, γ) is defined according to the characteristics of the robotic arm grasping task, as follows:
S1-1: S is the state space, representing the set of information observed by the agent, specifically the positions and velocities of all joints of the robotic arm, the position of the end effector, the target position, and the position and orientation of the object;
S1-2: A is the action space, representing the set of actions executed by the agent, specifically the three-dimensional coordinate increments of the end effector and the opening and closing of the gripper jaws;
S1-3: r is the reward function, which evaluates the effect of the agent's actions and guides the agent to learn to reach the desired target; the binary sparse reward function described in step 1 above is adopted (a small code sketch follows this list),
where g is the auxiliary target generated by the algorithm, d is the Euclidean distance from the target object to the auxiliary target position, and d_th is the minimum distance threshold;
S1-4: γ is the discount factor, which determines how much the agent values future rewards.
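The code sketch referred to in S1-3 above: a minimal Python version of the binary sparse reward, assuming the 0/-1 convention. The threshold value of 0.05 m is an illustrative assumption; only d and d_th themselves come from the patent text.

```python
import numpy as np

def binary_sparse_reward(object_pos, auxiliary_goal, d_th=0.05):
    """Return 0 when the object lies within d_th of the auxiliary goal g, else -1.

    The 0.05 m threshold and the 0/-1 convention are illustrative assumptions.
    """
    d = np.linalg.norm(np.asarray(object_pos) - np.asarray(auxiliary_goal))
    return 0.0 if d <= d_th else -1.0
```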
The step 3 specifically comprises the following contents:
S3-1: Create an experience replay pool by adding an energy function to the standard experience pool. Each entry is a seven-tuple (s_t||g, a_t, r_t, s_{t+1}||g, done, p, E_traj), and the main network and target network are trained from batches sampled from the pool, where s_t||g is the joint representation of the agent's state s_t at time t and the target g, a_t is the action taken by the agent at time t, r_t is the reward obtained after taking action a_t, s_{t+1} is the state the agent transitions to at time t+1 after taking action a_t in state s_t, done is a Boolean task flag indicating whether the current episode's task is completed, p is the replay priority of the current trajectory, and E_traj is the total energy of the experience trajectory, used to determine replay priority.
S3-2: Create a graph model G = (V, E) for the environment information, where V is the set of nodes in the graph and E is the set of edges, each edge connecting two candidate nodes p_1, p_2 with a weight ω;
S3-3: Precompute the shortest graph distance between every pair of nodes (p_1, p_2) ∈ P²;
S3-4: Load a set of pre-trained policies and define L targets corresponding to the K pre-trained policies;
S3-5: Construct an auxiliary target set containing M auxiliary targets based on the G-HGG algorithm;
S3-6: Replace the original target pair (s_0, g) with an auxiliary target generated in step S3-5;
S3-7: Adjust the state and target to (s', g') according to the target o_l at time t, determine a candidate action under the corresponding pre-trained policy, and store that action in the candidate action set;
S3-8: Repeat step S3-7 for all L targets;
S3-9: Repeat step S3-8 for all K pre-trained policies;
S3-10: Input the observed state s_t of the robotic arm at time t into the Actor network, output an action according to the deterministic policy π(s; θ), add Gaussian noise N_t with mean 0 and variance σ² to the action to obtain the action at time t, and store this action in the candidate action set;
S3-11: Select the optimal action from the candidate action set according to formula (2), execute it, and obtain the state s_{t+1} at time t+1 (a code sketch of this candidate-action selection is given after this list);
S3-12: Repeat steps S3-9, S3-10 and S3-11 for T time steps;
S3-13: Define the potential energy function E_p(s_t) = m·g·z_t, where m is the mass of the object, g ≈ 9.81 m/s² is the gravitational acceleration, and z_t is the z-axis coordinate of the object at time t; define the kinetic energy function as formula (3), where (x_t, y_t, z_t) and (x_{t-1}, y_{t-1}, z_{t-1}) are the three-dimensional coordinates of the target object at times t and t-1, respectively, and Δt is the sampling interval of the robotic arm; define the state transition energy function as formula (4), in which the transition energy is clipped to a preset upper bound and E(s_t) is the state energy function, defined as E(s_t) = E_p(s_t) + E_k(s_t); compute the experience trajectory energy E_traj from the energy function, defined as formula (5);
S3-14: Compute the replay priority of the experience trajectory according to formula (6);
S3-15: Compute the reward r_t at time t, store the resulting experience tuple in the experience replay pool, and sample a set of auxiliary targets g' from the auxiliary target set;
S3-16: Compute the auxiliary reward r_t' at time t and store the auxiliary experience in the experience replay pool;
S3-17: Repeat step S3-16 for each auxiliary target g';
S3-18: Repeat steps S3-15 and S3-17 for T time steps.
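The code sketch referred to in S3-11 above: a minimal Python illustration of steps S3-7, S3-10 and S3-11, assuming that formula (2) selects the candidate with the highest Q value. The function signature, the noise scale, and the assumption that the actor, critic and pre-trained policies are PyTorch modules are illustrative, not taken from the patent.

```python
import torch

def screen_action(actor, critic, pretrained_policies, state_goal, sigma=0.1):
    """Build the candidate action set and return the action with the highest Q value."""
    s = torch.as_tensor(state_goal, dtype=torch.float32).unsqueeze(0)
    with torch.no_grad():
        # S3-7: candidate actions proposed by the K pre-trained policies.
        candidates = [policy(s) for policy in pretrained_policies]
        # S3-10: actor action with zero-mean Gaussian exploration noise N_t ~ N(0, sigma^2).
        a = actor(s)
        candidates.append(a + sigma * torch.randn_like(a))
        # S3-11: assumed form of formula (2), i.e. pick argmax over Q(s, a).
        q_values = torch.stack([critic(s, c).squeeze() for c in candidates])
        best = int(torch.argmax(q_values))
    return candidates[best].squeeze(0).numpy()
```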
Step 4 specifically comprises the following:
S4-1: Set the environment coordinate system O-XYZ, set the initial state of the robotic arm, and determine the working space of the robotic arm;
S4-2: Set the scene information, such as obstacles and target objects;
S4-3: Determine the observed state information and action information of the robotic arm, and determine the arm's update time interval;
S4-4: Interact with the environment based on the DDPG algorithm and accumulate the experience replay pool using the experience-pool processing method of step 3;
S4-5: Sample a minibatch of size batchsize from the experience replay pool;
S4-6: Sort the samples in the minibatch by p; the larger the p value, the higher the replay priority;
S4-7: Feed the experiences into the networks for training in order of replay priority: input the state-action pair (s, a) into the Critic network and compute its Q value Q(s, a; μ), where μ denotes the Critic network parameters, updated by minimizing a loss function (the standard forms of this loss and the policy gradient are sketched after this list);
S4-8: Input the state s into the Actor network to obtain the output action a based on the deterministic policy π(s; θ), where θ denotes the Actor network parameters, updated by gradient descent;
S4-9: Soft-update the target network parameters from the main network parameters:
θ′=τ·θ+(1-τ)·θ′ (7)
μ′=τ·μ+(1-τ)·μ′ (8)
where θ' and μ' are the parameters of the Target Actor network and the Target Critic network, respectively, and τ ∈ (0, 1) is the soft-update coefficient;
S4-10: Update the experience replay pool by discarding the oldest experience trajectory in the pool and adding a new set of experiences;
S4-11: Repeat the training loop until the set number of episodes is reached, then finish training.
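The loss function and gradient referred to in S4-7 and S4-8 are not written out above; the standard DDPG objectives consistent with that description, with Q' and π' denoting the Target Critic and Target Actor and N the minibatch size, are sketched below as an assumed form:

```latex
y_i = r_i + \gamma\, Q'\!\bigl(s_{i+1},\, \pi'(s_{i+1};\theta');\, \mu'\bigr)

L(\mu) = \frac{1}{N}\sum_{i=1}^{N}\bigl(y_i - Q(s_i, a_i;\mu)\bigr)^{2}

\nabla_{\theta} J \approx \frac{1}{N}\sum_{i=1}^{N}
 \nabla_{a} Q(s, a;\mu)\big|_{s=s_i,\,a=\pi(s_i;\theta)}\;
 \nabla_{\theta}\,\pi(s;\theta)\big|_{s=s_i}
```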
To better understand the present invention, a detailed description of the sparse-reward-oriented deep reinforcement learning robotic arm grasping method is given below in conjunction with a specific embodiment.
The simulation scene built in MuJoCo is shown in FIG. 2: a 7-degree-of-freedom robotic arm (model KUKA LBR iiwa 7 R800) equipped with a two-finger gripper executes a grasping task with an obstacle. The main-network Actor-Critic structure of the DDPG algorithm uses an MLP architecture. The Actor network has three fully connected hidden layers of 256 neurons each, with the ReLU function as the nonlinear activation of the hidden layers and a Tanh output layer that produces the continuous action policy. The Critic network has three fully connected hidden layers of 256 neurons each with ReLU activations, and an output layer with a single neuron and no activation function for value prediction. The target network has exactly the same Actor-Critic structure as the main network; only the parameter values differ. An experience pool is created whose entries are seven-tuples (s_t||g, a_t, r_t, s_{t+1}||g, done, p, E_traj), and the main and target networks are trained from batches sampled from the pool, where s_t||g is the joint representation of the agent's state s_t at time t and the target g, a_t is the action taken at time t, r_t is the reward obtained after taking a_t, s_{t+1} is the state reached at time t+1, done is a Boolean flag indicating whether the current episode's task is completed, p is the replay priority of the current trajectory, and E_traj is the total energy of the experience trajectory, used to determine replay priority. The graph model created for the environment information is shown in FIG. 3, and FIG. 4 shows the shortest paths between nodes computed with Dijkstra's algorithm.
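The architecture described above can be written down directly. The following PyTorch sketch mirrors the stated layer sizes and activations (three fully connected 256-unit ReLU hidden layers, a Tanh output for the Actor, and a single linear output for the Critic); it is an illustration of the description rather than code from the patent, and the input dimensions are placeholders.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps state||goal to a continuous action in [-1, 1] via a Tanh output layer."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh())

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    """Maps a state-action pair to a scalar Q value (no output activation)."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))
```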
A minibatch of size batchsize is sampled from the experience replay pool and sorted by p (the larger the p value, the higher the replay priority), and experiences are fed into the networks for training in order of replay priority. The state-action pair (s, a) is input into the Critic network to compute its Q value Q(s, a; μ), and the Critic parameters μ are updated by minimizing a loss function; the state s is input into the Actor network to obtain the output action a based on the deterministic policy π(s; θ), and the Actor parameters θ are updated by gradient descent. The target network parameters are soft-updated from the main network parameters according to the following formulas:
θ′=τ·θ+(1-τ)·θ′ (7)
μ′=τ·μ+(1-τ)·μ′ (8)
The experience replay pool is then updated by discarding the oldest experience trajectory and adding a new set of experiences, and training is cycled until the set number of episodes is reached.
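A minimal PyTorch sketch of the update step described above, including the soft update of formulas (7) and (8). The batch layout, optimizer handling, and the values of gamma and tau are illustrative assumptions, not parameters given in the patent.

```python
import torch
import torch.nn.functional as F

def ddpg_update(actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, batch, gamma=0.98, tau=0.005):
    """One DDPG update step followed by the soft update of formulas (7) and (8).

    `batch` is assumed to hold tensors (s, a, r, s_next, done) of shape [N, ...];
    gamma and tau are illustrative hyperparameter values.
    """
    s, a, r, s_next, done = batch

    # Critic update: minimize the TD error against the target networks.
    with torch.no_grad():
        target_q = r + gamma * (1 - done) * target_critic(s_next, target_actor(s_next))
    critic_loss = F.mse_loss(critic(s, a), target_q)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: gradient descent on the negated Q value of the actor's action.
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft update, formulas (7) and (8): theta' = tau*theta + (1 - tau)*theta'.
    for target, main in ((target_actor, actor), (target_critic, critic)):
        for tp, p in zip(target.parameters(), main.parameters()):
            tp.data.mul_(1 - tau).add_(tau * p.data)
```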
The training curve obtained after training is shown in FIG. 5; it shows fast convergence and a high success rate, the target grasping task is achieved, and the feasibility of the invention is verified.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.
Claims (4)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410677163.7A CN118493388B (en) | 2024-05-29 | 2024-05-29 | A deep reinforcement learning robotic grasping method for sparse rewards |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410677163.7A CN118493388B (en) | 2024-05-29 | 2024-05-29 | A deep reinforcement learning robotic grasping method for sparse rewards |
Publications (2)
Publication Number | Publication Date |
---|---|
CN118493388A CN118493388A (en) | 2024-08-16 |
CN118493388B true CN118493388B (en) | 2025-03-11 |
Family
ID=92239443
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410677163.7A Active CN118493388B (en) | 2024-05-29 | 2024-05-29 | A deep reinforcement learning robotic grasping method for sparse rewards |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118493388B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN119567243A (en) * | 2024-10-25 | 2025-03-07 | 安徽大学 | Human-robot interaction and anti-collision control method for collaborative robots based on admittance |
CN119427356B (en) * | 2024-11-18 | 2025-06-24 | 东莞理工学院 | Robot tracking control learning method based on posthoc screening experience playback |
CN119871469B (en) * | 2025-03-31 | 2025-06-13 | 苏州元脑智能科技有限公司 | Mechanical arm control method and device, electronic equipment and storage medium |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112102405A (en) * | 2020-08-26 | 2020-12-18 | 东南大学 | Robot stirring-grabbing combined method based on deep reinforcement learning |
CN116038691A (en) * | 2022-12-08 | 2023-05-02 | 南京理工大学 | A Continuum Manipulator Motion Control Method Based on Deep Reinforcement Learning |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP3747604B1 (en) * | 2019-06-07 | 2022-01-26 | Robert Bosch GmbH | Robot device controller, robot device arrangement and method for controlling a robot device |
CN116494247A (en) * | 2023-06-14 | 2023-07-28 | 西安电子科技大学广州研究院 | Robotic arm path planning method and system based on deep deterministic policy gradient |
CN117733841A (en) * | 2023-12-06 | 2024-03-22 | 南京邮电大学 | Mechanical arm complex operation skill learning method and system based on generation of countermeasure imitation learning |
-
2024
- 2024-05-29 CN CN202410677163.7A patent/CN118493388B/en active Active
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112102405A (en) * | 2020-08-26 | 2020-12-18 | 东南大学 | Robot stirring-grabbing combined method based on deep reinforcement learning |
CN116038691A (en) * | 2022-12-08 | 2023-05-02 | 南京理工大学 | A Continuum Manipulator Motion Control Method Based on Deep Reinforcement Learning |
Also Published As
Publication number | Publication date |
---|---|
CN118493388A (en) | 2024-08-16 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |