
CN118493388B - A deep reinforcement learning robotic grasping method for sparse rewards - Google Patents

A deep reinforcement learning robotic grasping method for sparse rewards Download PDF

Info

Publication number
CN118493388B
CN118493388B (Application CN202410677163.7A)
Authority
CN
China
Prior art keywords
action
experience
network
target
state
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410677163.7A
Other languages
Chinese (zh)
Other versions
CN118493388A (en)
Inventor
杨春雨
李博论
韩可可
刘晓敏
周林娜
张鑫
马磊
王国庆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China University of Mining and Technology CUMT
Original Assignee
China University of Mining and Technology CUMT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China University of Mining and Technology CUMT filed Critical China University of Mining and Technology CUMT
Priority to CN202410677163.7A priority Critical patent/CN118493388B/en
Publication of CN118493388A publication Critical patent/CN118493388A/en
Application granted granted Critical
Publication of CN118493388B publication Critical patent/CN118493388B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • B - PERFORMING OPERATIONS; TRANSPORTING
    • B25 - HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J - MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00 - Programme-controlled manipulators
    • B25J9/16 - Programme controls
    • B25J9/1628 - Programme controls characterised by the control loop
    • B25J9/163 - Programme controls characterised by the control loop learning, adaptive, model based, rule based expert control
    • B - PERFORMING OPERATIONS; TRANSPORTING
    • B25 - HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J - MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00 - Programme-controlled manipulators
    • B25J9/16 - Programme controls
    • B25J9/1612 - Programme controls characterised by the hand, wrist, grip control
    • B - PERFORMING OPERATIONS; TRANSPORTING
    • B25 - HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J - MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00 - Programme-controlled manipulators
    • B25J9/16 - Programme controls
    • B25J9/1656 - Programme controls characterised by programming, planning systems for manipulators
    • B25J9/1664 - Programme controls characterised by programming, planning systems for manipulators characterised by motion, path, trajectory planning

Landscapes

  • Engineering & Computer Science (AREA)
  • Robotics (AREA)
  • Mechanical Engineering (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Orthopedic Medicine & Surgery (AREA)
  • Manipulator (AREA)

Abstract

The invention discloses a sparse-reward-oriented deep reinforcement learning method for robotic arm grasping. The grasping task is first analyzed and modeled as a Markov decision problem, and a binary sparse reward is designed, which reduces the complexity and cost of reward-function design. The DDPG algorithm is then taken as the main deep reinforcement learning training framework, and an Actor-Critic network structure is built to handle the continuous state and action spaces. A hindsight experience replay mechanism is designed next: the G-HGG algorithm assists goal generation, a pre-trained action network screens actions, and exploration noise and an energy function are added when accumulating the experience pool, which raises experience utilization and improves training efficiency and grasping success rate. Finally, the robotic arm model and scene information are built, and training is optimized with interaction data to achieve target grasping by the robotic arm.

Description

Deep reinforcement learning robotic arm grasping method for sparse rewards
Technical Field
The invention relates to the field of robotic arm grasping, and in particular to a sparse-reward-oriented deep reinforcement learning grasping method for robotic arms.
Background
Robotic arm grasping is one of the fundamental directions in robotics and plays a key role in industrial automation: it can improve production efficiency, reduce labor costs, and enable automated production. Existing grasping control methods fall mainly into two categories, classical analytical methods and data-driven methods. Classical analytical methods are the traditional approach to target grasping: geometric, contact, and rigid-body models must be established, and kinematic, dynamic, and mechanical analyses carried out. Their limitation is that the mathematical and physical modeling in some scenes is very complex and the modeling process is difficult to complete. In recent years, data-driven methods represented by deep reinforcement learning have developed rapidly, and researchers at home and abroad have devoted themselves to the study of data-driven robotic arm grasping methods.
Deep reinforcement learning combines the strong feature-extraction capability of deep learning with the excellent decision-making capability of reinforcement learning; it can cope with large-scale complex environments, generalizes well, and enables end-to-end learning. Actor-Critic algorithms such as DDPG, SAC, and PPO handle continuous state and action spaces well, which is essential in grasping tasks where the robotic arm must execute precise grasping actions in a continuous space. In deep reinforcement learning based robotic arm grasping, the agent optimizes its decisions through rewards; in more complex scenes the agent rarely obtains a positive reward, and sparse reward signals lead to slow network convergence or even training failure. There is still considerable room for exploring solutions to the problems caused by sparse rewards, so a sparse-reward-oriented deep reinforcement learning robotic arm grasping method is urgently needed.
Disclosure of Invention
In view of these technical shortcomings, the invention aims to provide a sparse-reward-oriented deep reinforcement learning robotic arm grasping method. The DDPG algorithm is used to learn a deterministic policy, which can handle the continuous state and action space of the robotic arm; random exploration noise is added to the actions to enhance the arm's exploration capability; the learning problem is simplified, the design cost is reduced, and the design is highly interpretable; and experience utilization, grasping success rate, network convergence speed, and training efficiency are all improved.
In order to solve the technical problems, the invention adopts the following technical scheme:
The invention provides a sparse-reward-oriented deep reinforcement learning robotic arm grasping method, comprising the following steps:
Step 1: analyze the characteristics of the robotic arm grasping task, model it as a Markov decision problem, and design a binary sparse reward function;
Step 2: based on the deep reinforcement learning DDPG algorithm, construct the main network and target network of an Actor-Critic structure;
Step 3: design a hindsight experience replay mechanism, generate auxiliary goals with the G-HGG algorithm, screen actions with a pre-trained action network, add exploration noise and an energy function, and accumulate the experience replay pool;
Step 4: build the robotic arm model and scene information, estimate the value function and update the policy from interaction data to obtain a deterministic policy, and complete the target grasping task.
Preferably, step 1 specifically includes the following:
According to the characteristics of the robotic arm grasping task, a four-tuple (S, A, r, γ) is defined, specifically:
S1-1: S is the state space, representing the set of information observed by the agent, specifically the positions and velocities of all joints of the robotic arm, the position of the end effector, the target position, and the position and orientation of the object;
S1-2: A is the action space, representing the set of operations executed by the agent, specifically the three-dimensional coordinate increment of the end effector and the opening and closing of the gripper;
S1-3: r is the reward function, used to evaluate the effect of the action executed by the agent and thereby guide the agent toward the desired goal; a binary sparse reward function, equation (1), is adopted, in which the reward indicates success only when d ≤ d_th,
where g is the auxiliary goal generated by the algorithm, d is the Euclidean distance from the target object to the auxiliary goal position, and d_th is the minimum threshold of that distance (a minimal code sketch of this reward follows this list);
S1-4: γ is the discount factor, determining how much importance the agent attaches to future rewards.
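As an illustration of the binary sparse reward of step S1-3, the sketch below assumes the common convention of returning 0 when the object is within the distance threshold of the auxiliary goal and -1 otherwise; the exact success/failure values and the threshold value d_th are illustrative assumptions, not taken from the patent text.

```python
import numpy as np

def binary_sparse_reward(object_pos, goal_pos, d_th=0.05):
    """Binary sparse reward of step S1-3 (sketch).

    Assumes the 0 / -1 convention: 0 when the Euclidean distance d from the
    target object to the auxiliary goal g is within the threshold d_th,
    -1 otherwise. The threshold of 0.05 m is an illustrative assumption.
    """
    d = np.linalg.norm(np.asarray(object_pos) - np.asarray(goal_pos))
    return 0.0 if d <= d_th else -1.0

# Example: an object about 3.5 cm from the goal counts as a success under a 5 cm threshold.
print(binary_sparse_reward([0.40, 0.02, 0.43], [0.42, 0.00, 0.41]))  # -> 0.0
```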
Preferably, step 3 specifically includes the following:
S3-1: Create an experience replay pool by adding an energy function to a standard experience pool. The pool stores data as the seven-tuple (s_t||g, a_t, r_t, s_{t+1}||g, done, p, E_traj), and the main and target networks are trained by batch sampling from the pool, where s_t||g is the joint representation of the agent's state s_t at time t and the goal g, a_t is the action taken by the agent at time t, r_t is the reward obtained after taking action a_t, s_{t+1} is the state reached at time t+1 after taking action a_t in state s_t, done is a Boolean task flag indicating whether the current episode's task is completed, p is the replay priority of the current trajectory, and E_traj is the total energy of the experience trajectory, used to determine the replay priority;
S3-2: Create a graph model G = (V, E) for the environment information, where V is the set of nodes in the graph and E is the set of weighted edges, each edge connecting two possible nodes p_1, p_2 with weight ω;
S3-3: Use Dijkstra's algorithm to pre-compute the shortest distance between any two nodes (p_1, p_2) ∈ P²;
S3-4: Load a set of pre-trained policies and define L goals corresponding to the K pre-trained policies;
S3-5: Based on the G-HGG algorithm, construct an auxiliary goal set containing M auxiliary goals;
S3-6: Use the auxiliary goals generated in step S3-5 to replace the original goal (s_0, g);
S3-7: At time t, adjust the state and goal to (s′, g′) according to the goal o_l, determine a candidate action from the pre-trained policy, and store this action in the action set A_t;
S3-8: Repeat step S3-7 a total of L times;
S3-9: Repeat step S3-8 a total of K times;
S3-10: Input the observed state s_t of the robotic arm at time t into the Actor network, output an action according to the deterministic policy π(s; θ), add Gaussian noise N_t with mean 0 and variance σ² to that action to obtain the time-t action, and store this action in the action set A_t;
S3-11: Select the optimal action according to equation (2), execute it, and obtain the state s_{t+1} at time t+1 (a code sketch of this action screening follows this list):
a* = argmax Q^π(s_t, g, A_t)  (2)
S3-12: Repeat steps S3-9, S3-10, and S3-11 a total of T times;
S3-13: Define the potential energy function E_p(s_t) = m·g·z_t, where m is the mass of the object, g ≈ 9.81 m/s² is the gravitational acceleration, and z_t is the z-axis coordinate of the object at time t; define the kinetic energy function as equation (3):
E_k(s_t) = (m/2)·[(x_t - x_{t-1})² + (y_t - y_{t-1})² + (z_t - z_{t-1})²]/Δt²  (3)
where (x_t, y_t, z_t) and (x_{t-1}, y_{t-1}, z_{t-1}) are the three-dimensional coordinates of the target object at times t and t-1, respectively, and Δt is the sampling time interval of the robotic arm;
Define the state transition energy function as equation (4):
E_tran(s_{t-1}, s_t) = min(E(s_t) - E(s_{t-1}), E_max)  (4)
where E_max denotes the upper bound of the transition energy, a preset value, and E(s_t) is the state energy function, defined as E(s_t) = E_p(s_t) + E_k(s_t);
Compute the trajectory energy E_traj from the energy function, defined as equation (5):
E_traj = Σ_{t=1}^{T} E_tran(s_{t-1}, s_t)  (5)
S3-14: Compute the replay priority p(T) of the experience trajectory according to equation (6);
S3-15: Compute the reward r_t at time t, store the corresponding six-tuple in the experience pool R, and sample a set of auxiliary goals g′ from the auxiliary goal set;
S3-16: Compute the auxiliary reward r_t′ at time t and store the auxiliary experience in the experience pool R;
S3-17: Repeat step S3-16 for each auxiliary goal g′;
S3-18: Repeat steps S3-15 and S3-17 a total of T times.
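To make steps S3-7 through S3-11 concrete, the sketch below shows one way the candidate-action set A_t could be assembled from pre-trained policies and the noisy Actor output, and the best action selected by equation (2). The call signatures of `pretrained_policies`, `actor`, and `critic`, and the noise scale, are illustrative assumptions rather than the patent's implementation.

```python
import numpy as np

def screen_action(s_t, g, goals, pretrained_policies, actor, critic, sigma=0.1, rng=None):
    """Action screening sketch for steps S3-7 to S3-11 (assumed interfaces).

    - pretrained_policies: list of callables pi_k(state, goal) -> action
    - actor:  callable pi(state, goal) -> action (deterministic main policy)
    - critic: callable Q(state, goal, action) -> scalar value
    """
    rng = rng or np.random.default_rng()
    candidates = []

    # S3-7/S3-8/S3-9: candidate actions from the K pre-trained policies and L goals.
    for pi_k in pretrained_policies:
        for o_l in goals:
            candidates.append(np.asarray(pi_k(s_t, o_l)))

    # S3-10: Actor action with zero-mean Gaussian exploration noise N_t ~ N(0, sigma^2).
    a_actor = np.asarray(actor(s_t, g))
    candidates.append(a_actor + rng.normal(0.0, sigma, size=a_actor.shape))

    # S3-11: a* = argmax over A_t of Q^pi(s_t, g, a), equation (2).
    q_values = [critic(s_t, g, a) for a in candidates]
    return candidates[int(np.argmax(q_values))]
```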
Preferably, step 4 specifically includes the following:
S4-1: Set the environment coordinate system O-XYZ, set the initial state of the robotic arm, and determine its workspace;
S4-2: Set the scene information, such as obstacles and target objects;
S4-3: Determine the observed state information and action information of the robotic arm, and determine its update time interval;
S4-4: Interact with the environment based on the DDPG algorithm, and accumulate the experience replay pool using the experience-pool processing method of step 3;
S4-5: Sample a minibatch B of size batchsize from the experience replay pool R;
S4-6: Sort the samples in B according to p; the larger the value of p, the higher the replay priority;
S4-7: Feed the experiences in B into the network for training in order of replay priority; input the state-action pair (s, a) into the Critic network and compute its Q value Q(s, a; μ), where μ denotes the Critic network parameters, updated by minimizing a loss function;
S4-8: Input the state s into the Actor network to obtain the output action a from the deterministic policy π(s; θ), where θ denotes the Actor network parameters, updated by gradient descent;
S4-9: Softly update the target network parameters from the main network parameters (a minimal sketch of this update follows this list):
θ′=τ·θ+(1-τ)·θ′ (7)
μ′=τ·μ+(1-τ)·μ′ (8)
where θ′ and μ′ are the parameters of the Target Actor network and the Target Critic network in the target network, respectively, τ is the soft-update magnitude of the network parameters, and τ ∈ (0, 1);
S4-10: Update the experience replay pool by discarding the oldest experience trajectory and accumulating a new set of experiences;
S4-11: Repeat the training loop until the set number of episodes is reached, at which point training ends.
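The soft update of equations (7) and (8) can be written directly as a parameter-wise blend. The sketch below assumes the network parameters are stored as NumPy arrays keyed by name, which is an illustrative representation rather than the patent's data structure, and the value τ = 0.005 is only an example.

```python
def soft_update(target_params, main_params, tau=0.005):
    """Polyak soft update, equations (7)-(8): theta' <- tau*theta + (1-tau)*theta'.

    target_params / main_params: dicts mapping parameter names to NumPy arrays.
    tau in (0, 1) is the soft-update magnitude; 0.005 is an assumed example value.
    """
    for name, theta in main_params.items():
        target_params[name] = tau * theta + (1.0 - tau) * target_params[name]
    return target_params
```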
The invention has the beneficial effects that:
1) The DDPG algorithm is used to learn a deterministic policy, which can handle the continuous state and action spaces of the robotic arm; random exploration noise is added to the actions, enhancing the arm's exploration capability;
2) The reward signal takes a binary sparse form, which reduces the complexity and cost of reward-function design, simplifies the learning problem, and is highly interpretable;
3) On top of hindsight experience replay, the G-HGG algorithm generates auxiliary goals, a pre-trained action network screens actions, and an energy function is added when processing the experience pool, which raises experience utilization, increases the grasping success rate, and improves network convergence speed and training efficiency.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a framework diagram of the sparse-reward-oriented deep reinforcement learning robotic arm grasping method;
FIG. 2 is a scene information diagram of the MuJoCo simulation environment built in an embodiment;
FIG. 3 is a schematic diagram of the graph model of the environment information provided in an embodiment;
FIG. 4 is a schematic diagram of the shortest paths between nodes provided in an embodiment;
Fig. 5 is a graph of the robotic arm grasping training results in an embodiment.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to FIG. 1, the sparse-reward-oriented deep reinforcement learning robotic arm grasping method includes the following steps:
Step 1: analyze the characteristics of the robotic arm grasping task, model it as a Markov decision problem, and design a binary sparse reward function;
Step 2: based on the deep reinforcement learning DDPG algorithm, construct the main network and target network of an Actor-Critic structure;
Step 3: design a hindsight experience replay mechanism, generate auxiliary goals with the G-HGG algorithm, screen actions with a pre-trained action network, add exploration noise and an energy function, and accumulate the experience replay pool;
Step 4: build the robotic arm model and scene information, estimate the value function and update the policy from interaction data to obtain a deterministic policy, and complete the target grasping task.
Step 1 specifically includes the following:
According to the characteristics of the robotic arm grasping task, a four-tuple (S, A, r, γ) is defined, specifically:
S1-1: S is the state space, representing the set of information observed by the agent, specifically the positions and velocities of all joints of the robotic arm, the position of the end effector, the target position, and the position and orientation of the object;
S1-2: A is the action space, representing the set of operations executed by the agent, specifically the three-dimensional coordinate increment of the end effector and the opening and closing of the gripper;
S1-3: r is the reward function, used to evaluate the effect of the action executed by the agent and thereby guide the agent toward the desired goal; a binary sparse reward function, equation (1), is adopted, in which the reward indicates success only when d ≤ d_th,
where g is the auxiliary goal generated by the algorithm, d is the Euclidean distance from the target object to the auxiliary goal position, and d_th is the minimum threshold of that distance;
S1-4: γ is the discount factor, determining how much importance the agent attaches to future rewards.
Step 3 specifically includes the following:
S3-1: Create an experience replay pool by adding an energy function to a standard experience pool. The pool stores data as the seven-tuple (s_t||g, a_t, r_t, s_{t+1}||g, done, p, E_traj), and the main and target networks are trained by batch sampling from the pool, where s_t||g is the joint representation of the agent's state s_t at time t and the goal g, a_t is the action taken by the agent at time t, r_t is the reward obtained after taking action a_t, s_{t+1} is the state reached at time t+1 after taking action a_t in state s_t, done is a Boolean task flag indicating whether the current episode's task is completed, p is the replay priority of the current trajectory, and E_traj is the total energy of the experience trajectory, used to determine the replay priority;
S3-2: Create a graph model G = (V, E) for the environment information, where V is the set of nodes in the graph and E is the set of weighted edges, each edge connecting two possible nodes p_1, p_2 with weight ω;
S3-3: Use Dijkstra's algorithm to pre-compute the shortest distance between any two nodes (p_1, p_2) ∈ P²;
S3-4: Load a set of pre-trained policies and define L goals corresponding to the K pre-trained policies;
S3-5: Based on the G-HGG algorithm, construct an auxiliary goal set containing M auxiliary goals;
S3-6: Use the auxiliary goals generated in step S3-5 to replace the original goal (s_0, g);
S3-7: At time t, adjust the state and goal to (s′, g′) according to the goal o_l, determine a candidate action from the pre-trained policy, and store this action in the action set A_t;
S3-8: Repeat step S3-7 a total of L times;
S3-9: Repeat step S3-8 a total of K times;
S3-10: Input the observed state s_t of the robotic arm at time t into the Actor network, output an action according to the deterministic policy π(s; θ), add Gaussian noise N_t with mean 0 and variance σ² to that action to obtain the time-t action, and store this action in the action set A_t;
S3-11: Select the optimal action according to equation (2), execute it, and obtain the state s_{t+1} at time t+1:
a* = argmax Q^π(s_t, g, A_t)  (2)
S3-12: Repeat steps S3-9, S3-10, and S3-11 a total of T times;
S3-13: Define the potential energy function E_p(s_t) = m·g·z_t, where m is the mass of the object, g ≈ 9.81 m/s² is the gravitational acceleration, and z_t is the z-axis coordinate of the object at time t; define the kinetic energy function as equation (3):
E_k(s_t) = (m/2)·[(x_t - x_{t-1})² + (y_t - y_{t-1})² + (z_t - z_{t-1})²]/Δt²  (3)
where (x_t, y_t, z_t) and (x_{t-1}, y_{t-1}, z_{t-1}) are the three-dimensional coordinates of the target object at times t and t-1, respectively, and Δt is the sampling time interval of the robotic arm;
Define the state transition energy function as equation (4):
E_tran(s_{t-1}, s_t) = min(E(s_t) - E(s_{t-1}), E_max)  (4)
where E_max denotes the upper bound of the transition energy, a preset value, and E(s_t) is the state energy function, defined as E(s_t) = E_p(s_t) + E_k(s_t);
Compute the trajectory energy E_traj from the energy function, defined as equation (5):
E_traj = Σ_{t=1}^{T} E_tran(s_{t-1}, s_t)  (5)
S3-14: Compute the replay priority p(T) of the experience trajectory from E_traj according to equation (6) (a code sketch of this energy-based prioritization follows this list);
S3-15: Compute the reward r_t at time t, store the corresponding six-tuple in the experience pool R, and sample a set of auxiliary goals g′ from the auxiliary goal set;
S3-16: Compute the auxiliary reward r_t′ at time t and store the auxiliary experience in the experience pool R;
S3-17: Repeat step S3-16 for each auxiliary goal g′;
S3-18: Repeat steps S3-15 and S3-17 a total of T times.
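The sketch below follows steps S3-13 and S3-14 as reconstructed above: potential plus finite-difference kinetic energy per state, transition energies clipped at the upper bound, and the sum taken as the trajectory energy E_traj. Because the patent does not reproduce equation (6), the final conversion of E_traj into a priority p (energy normalized by the total energy in the pool) is shown only as one plausible assumption; the values of m, Δt, and E_max are likewise illustrative.

```python
import numpy as np

def trajectory_energy(positions, m=0.1, dt=0.04, e_max=1.0, g=9.81):
    """Trajectory energy E_traj, steps S3-13, equations (3)-(5) (sketch).

    positions: array of shape (T+1, 3), object coordinates (x, y, z) at each step.
    m (object mass), dt (sampling interval) and e_max (upper bound of the
    transition energy) are illustrative values.
    """
    p = np.asarray(positions, dtype=float)
    e_pot = m * g * p[:, 2]                                   # E_p(s_t) = m*g*z_t
    disp2 = np.sum(np.diff(p, axis=0) ** 2, axis=1)           # squared displacement per step
    e_kin = np.concatenate(([0.0], 0.5 * m * disp2 / dt**2))  # E_k(s_t), equation (3)
    e_state = e_pot + e_kin                                   # E(s_t) = E_p + E_k
    e_tran = np.minimum(np.diff(e_state), e_max)              # equation (4), clipped above
    return float(np.sum(e_tran))                              # equation (5)

def replay_priorities(trajectory_energies):
    """Assumed form of equation (6): priority proportional to trajectory energy."""
    e = np.asarray(trajectory_energies, dtype=float)
    total = e.sum()
    return e / total if total > 0 else np.full_like(e, 1.0 / len(e))
```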
Step 4 specifically includes the following:
S4-1: Set the environment coordinate system O-XYZ, set the initial state of the robotic arm, and determine its workspace;
S4-2: Set the scene information, such as obstacles and target objects;
S4-3: Determine the observed state information and action information of the robotic arm, and determine its update time interval;
S4-4: Interact with the environment based on the DDPG algorithm, and accumulate the experience replay pool using the experience-pool processing method of step 3 (a sketch of how hindsight experiences are stored follows this list of steps);
S4-5: Sample a minibatch B of size batchsize from the experience replay pool R;
S4-6: Sort the samples in B according to p; the larger the value of p, the higher the replay priority;
S4-7: Feed the experiences in B into the network for training in order of replay priority; input the state-action pair (s, a) into the Critic network and compute its Q value Q(s, a; μ), where μ denotes the Critic network parameters, updated by minimizing a loss function;
S4-8: Input the state s into the Actor network to obtain the output action a from the deterministic policy π(s; θ), where θ denotes the Actor network parameters, updated by gradient descent;
S4-9: Softly update the target network parameters from the main network parameters:
θ′=τ·θ+(1-τ)·θ′ (7)
μ′=τ·μ+(1-τ)·μ′ (8)
where θ′ and μ′ are the parameters of the Target Actor network and the Target Critic network in the target network, respectively, τ is the soft-update magnitude of the network parameters, and τ ∈ (0, 1);
S4-10: Update the experience replay pool by discarding the oldest experience trajectory and accumulating a new set of experiences;
S4-11: Repeat the training loop until the set number of episodes is reached, at which point training ends.
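As a companion to step S4-4 and the replay construction of steps S3-15 to S3-17, the sketch below shows how a finished trajectory could be stored once with its original goal and again with each sampled auxiliary goal g′, with the reward recomputed against the relabelled goal. The buffer layout, the `achieved_goal` field, and the reuse of the `binary_sparse_reward` helper from the earlier sketch are assumptions for illustration.

```python
def store_with_hindsight(replay_pool, trajectory, goal, aux_goals, priority,
                         reward_fn, traj_energy):
    """Store a trajectory with its original goal and with auxiliary goals (sketch).

    trajectory: list of dicts with keys 'state', 'action', 'next_state',
                'achieved_goal', 'done' (assumed layout).
    reward_fn:  e.g. binary_sparse_reward(achieved_goal, goal) from step S1-3.
    replay_pool: any appendable container, e.g. a plain list, standing in for R.
    """
    for step in trajectory:
        # S3-15: original-goal experience, reward computed against the goal g.
        r = reward_fn(step['achieved_goal'], goal)
        replay_pool.append((step['state'], goal, step['action'], r,
                            step['next_state'], step['done'], priority, traj_energy))

        # S3-16/S3-17: relabelled experience for every sampled auxiliary goal g'.
        for g_aux in aux_goals:
            r_aux = reward_fn(step['achieved_goal'], g_aux)
            replay_pool.append((step['state'], g_aux, step['action'], r_aux,
                                step['next_state'], step['done'], priority, traj_energy))
```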
To better understand the present invention, the sparse-reward-oriented deep reinforcement learning robotic arm grasping method is described in detail below with reference to a specific embodiment.
The simulation scene built in MuJoCo is shown in FIG. 2: a 7-degree-of-freedom robotic arm, model KUKA LBR iiwa 7 R800, equipped with a two-finger gripper, performs a grasping task with an obstacle. The main-network Actor-Critic structure of the DDPG algorithm is built as an MLP. The Actor network contains three fully connected hidden layers of 256 neurons each, with the ReLU function as the nonlinear activation of the hidden layers and a tanh function at the output layer to produce the continuous action policy. The Critic network likewise contains three fully connected hidden layers of 256 neurons each with ReLU activations, and its output layer has a single neuron with no activation function for value prediction. The Actor-Critic structure of the target network is identical to that of the main network; only the network parameters differ (a sketch of this structure follows this paragraph). An experience pool is created whose data take the form of the seven-tuple (s_t||g, a_t, r_t, s_{t+1}||g, done, p, E_traj); the main and target networks are trained by batch sampling from the pool, where s_t||g is the joint representation of the agent's state s_t at time t and the goal g, a_t is the action taken at time t, r_t is the reward obtained after taking action a_t, s_{t+1} is the state reached at time t+1 after taking action a_t in state s_t, done is a Boolean task flag indicating whether the current episode's task is completed, p is the replay priority of the current trajectory, and E_traj is the total energy of the experience trajectory, used to determine the replay priority. The graph model created for the environment information is shown in FIG. 3, and FIG. 4 shows the shortest paths between nodes computed with Dijkstra's algorithm.
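The network structure described in this embodiment (three fully connected hidden layers of 256 ReLU units, a tanh output for the Actor, and a single linear output for the Critic) can be sketched as below. PyTorch is used only for illustration, and the input sizes `obs_dim`, `goal_dim`, and `act_dim` are placeholders for the environment's actual dimensions.

```python
import torch.nn as nn

def mlp(in_dim, out_dim, out_act=None):
    """Three 256-unit ReLU hidden layers, as described in this embodiment."""
    layers = [nn.Linear(in_dim, 256), nn.ReLU(),
              nn.Linear(256, 256), nn.ReLU(),
              nn.Linear(256, 256), nn.ReLU(),
              nn.Linear(256, out_dim)]
    if out_act is not None:
        layers.append(out_act)
    return nn.Sequential(*layers)

obs_dim, goal_dim, act_dim = 25, 3, 4          # placeholder sizes (assumption)

# Actor: state||goal -> continuous action via tanh at the output layer.
actor = mlp(obs_dim + goal_dim, act_dim, out_act=nn.Tanh())
# Critic: (state||goal, action) -> scalar Q value, no output activation.
critic = mlp(obs_dim + goal_dim + act_dim, 1)

# Target networks share the structure and start from the same parameters.
target_actor = mlp(obs_dim + goal_dim, act_dim, out_act=nn.Tanh())
target_actor.load_state_dict(actor.state_dict())
target_critic = mlp(obs_dim + goal_dim + act_dim, 1)
target_critic.load_state_dict(critic.state_dict())
```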
A minibatch B of size batchsize is sampled from the experience replay pool R, the samples in B are sorted by p (the larger the value of p, the higher the replay priority), and the experiences are fed into the network for training in order of replay priority. The state-action pair (s, a) is input into the Critic network to compute its Q value Q(s, a; μ), and the Critic network parameters μ are updated by minimizing a loss function; the state s is input into the Actor network to obtain the output action a from the deterministic policy π(s; θ), and the Actor network parameters θ are updated by gradient descent (a sketch of one such training iteration is given after this passage). The target network parameters are softly updated from the main network parameters according to:
θ′=τ·θ+(1-τ)·θ′ (7)
μ′=τ·μ+(1-τ)·μ′ (8)
The experience replay pool is then updated by discarding the oldest experience trajectory and accumulating a new set of experiences, and training is repeated until the set number of episodes is reached.
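One training iteration of this kind can be sketched as follows, reusing the `actor`, `critic`, and target networks from the previous sketch. The bootstrapped critic target y = r + γ(1 - done)·Q′(s′, π′(s′)) is assumed from the general DDPG algorithm, since the patent only states that the Critic is updated by minimizing a loss function; γ = 0.98 and τ = 0.005 are illustrative values, and the minibatch is assumed to arrive as tensors already sorted by replay priority p.

```python
import torch
import torch.nn.functional as F

def train_step(batch, actor, critic, target_actor, target_critic,
               actor_opt, critic_opt, gamma=0.98, tau=0.005):
    """One DDPG update on a priority-sorted minibatch B (sketch, assumed tensor layout).

    batch: dict of tensors 'sg' (state||goal), 'a', 'r', 'sg_next', 'done'.
    """
    with torch.no_grad():
        a_next = target_actor(batch['sg_next'])
        q_next = target_critic(torch.cat([batch['sg_next'], a_next], dim=-1))
        # Assumed standard DDPG target; the patent only says "minimize a loss function".
        y = batch['r'] + gamma * (1.0 - batch['done']) * q_next

    # Critic update: minimize the error between Q(s, a; mu) and the target y.
    q = critic(torch.cat([batch['sg'], batch['a']], dim=-1))
    critic_loss = F.mse_loss(q, y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: gradient descent on -Q(s, pi(s; theta)), i.e. ascent on Q.
    actor_loss = -critic(torch.cat([batch['sg'], actor(batch['sg'])], dim=-1)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft update of the target networks, equations (7)-(8).
    for tgt, src in ((target_actor, actor), (target_critic, critic)):
        for p_t, p_s in zip(tgt.parameters(), src.parameters()):
            p_t.data.mul_(1.0 - tau).add_(tau * p_s.data)
```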
The training curve obtained after training, shown in FIG. 5, exhibits fast convergence and a high success rate; the target grasping task is accomplished, verifying the feasibility of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (4)

1. A sparse-reward-oriented deep reinforcement learning robotic arm grasping method, characterized by comprising the following steps:
Step 1: analyze the characteristics of the robotic arm grasping task, model it as a Markov decision problem, and design a binary sparse reward function;
Step 2: based on the deep reinforcement learning DDPG algorithm, construct the main network and target network of an Actor-Critic structure;
Step 3: design a hindsight experience replay mechanism and use the G-HGG algorithm to generate auxiliary goals, the specific steps comprising:
S3-1: create an experience replay pool by adding an energy function to a standard experience pool, the pool storing data as the seven-tuple (s_t||g, a_t, r_t, s_{t+1}||g, done, p, E_traj), and train the main network and the target network by batch sampling from the pool, where s_t||g is the joint representation of the agent's state s_t at time t and the goal g, a_t is the action taken by the agent at time t, r_t is the reward obtained after taking action a_t, s_{t+1} is the state reached at time t+1 after taking action a_t in state s_t, done is a Boolean task flag indicating whether the current episode's task is completed, p is the replay priority of the current trajectory, and E_traj is the total energy of the experience trajectory, used to determine the replay priority;
S3-2: create a graph model G = (V, E) for the environment information, where V is the set of nodes in the graph and E is the set of weighted edges, each edge connecting two possible nodes p_1, p_2 with weight ω;
S3-3: use Dijkstra's algorithm to pre-compute the shortest distance between any two nodes (p_1, p_2) ∈ P²;
S3-4: load a set of pre-trained policies and define L goals corresponding to the K pre-trained policies;
S3-5: based on the G-HGG algorithm, construct an auxiliary goal set containing M auxiliary goals;
S3-6: use the auxiliary goals generated in step S3-5 to replace the original goal (s_0, g);
S3-7: at time t, adjust the state and goal to (s′, g′) according to the goal o_l, determine a candidate action from the pre-trained policy, and store this action in the action set A_t;
S3-8: repeat step S3-7 a total of L times;
S3-9: repeat step S3-8 a total of K times;
then use the pre-trained action network for action screening, add exploration noise and an energy function, and accumulate the experience replay pool;
Step 4: build the robotic arm model and scene information, estimate the value function and update the policy from interaction data to obtain a deterministic policy, and complete the target grasping task.
2. The method according to claim 1, characterized in that step 1 specifically includes the following:
according to the characteristics of the robotic arm grasping task, define a four-tuple (S, A, r, γ), specifically:
S1-1: S is the state space, representing the set of information observed by the agent, specifically the positions and velocities of all joints of the robotic arm, the position of the end effector, the target position, and the position and orientation of the object;
S1-2: A is the action space, representing the set of operations executed by the agent, specifically the three-dimensional coordinate increment of the end effector and the opening and closing of the gripper;
S1-3: r is the reward function, used to evaluate the effect of the action executed by the agent and thereby guide the agent toward the desired goal; a binary sparse reward function, equation (1), is adopted, in which the reward indicates success only when d ≤ d_th, where g is the auxiliary goal generated by the algorithm, d is the Euclidean distance from the target object to the auxiliary goal position, and d_th is the minimum threshold of that distance;
S1-4: γ is the discount factor, determining how much importance the agent attaches to future rewards.
3. The method according to claim 2, characterized in that, in step 3, the pre-trained action network is used for action screening, exploration noise and an energy function are added, and the experience replay pool is accumulated, specifically including the following:
S3-10: input the observed state s_t of the robotic arm at time t into the Actor network, output an action according to the deterministic policy π(s; θ), add Gaussian noise N_t with mean 0 and variance σ² to that action to obtain the time-t action, and store this action in the action set A_t;
S3-11: select the optimal action according to equation (2), execute it, and obtain the state s_{t+1} at time t+1:
a* = argmax Q^π(s_t, g, A_t)  (2)
S3-12: repeat steps S3-9, S3-10, and S3-11 a total of T times;
S3-13: define the potential energy function E_p(s_t) = m·g·z_t, where m is the mass of the object, g ≈ 9.81 m/s² is the gravitational acceleration, and z_t is the z-axis coordinate of the object at time t; define the kinetic energy function as equation (3):
E_k(s_t) = (m/2)·[(x_t - x_{t-1})² + (y_t - y_{t-1})² + (z_t - z_{t-1})²]/Δt²  (3)
where (x_t, y_t, z_t) and (x_{t-1}, y_{t-1}, z_{t-1}) are the three-dimensional coordinates of the target object at times t and t-1, respectively, and Δt is the sampling time interval of the robotic arm;
define the state transition energy function as equation (4):
E_tran(s_{t-1}, s_t) = min(E(s_t) - E(s_{t-1}), E_max)  (4)
where E_max denotes the upper bound of the transition energy, a preset value, and E(s_t) is the state energy function, defined as E(s_t) = E_p(s_t) + E_k(s_t);
compute the trajectory energy E_traj from the energy function, defined as equation (5):
E_traj = Σ_{t=1}^{T} E_tran(s_{t-1}, s_t)  (5)
S3-14: compute the replay priority p(T) of the experience trajectory according to equation (6);
S3-15: compute the reward r_t at time t, store the corresponding six-tuple in the experience pool R, and sample a set of auxiliary goals g′ from the auxiliary goal set;
S3-16: compute the auxiliary reward r_t′ at time t and store the auxiliary experience in the experience pool R;
S3-17: repeat step S3-16 for each auxiliary goal g′;
S3-18: repeat steps S3-15 and S3-17 a total of T times.
4. The method according to claim 1, characterized in that step 4 specifically includes the following:
S4-1: set the environment coordinate system O-XYZ, set the initial state of the robotic arm, and determine its workspace;
S4-2: set the scene information of obstacles and target objects;
S4-3: determine the observed state information and action information of the robotic arm, and determine its update time interval;
S4-4: interact with the environment based on the DDPG algorithm, and accumulate the experience replay pool using the experience-pool processing method of step 3;
S4-5: sample a minibatch B of size batchsize from the experience replay pool R;
S4-6: sort the samples in B according to p; the larger the value of p, the higher the replay priority;
S4-7: feed the experiences in B into the network for training in order of replay priority; input the state-action pair (s, a) into the Critic network and compute its Q value Q(s, a; μ), where μ denotes the Critic network parameters, updated by minimizing a loss function;
S4-8: input the state s into the Actor network to obtain the output action a from the deterministic policy π(s; θ), where θ denotes the Actor network parameters, updated by gradient descent;
S4-9: softly update the target network parameters from the main network parameters:
θ′ = τ·θ + (1-τ)·θ′  (7)
μ′ = τ·μ + (1-τ)·μ′  (8)
where θ′ and μ′ are the parameters of the Target Actor network and the Target Critic network in the target network, respectively, τ is the soft-update magnitude of the network parameters, and τ ∈ (0, 1);
S4-10: update the experience replay pool by discarding the oldest experience trajectory and accumulating a new set of experiences;
S4-11: repeat the training loop until the set number of episodes is reached, at which point training ends.
CN202410677163.7A 2024-05-29 2024-05-29 A deep reinforcement learning robotic grasping method for sparse rewards Active CN118493388B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410677163.7A CN118493388B (en) 2024-05-29 2024-05-29 A deep reinforcement learning robotic grasping method for sparse rewards

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410677163.7A CN118493388B (en) 2024-05-29 2024-05-29 A deep reinforcement learning robotic grasping method for sparse rewards

Publications (2)

Publication Number Publication Date
CN118493388A CN118493388A (en) 2024-08-16
CN118493388B true CN118493388B (en) 2025-03-11

Family

ID=92239443

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410677163.7A Active CN118493388B (en) 2024-05-29 2024-05-29 A deep reinforcement learning robotic grasping method for sparse rewards

Country Status (1)

Country Link
CN (1) CN118493388B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119567243A (en) * 2024-10-25 2025-03-07 安徽大学 Human-robot interaction and anti-collision control method for collaborative robots based on admittance
CN119427356B (en) * 2024-11-18 2025-06-24 东莞理工学院 Robot tracking control learning method based on posthoc screening experience playback
CN119871469B (en) * 2025-03-31 2025-06-13 苏州元脑智能科技有限公司 Mechanical arm control method and device, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112102405A (en) * 2020-08-26 2020-12-18 东南大学 Robot stirring-grabbing combined method based on deep reinforcement learning
CN116038691A (en) * 2022-12-08 2023-05-02 南京理工大学 A Continuum Manipulator Motion Control Method Based on Deep Reinforcement Learning

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3747604B1 (en) * 2019-06-07 2022-01-26 Robert Bosch GmbH Robot device controller, robot device arrangement and method for controlling a robot device
CN116494247A (en) * 2023-06-14 2023-07-28 西安电子科技大学广州研究院 Robotic arm path planning method and system based on deep deterministic policy gradient
CN117733841A (en) * 2023-12-06 2024-03-22 南京邮电大学 Mechanical arm complex operation skill learning method and system based on generation of countermeasure imitation learning

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112102405A (en) * 2020-08-26 2020-12-18 东南大学 Robot stirring-grabbing combined method based on deep reinforcement learning
CN116038691A (en) * 2022-12-08 2023-05-02 南京理工大学 A Continuum Manipulator Motion Control Method Based on Deep Reinforcement Learning

Also Published As

Publication number Publication date
CN118493388A (en) 2024-08-16


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant