
CN118493388B - A deep reinforcement learning robotic grasping method for sparse rewards - Google Patents

A deep reinforcement learning robotic grasping method for sparse rewards Download PDF

Info

Publication number
CN118493388B
CN118493388B (Application CN202410677163.7A)
Authority
CN
China
Prior art keywords
action
experience
network
target
state
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410677163.7A
Other languages
Chinese (zh)
Other versions
CN118493388A (en)
Inventor
杨春雨
李博论
韩可可
刘晓敏
周林娜
张鑫
马磊
王国庆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China University of Mining and Technology CUMT
Original Assignee
China University of Mining and Technology CUMT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China University of Mining and Technology CUMT filed Critical China University of Mining and Technology CUMT
Priority to CN202410677163.7A priority Critical patent/CN118493388B/en
Publication of CN118493388A publication Critical patent/CN118493388A/en
Application granted granted Critical
Publication of CN118493388B publication Critical patent/CN118493388B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • B - PERFORMING OPERATIONS; TRANSPORTING
    • B25 - HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J - MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00 - Programme-controlled manipulators
    • B25J9/16 - Programme controls
    • B25J9/1628 - Programme controls characterised by the control loop
    • B25J9/163 - Programme controls characterised by the control loop learning, adaptive, model based, rule based expert control
    • B - PERFORMING OPERATIONS; TRANSPORTING
    • B25 - HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J - MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00 - Programme-controlled manipulators
    • B25J9/16 - Programme controls
    • B25J9/1612 - Programme controls characterised by the hand, wrist, grip control
    • B - PERFORMING OPERATIONS; TRANSPORTING
    • B25 - HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J - MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00 - Programme-controlled manipulators
    • B25J9/16 - Programme controls
    • B25J9/1656 - Programme controls characterised by programming, planning systems for manipulators
    • B25J9/1664 - Programme controls characterised by programming, planning systems for manipulators characterised by motion, path, trajectory planning

Landscapes

  • Engineering & Computer Science (AREA)
  • Robotics (AREA)
  • Mechanical Engineering (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Orthopedic Medicine & Surgery (AREA)
  • Manipulator (AREA)

Abstract

The invention discloses a sparse-reward-oriented deep reinforcement learning method for robotic arm grasping. The grasping task is first analyzed and modeled as a Markov decision problem, and a binary sparse reward is designed, which reduces the complexity and cost of reward-function design. The DDPG algorithm is then taken as the main deep reinforcement learning training framework, and an Actor-Critic network structure is built to handle the continuous state and action spaces. A hindsight experience replay mechanism is designed next: the G-HGG algorithm assists goal generation, a pre-trained action network screens actions, and exploration noise and an energy function are added when accumulating the experience pool, which raises experience utilization and improves training efficiency and grasping success rate. Finally, the robotic arm model and scene information are built, and training is optimized with interaction data to achieve target grasping by the robotic arm.

Description

Deep reinforcement learning robotic arm grasping method for sparse rewards
Technical Field
The invention relates to the field of robotic arm grasping, and in particular to a sparse-reward-oriented deep reinforcement learning grasping method for robotic arms.
Background
Robotic arm grasping is one of the fundamental directions in robotics and plays a key role in industrial automation: it can improve production efficiency, reduce labor costs, and enable automated production. Existing grasping control methods fall mainly into two categories, classical analytical methods and data-driven methods. Classical analytical methods are the traditional approach to target grasping: geometric, contact, and rigid-body models must be established, and kinematic, dynamic, and mechanical analyses carried out. Their limitation is that the mathematical and physical modeling in some scenes is very complex and the modeling process is difficult to complete. In recent years, data-driven methods represented by deep reinforcement learning have developed rapidly, and researchers at home and abroad have devoted themselves to the study of data-driven robotic arm grasping methods.
Deep reinforcement learning combines the strong feature-extraction capability of deep learning with the excellent decision-making capability of reinforcement learning; it can cope with large-scale complex environments, generalizes well, and enables end-to-end learning. Actor-Critic algorithms such as DDPG, SAC, and PPO handle continuous state and action spaces well, which is essential in grasping tasks where the robotic arm must execute precise grasping actions in a continuous space. In deep reinforcement learning based robotic arm grasping, the agent optimizes its decisions through rewards; in more complex scenes the agent rarely obtains a positive reward, and sparse reward signals lead to slow network convergence or even training failure. There is still considerable room for exploring solutions to the problems caused by sparse rewards, so a sparse-reward-oriented deep reinforcement learning robotic arm grasping method is urgently needed.
Disclosure of Invention
In view of these technical shortcomings, the invention aims to provide a sparse-reward-oriented deep reinforcement learning robotic arm grasping method. The DDPG algorithm is used to learn a deterministic policy, which can handle the continuous state and action space of the robotic arm; random exploration noise is added to the actions to enhance the arm's exploration capability; the learning problem is simplified, the design cost is reduced, and the design is highly interpretable; and experience utilization, grasping success rate, network convergence speed, and training efficiency are all improved.
In order to solve the technical problems, the invention adopts the following technical scheme:
The invention provides a sparse-reward-oriented deep reinforcement learning robotic arm grasping method, comprising the following steps:
Step 1: analyze the characteristics of the robotic arm grasping task, model it as a Markov decision problem, and design a binary sparse reward function;
Step 2: based on the deep reinforcement learning DDPG algorithm, construct the main network and target network of an Actor-Critic structure;
Step 3: design a hindsight experience replay mechanism, generate auxiliary goals with the G-HGG algorithm, screen actions with a pre-trained action network, add exploration noise and an energy function, and accumulate the experience replay pool;
Step 4: build the robotic arm model and scene information, estimate the value function and update the policy from interaction data to obtain a deterministic policy, and complete the target grasping task.
Preferably, step 1 specifically includes the following:
According to the characteristics of the robotic arm grasping task, a four-tuple (S, A, r, γ) is defined, specifically:
S1-1: S is the state space, representing the set of information observed by the agent, specifically the positions and velocities of all joints of the robotic arm, the position of the end effector, the target position, and the position and orientation of the object;
S1-2: A is the action space, representing the set of operations executed by the agent, specifically the three-dimensional coordinate increment of the end effector and the opening and closing of the gripper;
S1-3: r is the reward function, used to evaluate the effect of the action executed by the agent and thereby guide the agent toward the desired goal; a binary sparse reward function, equation (1), is adopted, in which the reward indicates success only when d ≤ d_th,
where g is the auxiliary goal generated by the algorithm, d is the Euclidean distance from the target object to the auxiliary goal position, and d_th is the minimum threshold of that distance (a minimal code sketch of this reward follows this list);
S1-4: γ is the discount factor, determining how much importance the agent attaches to future rewards.
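As an illustration of the binary sparse reward of step S1-3, the sketch below assumes the common convention of returning 0 when the object is within the distance threshold of the auxiliary goal and -1 otherwise; the exact success/failure values and the threshold value d_th are illustrative assumptions, not taken from the patent text.

```python
import numpy as np

def binary_sparse_reward(object_pos, goal_pos, d_th=0.05):
    """Binary sparse reward of step S1-3 (sketch).

    Assumes the 0 / -1 convention: 0 when the Euclidean distance d from the
    target object to the auxiliary goal g is within the threshold d_th,
    -1 otherwise. The threshold of 0.05 m is an illustrative assumption.
    """
    d = np.linalg.norm(np.asarray(object_pos) - np.asarray(goal_pos))
    return 0.0 if d <= d_th else -1.0

# Example: an object about 3.5 cm from the goal counts as a success under a 5 cm threshold.
print(binary_sparse_reward([0.40, 0.02, 0.43], [0.42, 0.00, 0.41]))  # -> 0.0
```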
Preferably, step 3 specifically includes the following:
S3-1: Create an experience replay pool by adding an energy function to a standard experience pool. The pool stores data as the seven-tuple (s_t||g, a_t, r_t, s_{t+1}||g, done, p, E_traj), and the main and target networks are trained by batch sampling from the pool, where s_t||g is the joint representation of the agent's state s_t at time t and the goal g, a_t is the action taken by the agent at time t, r_t is the reward obtained after taking action a_t, s_{t+1} is the state reached at time t+1 after taking action a_t in state s_t, done is a Boolean task flag indicating whether the current episode's task is completed, p is the replay priority of the current trajectory, and E_traj is the total energy of the experience trajectory, used to determine the replay priority;
S3-2: Create a graph model G = (V, E) for the environment information, where V is the set of nodes in the graph and E is the set of weighted edges, each edge connecting two possible nodes p_1, p_2 with weight ω;
S3-3: Use Dijkstra's algorithm to pre-compute the shortest distance between any two nodes (p_1, p_2) ∈ P²;
S3-4: Load a set of pre-trained policies and define L goals corresponding to the K pre-trained policies;
S3-5: Based on the G-HGG algorithm, construct an auxiliary goal set containing M auxiliary goals;
S3-6: Use the auxiliary goals generated in step S3-5 to replace the original goal (s_0, g);
S3-7: At time t, adjust the state and goal to (s′, g′) according to the goal o_l, determine a candidate action from the pre-trained policy, and store this action in the action set A_t;
S3-8: Repeat step S3-7 a total of L times;
S3-9: Repeat step S3-8 a total of K times;
S3-10: Input the observed state s_t of the robotic arm at time t into the Actor network, output an action according to the deterministic policy π(s; θ), add Gaussian noise N_t with mean 0 and variance σ² to that action to obtain the time-t action, and store this action in the action set A_t;
S3-11: Select the optimal action according to equation (2), execute it, and obtain the state s_{t+1} at time t+1 (a code sketch of this action screening follows this list):
a* = argmax Q^π(s_t, g, A_t)  (2)
S3-12: Repeat steps S3-9, S3-10, and S3-11 a total of T times;
S3-13: Define the potential energy function E_p(s_t) = m·g·z_t, where m is the mass of the object, g ≈ 9.81 m/s² is the gravitational acceleration, and z_t is the z-axis coordinate of the object at time t; define the kinetic energy function as equation (3):
E_k(s_t) = (m/2)·[(x_t - x_{t-1})² + (y_t - y_{t-1})² + (z_t - z_{t-1})²]/Δt²  (3)
where (x_t, y_t, z_t) and (x_{t-1}, y_{t-1}, z_{t-1}) are the three-dimensional coordinates of the target object at times t and t-1, respectively, and Δt is the sampling time interval of the robotic arm;
Define the state transition energy function as equation (4):
E_tran(s_{t-1}, s_t) = min(E(s_t) - E(s_{t-1}), E_max)  (4)
where E_max denotes the upper bound of the transition energy, a preset value, and E(s_t) is the state energy function, defined as E(s_t) = E_p(s_t) + E_k(s_t);
Compute the trajectory energy E_traj from the energy function, defined as equation (5):
E_traj = Σ_{t=1}^{T} E_tran(s_{t-1}, s_t)  (5)
S3-14: Compute the replay priority p(T) of the experience trajectory according to equation (6);
S3-15: Compute the reward r_t at time t, store the corresponding six-tuple in the experience pool R, and sample a set of auxiliary goals g′ from the auxiliary goal set;
S3-16: Compute the auxiliary reward r_t′ at time t and store the auxiliary experience in the experience pool R;
S3-17: Repeat step S3-16 for each auxiliary goal g′;
S3-18: Repeat steps S3-15 and S3-17 a total of T times.
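To make steps S3-7 through S3-11 concrete, the sketch below shows one way the candidate-action set A_t could be assembled from pre-trained policies and the noisy Actor output, and the best action selected by equation (2). The call signatures of `pretrained_policies`, `actor`, and `critic`, and the noise scale, are illustrative assumptions rather than the patent's implementation.

```python
import numpy as np

def screen_action(s_t, g, goals, pretrained_policies, actor, critic, sigma=0.1, rng=None):
    """Action screening sketch for steps S3-7 to S3-11 (assumed interfaces).

    - pretrained_policies: list of callables pi_k(state, goal) -> action
    - actor:  callable pi(state, goal) -> action (deterministic main policy)
    - critic: callable Q(state, goal, action) -> scalar value
    """
    rng = rng or np.random.default_rng()
    candidates = []

    # S3-7/S3-8/S3-9: candidate actions from the K pre-trained policies and L goals.
    for pi_k in pretrained_policies:
        for o_l in goals:
            candidates.append(np.asarray(pi_k(s_t, o_l)))

    # S3-10: Actor action with zero-mean Gaussian exploration noise N_t ~ N(0, sigma^2).
    a_actor = np.asarray(actor(s_t, g))
    candidates.append(a_actor + rng.normal(0.0, sigma, size=a_actor.shape))

    # S3-11: a* = argmax over A_t of Q^pi(s_t, g, a), equation (2).
    q_values = [critic(s_t, g, a) for a in candidates]
    return candidates[int(np.argmax(q_values))]
```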
Preferably, step 4 specifically includes the following:
S4-1: Set the environment coordinate system O-XYZ, set the initial state of the robotic arm, and determine its workspace;
S4-2: Set the scene information, such as obstacles and target objects;
S4-3: Determine the observed state information and action information of the robotic arm, and determine its update time interval;
S4-4: Interact with the environment based on the DDPG algorithm, and accumulate the experience replay pool using the experience-pool processing method of step 3;
S4-5: Sample a minibatch B of size batchsize from the experience replay pool R;
S4-6: Sort the samples in B according to p; the larger the value of p, the higher the replay priority;
S4-7: Feed the experiences in B into the network for training in order of replay priority; input the state-action pair (s, a) into the Critic network and compute its Q value Q(s, a; μ), where μ denotes the Critic network parameters, updated by minimizing a loss function;
S4-8: Input the state s into the Actor network to obtain the output action a from the deterministic policy π(s; θ), where θ denotes the Actor network parameters, updated by gradient descent;
S4-9: Softly update the target network parameters from the main network parameters (a minimal sketch of this update follows this list):
θ′=τ·θ+(1-τ)·θ′ (7)
μ′=τ·μ+(1-τ)·μ′ (8)
where θ′ and μ′ are the parameters of the Target Actor network and the Target Critic network in the target network, respectively, τ is the soft-update magnitude of the network parameters, and τ ∈ (0, 1);
S4-10: Update the experience replay pool by discarding the oldest experience trajectory and accumulating a new set of experiences;
S4-11: Repeat the training loop until the set number of episodes is reached, at which point training ends.
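The soft update of equations (7) and (8) can be written directly as a parameter-wise blend. The sketch below assumes the network parameters are stored as NumPy arrays keyed by name, which is an illustrative representation rather than the patent's data structure, and the value τ = 0.005 is only an example.

```python
def soft_update(target_params, main_params, tau=0.005):
    """Polyak soft update, equations (7)-(8): theta' <- tau*theta + (1-tau)*theta'.

    target_params / main_params: dicts mapping parameter names to NumPy arrays.
    tau in (0, 1) is the soft-update magnitude; 0.005 is an assumed example value.
    """
    for name, theta in main_params.items():
        target_params[name] = tau * theta + (1.0 - tau) * target_params[name]
    return target_params
```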
The invention has the beneficial effects that:
1) The DDPG algorithm is used to learn a deterministic policy, which can handle the continuous state and action spaces of the robotic arm; random exploration noise is added to the actions, enhancing the arm's exploration capability;
2) The reward signal takes a binary sparse form, which reduces the complexity and cost of reward-function design, simplifies the learning problem, and is highly interpretable;
3) On top of hindsight experience replay, the G-HGG algorithm generates auxiliary goals, a pre-trained action network screens actions, and an energy function is added when processing the experience pool, which raises experience utilization, increases the grasping success rate, and improves network convergence speed and training efficiency.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a framework diagram of the sparse-reward-oriented deep reinforcement learning robotic arm grasping method;
FIG. 2 is a scene information diagram of the MuJoCo simulation environment built in an embodiment;
FIG. 3 is a schematic diagram of the graph model of the environment information provided in an embodiment;
FIG. 4 is a schematic diagram of the shortest paths between nodes provided in an embodiment;
Fig. 5 is a graph of the robotic arm grasping training results in an embodiment.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to FIG. 1, the sparse-reward-oriented deep reinforcement learning robotic arm grasping method includes the following steps:
Step 1: analyze the characteristics of the robotic arm grasping task, model it as a Markov decision problem, and design a binary sparse reward function;
Step 2: based on the deep reinforcement learning DDPG algorithm, construct the main network and target network of an Actor-Critic structure;
Step 3: design a hindsight experience replay mechanism, generate auxiliary goals with the G-HGG algorithm, screen actions with a pre-trained action network, add exploration noise and an energy function, and accumulate the experience replay pool;
Step 4: build the robotic arm model and scene information, estimate the value function and update the policy from interaction data to obtain a deterministic policy, and complete the target grasping task.
Step 1 specifically includes the following:
According to the characteristics of the robotic arm grasping task, a four-tuple (S, A, r, γ) is defined, specifically:
S1-1: S is the state space, representing the set of information observed by the agent, specifically the positions and velocities of all joints of the robotic arm, the position of the end effector, the target position, and the position and orientation of the object;
S1-2: A is the action space, representing the set of operations executed by the agent, specifically the three-dimensional coordinate increment of the end effector and the opening and closing of the gripper;
S1-3: r is the reward function, used to evaluate the effect of the action executed by the agent and thereby guide the agent toward the desired goal; a binary sparse reward function, equation (1), is adopted, in which the reward indicates success only when d ≤ d_th,
where g is the auxiliary goal generated by the algorithm, d is the Euclidean distance from the target object to the auxiliary goal position, and d_th is the minimum threshold of that distance;
S1-4: γ is the discount factor, determining how much importance the agent attaches to future rewards.
Step 3 specifically includes the following:
S3-1: Create an experience replay pool by adding an energy function to a standard experience pool. The pool stores data as the seven-tuple (s_t||g, a_t, r_t, s_{t+1}||g, done, p, E_traj), and the main and target networks are trained by batch sampling from the pool, where s_t||g is the joint representation of the agent's state s_t at time t and the goal g, a_t is the action taken by the agent at time t, r_t is the reward obtained after taking action a_t, s_{t+1} is the state reached at time t+1 after taking action a_t in state s_t, done is a Boolean task flag indicating whether the current episode's task is completed, p is the replay priority of the current trajectory, and E_traj is the total energy of the experience trajectory, used to determine the replay priority;
S3-2: Create a graph model G = (V, E) for the environment information, where V is the set of nodes in the graph and E is the set of weighted edges, each edge connecting two possible nodes p_1, p_2 with weight ω;
S3-3: Use Dijkstra's algorithm to pre-compute the shortest distance between any two nodes (p_1, p_2) ∈ P²;
S3-4: Load a set of pre-trained policies and define L goals corresponding to the K pre-trained policies;
S3-5: Based on the G-HGG algorithm, construct an auxiliary goal set containing M auxiliary goals;
S3-6: Use the auxiliary goals generated in step S3-5 to replace the original goal (s_0, g);
S3-7: At time t, adjust the state and goal to (s′, g′) according to the goal o_l, determine a candidate action from the pre-trained policy, and store this action in the action set A_t;
S3-8: Repeat step S3-7 a total of L times;
S3-9: Repeat step S3-8 a total of K times;
S3-10: Input the observed state s_t of the robotic arm at time t into the Actor network, output an action according to the deterministic policy π(s; θ), add Gaussian noise N_t with mean 0 and variance σ² to that action to obtain the time-t action, and store this action in the action set A_t;
S3-11: Select the optimal action according to equation (2), execute it, and obtain the state s_{t+1} at time t+1:
a* = argmax Q^π(s_t, g, A_t)  (2)
S3-12: Repeat steps S3-9, S3-10, and S3-11 a total of T times;
S3-13: Define the potential energy function E_p(s_t) = m·g·z_t, where m is the mass of the object, g ≈ 9.81 m/s² is the gravitational acceleration, and z_t is the z-axis coordinate of the object at time t; define the kinetic energy function as equation (3):
E_k(s_t) = (m/2)·[(x_t - x_{t-1})² + (y_t - y_{t-1})² + (z_t - z_{t-1})²]/Δt²  (3)
where (x_t, y_t, z_t) and (x_{t-1}, y_{t-1}, z_{t-1}) are the three-dimensional coordinates of the target object at times t and t-1, respectively, and Δt is the sampling time interval of the robotic arm;
Define the state transition energy function as equation (4):
E_tran(s_{t-1}, s_t) = min(E(s_t) - E(s_{t-1}), E_max)  (4)
where E_max denotes the upper bound of the transition energy, a preset value, and E(s_t) is the state energy function, defined as E(s_t) = E_p(s_t) + E_k(s_t);
Compute the trajectory energy E_traj from the energy function, defined as equation (5):
E_traj = Σ_{t=1}^{T} E_tran(s_{t-1}, s_t)  (5)
S3-14: Compute the replay priority p(T) of the experience trajectory from E_traj according to equation (6) (a code sketch of this energy-based prioritization follows this list);
S3-15: Compute the reward r_t at time t, store the corresponding six-tuple in the experience pool R, and sample a set of auxiliary goals g′ from the auxiliary goal set;
S3-16: Compute the auxiliary reward r_t′ at time t and store the auxiliary experience in the experience pool R;
S3-17: Repeat step S3-16 for each auxiliary goal g′;
S3-18: Repeat steps S3-15 and S3-17 a total of T times.
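The sketch below follows steps S3-13 and S3-14 as reconstructed above: potential plus finite-difference kinetic energy per state, transition energies clipped at the upper bound, and the sum taken as the trajectory energy E_traj. Because the patent does not reproduce equation (6), the final conversion of E_traj into a priority p (energy normalized by the total energy in the pool) is shown only as one plausible assumption; the values of m, Δt, and E_max are likewise illustrative.

```python
import numpy as np

def trajectory_energy(positions, m=0.1, dt=0.04, e_max=1.0, g=9.81):
    """Trajectory energy E_traj, steps S3-13, equations (3)-(5) (sketch).

    positions: array of shape (T+1, 3), object coordinates (x, y, z) at each step.
    m (object mass), dt (sampling interval) and e_max (upper bound of the
    transition energy) are illustrative values.
    """
    p = np.asarray(positions, dtype=float)
    e_pot = m * g * p[:, 2]                                   # E_p(s_t) = m*g*z_t
    disp2 = np.sum(np.diff(p, axis=0) ** 2, axis=1)           # squared displacement per step
    e_kin = np.concatenate(([0.0], 0.5 * m * disp2 / dt**2))  # E_k(s_t), equation (3)
    e_state = e_pot + e_kin                                   # E(s_t) = E_p + E_k
    e_tran = np.minimum(np.diff(e_state), e_max)              # equation (4), clipped above
    return float(np.sum(e_tran))                              # equation (5)

def replay_priorities(trajectory_energies):
    """Assumed form of equation (6): priority proportional to trajectory energy."""
    e = np.asarray(trajectory_energies, dtype=float)
    total = e.sum()
    return e / total if total > 0 else np.full_like(e, 1.0 / len(e))
```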
Step 4 specifically includes the following:
S4-1: Set the environment coordinate system O-XYZ, set the initial state of the robotic arm, and determine its workspace;
S4-2: Set the scene information, such as obstacles and target objects;
S4-3: Determine the observed state information and action information of the robotic arm, and determine its update time interval;
S4-4: Interact with the environment based on the DDPG algorithm, and accumulate the experience replay pool using the experience-pool processing method of step 3 (a sketch of how hindsight experiences are stored follows this list of steps);
S4-5: Sample a minibatch B of size batchsize from the experience replay pool R;
S4-6: Sort the samples in B according to p; the larger the value of p, the higher the replay priority;
S4-7: Feed the experiences in B into the network for training in order of replay priority; input the state-action pair (s, a) into the Critic network and compute its Q value Q(s, a; μ), where μ denotes the Critic network parameters, updated by minimizing a loss function;
S4-8: Input the state s into the Actor network to obtain the output action a from the deterministic policy π(s; θ), where θ denotes the Actor network parameters, updated by gradient descent;
S4-9: Softly update the target network parameters from the main network parameters:
θ′=τ·θ+(1-τ)·θ′ (7)
μ′=τ·μ+(1-τ)·μ′ (8)
where θ′ and μ′ are the parameters of the Target Actor network and the Target Critic network in the target network, respectively, τ is the soft-update magnitude of the network parameters, and τ ∈ (0, 1);
S4-10: Update the experience replay pool by discarding the oldest experience trajectory and accumulating a new set of experiences;
S4-11: Repeat the training loop until the set number of episodes is reached, at which point training ends.
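As a companion to step S4-4 and the replay construction of steps S3-15 to S3-17, the sketch below shows how a finished trajectory could be stored once with its original goal and again with each sampled auxiliary goal g′, with the reward recomputed against the relabelled goal. The buffer layout, the `achieved_goal` field, and the reuse of the `binary_sparse_reward` helper from the earlier sketch are assumptions for illustration.

```python
def store_with_hindsight(replay_pool, trajectory, goal, aux_goals, priority,
                         reward_fn, traj_energy):
    """Store a trajectory with its original goal and with auxiliary goals (sketch).

    trajectory: list of dicts with keys 'state', 'action', 'next_state',
                'achieved_goal', 'done' (assumed layout).
    reward_fn:  e.g. binary_sparse_reward(achieved_goal, goal) from step S1-3.
    replay_pool: any appendable container, e.g. a plain list, standing in for R.
    """
    for step in trajectory:
        # S3-15: original-goal experience, reward computed against the goal g.
        r = reward_fn(step['achieved_goal'], goal)
        replay_pool.append((step['state'], goal, step['action'], r,
                            step['next_state'], step['done'], priority, traj_energy))

        # S3-16/S3-17: relabelled experience for every sampled auxiliary goal g'.
        for g_aux in aux_goals:
            r_aux = reward_fn(step['achieved_goal'], g_aux)
            replay_pool.append((step['state'], g_aux, step['action'], r_aux,
                                step['next_state'], step['done'], priority, traj_energy))
```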
To better understand the present invention, the sparse-reward-oriented deep reinforcement learning robotic arm grasping method is described in detail below with reference to a specific embodiment.
The simulation scene built in MuJoCo is shown in FIG. 2: a 7-degree-of-freedom robotic arm, model KUKA LBR iiwa 7 R800, equipped with a two-finger gripper, performs a grasping task with an obstacle. The main-network Actor-Critic structure of the DDPG algorithm is built as an MLP. The Actor network contains three fully connected hidden layers of 256 neurons each, with the ReLU function as the nonlinear activation of the hidden layers and a tanh function at the output layer to produce the continuous action policy. The Critic network likewise contains three fully connected hidden layers of 256 neurons each with ReLU activations, and its output layer has a single neuron with no activation function for value prediction. The Actor-Critic structure of the target network is identical to that of the main network; only the network parameters differ (a sketch of this structure follows this paragraph). An experience pool is created whose data take the form of the seven-tuple (s_t||g, a_t, r_t, s_{t+1}||g, done, p, E_traj); the main and target networks are trained by batch sampling from the pool, where s_t||g is the joint representation of the agent's state s_t at time t and the goal g, a_t is the action taken at time t, r_t is the reward obtained after taking action a_t, s_{t+1} is the state reached at time t+1 after taking action a_t in state s_t, done is a Boolean task flag indicating whether the current episode's task is completed, p is the replay priority of the current trajectory, and E_traj is the total energy of the experience trajectory, used to determine the replay priority. The graph model created for the environment information is shown in FIG. 3, and FIG. 4 shows the shortest paths between nodes computed with Dijkstra's algorithm.
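The network structure described in this embodiment (three fully connected hidden layers of 256 ReLU units, a tanh output for the Actor, and a single linear output for the Critic) can be sketched as below. PyTorch is used only for illustration, and the input sizes `obs_dim`, `goal_dim`, and `act_dim` are placeholders for the environment's actual dimensions.

```python
import torch.nn as nn

def mlp(in_dim, out_dim, out_act=None):
    """Three 256-unit ReLU hidden layers, as described in this embodiment."""
    layers = [nn.Linear(in_dim, 256), nn.ReLU(),
              nn.Linear(256, 256), nn.ReLU(),
              nn.Linear(256, 256), nn.ReLU(),
              nn.Linear(256, out_dim)]
    if out_act is not None:
        layers.append(out_act)
    return nn.Sequential(*layers)

obs_dim, goal_dim, act_dim = 25, 3, 4          # placeholder sizes (assumption)

# Actor: state||goal -> continuous action via tanh at the output layer.
actor = mlp(obs_dim + goal_dim, act_dim, out_act=nn.Tanh())
# Critic: (state||goal, action) -> scalar Q value, no output activation.
critic = mlp(obs_dim + goal_dim + act_dim, 1)

# Target networks share the structure and start from the same parameters.
target_actor = mlp(obs_dim + goal_dim, act_dim, out_act=nn.Tanh())
target_actor.load_state_dict(actor.state_dict())
target_critic = mlp(obs_dim + goal_dim + act_dim, 1)
target_critic.load_state_dict(critic.state_dict())
```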
A minibatch B of size batchsize is sampled from the experience replay pool R, the samples in B are sorted by p (the larger the value of p, the higher the replay priority), and the experiences are fed into the network for training in order of replay priority. The state-action pair (s, a) is input into the Critic network to compute its Q value Q(s, a; μ), and the Critic network parameters μ are updated by minimizing a loss function; the state s is input into the Actor network to obtain the output action a from the deterministic policy π(s; θ), and the Actor network parameters θ are updated by gradient descent (a sketch of one such training iteration is given after this passage). The target network parameters are softly updated from the main network parameters according to:
θ′=τ·θ+(1-τ)·θ′ (7)
μ′=τ·μ+(1-τ)·μ′ (8)
The experience replay pool is then updated by discarding the oldest experience trajectory and accumulating a new set of experiences, and training is repeated until the set number of episodes is reached.
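One training iteration of this kind can be sketched as follows, reusing the `actor`, `critic`, and target networks from the previous sketch. The bootstrapped critic target y = r + γ(1 - done)·Q′(s′, π′(s′)) is assumed from the general DDPG algorithm, since the patent only states that the Critic is updated by minimizing a loss function; γ = 0.98 and τ = 0.005 are illustrative values, and the minibatch is assumed to arrive as tensors already sorted by replay priority p.

```python
import torch
import torch.nn.functional as F

def train_step(batch, actor, critic, target_actor, target_critic,
               actor_opt, critic_opt, gamma=0.98, tau=0.005):
    """One DDPG update on a priority-sorted minibatch B (sketch, assumed tensor layout).

    batch: dict of tensors 'sg' (state||goal), 'a', 'r', 'sg_next', 'done'.
    """
    with torch.no_grad():
        a_next = target_actor(batch['sg_next'])
        q_next = target_critic(torch.cat([batch['sg_next'], a_next], dim=-1))
        # Assumed standard DDPG target; the patent only says "minimize a loss function".
        y = batch['r'] + gamma * (1.0 - batch['done']) * q_next

    # Critic update: minimize the error between Q(s, a; mu) and the target y.
    q = critic(torch.cat([batch['sg'], batch['a']], dim=-1))
    critic_loss = F.mse_loss(q, y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: gradient descent on -Q(s, pi(s; theta)), i.e. ascent on Q.
    actor_loss = -critic(torch.cat([batch['sg'], actor(batch['sg'])], dim=-1)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft update of the target networks, equations (7)-(8).
    for tgt, src in ((target_actor, actor), (target_critic, critic)):
        for p_t, p_s in zip(tgt.parameters(), src.parameters()):
            p_t.data.mul_(1.0 - tau).add_(tau * p_s.data)
```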
The training curve obtained after training, shown in FIG. 5, exhibits fast convergence and a high success rate; the target grasping task is accomplished, verifying the feasibility of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (4)

1. A sparse-reward-oriented deep reinforcement learning robotic arm grasping method, characterized by comprising the following steps:
Step 1: analyze the characteristics of the robotic arm grasping task, model it as a Markov decision problem, and design a binary sparse reward function;
Step 2: based on the deep reinforcement learning DDPG algorithm, construct the main network and target network of an Actor-Critic structure;
Step 3: design a hindsight experience replay mechanism and use the G-HGG algorithm to generate auxiliary goals, the specific steps comprising:
S3-1: create an experience replay pool by adding an energy function to a standard experience pool, the pool storing data as the seven-tuple (s_t||g, a_t, r_t, s_{t+1}||g, done, p, E_traj), and train the main network and the target network by batch sampling from the pool, where s_t||g is the joint representation of the agent's state s_t at time t and the goal g, a_t is the action taken by the agent at time t, r_t is the reward obtained after taking action a_t, s_{t+1} is the state reached at time t+1 after taking action a_t in state s_t, done is a Boolean task flag indicating whether the current episode's task is completed, p is the replay priority of the current trajectory, and E_traj is the total energy of the experience trajectory, used to determine the replay priority;
S3-2: create a graph model G = (V, E) for the environment information, where V is the set of nodes in the graph and E is the set of weighted edges, each edge connecting two possible nodes p_1, p_2 with weight ω;
S3-3: use Dijkstra's algorithm to pre-compute the shortest distance between any two nodes (p_1, p_2) ∈ P²;
S3-4: load a set of pre-trained policies and define L goals corresponding to the K pre-trained policies;
S3-5: based on the G-HGG algorithm, construct an auxiliary goal set containing M auxiliary goals;
S3-6: use the auxiliary goals generated in step S3-5 to replace the original goal (s_0, g);
S3-7: at time t, adjust the state and goal to (s′, g′) according to the goal o_l, determine a candidate action from the pre-trained policy, and store this action in the action set A_t;
S3-8: repeat step S3-7 a total of L times;
S3-9: repeat step S3-8 a total of K times;
then use the pre-trained action network for action screening, add exploration noise and an energy function, and accumulate the experience replay pool;
Step 4: build the robotic arm model and scene information, estimate the value function and update the policy from interaction data to obtain a deterministic policy, and complete the target grasping task.
2. The method according to claim 1, characterized in that step 1 specifically includes the following:
according to the characteristics of the robotic arm grasping task, define a four-tuple (S, A, r, γ), specifically:
S1-1: S is the state space, representing the set of information observed by the agent, specifically the positions and velocities of all joints of the robotic arm, the position of the end effector, the target position, and the position and orientation of the object;
S1-2: A is the action space, representing the set of operations executed by the agent, specifically the three-dimensional coordinate increment of the end effector and the opening and closing of the gripper;
S1-3: r is the reward function, used to evaluate the effect of the action executed by the agent and thereby guide the agent toward the desired goal; a binary sparse reward function, equation (1), is adopted, in which the reward indicates success only when d ≤ d_th, where g is the auxiliary goal generated by the algorithm, d is the Euclidean distance from the target object to the auxiliary goal position, and d_th is the minimum threshold of that distance;
S1-4: γ is the discount factor, determining how much importance the agent attaches to future rewards.
3. The method according to claim 2, characterized in that, in step 3, the pre-trained action network is used for action screening, exploration noise and an energy function are added, and the experience replay pool is accumulated, specifically including the following:
S3-10: input the observed state s_t of the robotic arm at time t into the Actor network, output an action according to the deterministic policy π(s; θ), add Gaussian noise N_t with mean 0 and variance σ² to that action to obtain the time-t action, and store this action in the action set A_t;
S3-11: select the optimal action according to equation (2), execute it, and obtain the state s_{t+1} at time t+1:
a* = argmax Q^π(s_t, g, A_t)  (2)
S3-12: repeat steps S3-9, S3-10, and S3-11 a total of T times;
S3-13: define the potential energy function E_p(s_t) = m·g·z_t, where m is the mass of the object, g ≈ 9.81 m/s² is the gravitational acceleration, and z_t is the z-axis coordinate of the object at time t; define the kinetic energy function as equation (3):
E_k(s_t) = (m/2)·[(x_t - x_{t-1})² + (y_t - y_{t-1})² + (z_t - z_{t-1})²]/Δt²  (3)
where (x_t, y_t, z_t) and (x_{t-1}, y_{t-1}, z_{t-1}) are the three-dimensional coordinates of the target object at times t and t-1, respectively, and Δt is the sampling time interval of the robotic arm;
define the state transition energy function as equation (4):
E_tran(s_{t-1}, s_t) = min(E(s_t) - E(s_{t-1}), E_max)  (4)
where E_max denotes the upper bound of the transition energy, a preset value, and E(s_t) is the state energy function, defined as E(s_t) = E_p(s_t) + E_k(s_t);
compute the trajectory energy E_traj from the energy function, defined as equation (5):
E_traj = Σ_{t=1}^{T} E_tran(s_{t-1}, s_t)  (5)
S3-14: compute the replay priority p(T) of the experience trajectory according to equation (6);
S3-15: compute the reward r_t at time t, store the corresponding six-tuple in the experience pool R, and sample a set of auxiliary goals g′ from the auxiliary goal set;
S3-16: compute the auxiliary reward r_t′ at time t and store the auxiliary experience in the experience pool R;
S3-17: repeat step S3-16 for each auxiliary goal g′;
S3-18: repeat steps S3-15 and S3-17 a total of T times.
4. The method according to claim 1, characterized in that step 4 specifically includes the following:
S4-1: set the environment coordinate system O-XYZ, set the initial state of the robotic arm, and determine its workspace;
S4-2: set the scene information of obstacles and target objects;
S4-3: determine the observed state information and action information of the robotic arm, and determine its update time interval;
S4-4: interact with the environment based on the DDPG algorithm, and accumulate the experience replay pool using the experience-pool processing method of step 3;
S4-5: sample a minibatch B of size batchsize from the experience replay pool R;
S4-6: sort the samples in B according to p; the larger the value of p, the higher the replay priority;
S4-7: feed the experiences in B into the network for training in order of replay priority; input the state-action pair (s, a) into the Critic network and compute its Q value Q(s, a; μ), where μ denotes the Critic network parameters, updated by minimizing a loss function;
S4-8: input the state s into the Actor network to obtain the output action a from the deterministic policy π(s; θ), where θ denotes the Actor network parameters, updated by gradient descent;
S4-9: softly update the target network parameters from the main network parameters:
θ′ = τ·θ + (1-τ)·θ′  (7)
μ′ = τ·μ + (1-τ)·μ′  (8)
where θ′ and μ′ are the parameters of the Target Actor network and the Target Critic network in the target network, respectively, τ is the soft-update magnitude of the network parameters, and τ ∈ (0, 1);
S4-10: update the experience replay pool by discarding the oldest experience trajectory and accumulating a new set of experiences;
S4-11: repeat the training loop until the set number of episodes is reached, at which point training ends.
CN202410677163.7A 2024-05-29 2024-05-29 A deep reinforcement learning robotic grasping method for sparse rewards Active CN118493388B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410677163.7A CN118493388B (en) 2024-05-29 2024-05-29 A deep reinforcement learning robotic grasping method for sparse rewards

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410677163.7A CN118493388B (en) 2024-05-29 2024-05-29 A deep reinforcement learning robotic grasping method for sparse rewards

Publications (2)

Publication Number Publication Date
CN118493388A CN118493388A (en) 2024-08-16
CN118493388B true CN118493388B (en) 2025-03-11

Family

ID=92239443

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410677163.7A Active CN118493388B (en) 2024-05-29 2024-05-29 A deep reinforcement learning robotic grasping method for sparse rewards

Country Status (1)

Country Link
CN (1) CN118493388B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119567243A (en) * 2024-10-25 2025-03-07 安徽大学 Human-robot interaction and anti-collision control method for collaborative robots based on admittance
CN119427356B (en) * 2024-11-18 2025-06-24 东莞理工学院 Robot tracking control learning method based on posthoc screening experience playback
CN119871469B (en) * 2025-03-31 2025-06-13 苏州元脑智能科技有限公司 Mechanical arm control method and device, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112102405A (en) * 2020-08-26 2020-12-18 东南大学 Robot stirring-grabbing combined method based on deep reinforcement learning
CN116038691A (en) * 2022-12-08 2023-05-02 南京理工大学 A Continuum Manipulator Motion Control Method Based on Deep Reinforcement Learning

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3747604B1 (en) * 2019-06-07 2022-01-26 Robert Bosch GmbH Robot device controller, robot device arrangement and method for controlling a robot device
CN116494247A (en) * 2023-06-14 2023-07-28 西安电子科技大学广州研究院 Robotic arm path planning method and system based on deep deterministic policy gradient
CN117733841A (en) * 2023-12-06 2024-03-22 南京邮电大学 Mechanical arm complex operation skill learning method and system based on generation of countermeasure imitation learning

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112102405A (en) * 2020-08-26 2020-12-18 东南大学 Robot stirring-grabbing combined method based on deep reinforcement learning
CN116038691A (en) * 2022-12-08 2023-05-02 南京理工大学 A Continuum Manipulator Motion Control Method Based on Deep Reinforcement Learning

Also Published As

Publication number Publication date
CN118493388A (en) 2024-08-16


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant