CN106094516A - Robot adaptive grasping method based on deep reinforcement learning - Google Patents
Robot adaptive grasping method based on deep reinforcement learning
- Publication number
- CN106094516A (application CN201610402319.6A)
- Authority
- CN
- China
- Prior art keywords
- target
- robot
- network
- reinforcement learning
- strategy
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05B—CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
- G05B13/00—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
- G05B13/02—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
- G05B13/04—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators
- G05B13/042—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Computation (AREA)
- Medical Informatics (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Automation & Control Theory (AREA)
- Image Analysis (AREA)
- Manipulator (AREA)
Abstract
The invention provides a robot adaptive grasping method based on deep reinforcement learning. The steps include: while at a certain distance from the target to be grasped, the robot captures photographs of the target with its front-mounted cameras, computes the target's position information from the photographs by the binocular ranging method, and uses the computed position information for robot navigation; when the target comes within the grasping range of the robotic arm, the robot photographs the target again with the front cameras and uses a pre-trained DDPG-based deep reinforcement learning network to perform dimensionality-reducing feature extraction on the photograph; the robot's control strategy is then derived from the feature extraction result, and the robot uses it to control its motion path and the pose of the robotic arm, thereby achieving adaptive grasping of the target. The method can adaptively grasp objects of varying size and shape at unfixed positions and has good prospects for market application.
Description
Technical Field
The invention relates to a method for a robot to grasp objects, in particular to a robot adaptive grasping method based on deep reinforcement learning.
Background Art
Autonomous robots are highly intelligent service robots capable of learning from the external environment. To perform basic activities such as localization, movement, and grasping, a robot must be equipped with a robotic arm and gripper and must fuse information from multiple sensors for machine learning (such as deep learning and reinforcement learning), interacting with the environment to realize perception, decision-making, and action. Most grasping robots today work under the assumption that the size, shape, and position of the object to be grasped are relatively fixed, and their grasping technology relies mainly on sensors such as ultrasonic, infrared, and laser ranging; their range of application is therefore very limited, and they cannot adapt to more complex grasping environments in which the size, shape, and position of the object are not fixed. At present, existing vision-based robot technology struggles to solve the "curse of dimensionality" posed by high-dimensional, data-heavy visual input, and neural networks trained by conventional machine learning converge poorly and cannot process input image information directly. Overall, the control technology of current vision-based grasping service robots has not yet reached satisfactory results, and further optimization is needed, especially in practice.
Summary of the Invention
The technical problem to be solved by the invention is that existing methods cannot adapt to more complex grasping environments in which the size, shape, and position of the object to be grasped are not fixed.
In order to solve the above technical problem, the invention provides a robot adaptive grasping method based on deep reinforcement learning, comprising the following steps:
Step 1: while at a certain distance from the target to be grasped, the robot captures photographs of the target with its front-mounted cameras, computes the target's position information from the photographs by the binocular ranging method, and uses the computed position information for robot navigation;
Step 2: the robot moves according to the navigation; when the target comes within the grasping range of the robotic arm, the robot photographs the target again with the front cameras and uses the pre-trained DDPG-based deep reinforcement learning network to perform dimensionality-reducing feature extraction on the photograph;
Step 3: the robot's control strategy is derived from the feature extraction result, and the robot uses the control strategy to control its motion path and the pose of the robotic arm, thereby achieving adaptive grasping of the target.
As a further limitation of the invention, the specific steps in Step 1 for computing the target's position information from the photographs by the binocular ranging method are:
Step 1.1: obtain the focal length f of the cameras, the center distance T_x between the left and right cameras, and the physical distances x_l and x_r from the projections of the target point on the left and right image planes to the left edge of the respective plane. The left and right image planes are rectangular and lie in the same imaging plane, and the optical-center projections of the two cameras lie at the centers of their respective image planes. The disparity d is then:

$$d = x_l - x_r \qquad (1)$$
Step 1.2: use the principle of similar triangles to build the Q matrix:

$$Q = \begin{bmatrix} 1 & 0 & 0 & -c_x \\ 0 & 1 & 0 & -c_y \\ 0 & 0 & 0 & f \\ 0 & 0 & -1/T_x & (c_x - c_x')/T_x \end{bmatrix} \qquad (2)$$

$$Q \begin{bmatrix} x \\ y \\ d \\ 1 \end{bmatrix} = \begin{bmatrix} X \\ Y \\ Z \\ W \end{bmatrix} \qquad (3)$$

In equations (2) and (3), (X, Y, Z) are the coordinates of the target point in the stereo coordinate system whose origin is the optical center of the left camera, W is the rotation-translation scale factor, (x, y) are the coordinates of the target point in the left image plane, c_x and c_y are the offsets of the coordinate systems of the left and right image planes from the origin of the stereo coordinate system, and c_x' is the corrected value of c_x;
Step 1.3: the spatial distance from the target point to the imaging plane is computed as:

$$Z = \frac{f \, T_x}{d} \qquad (4)$$

The position of the optical center of the left camera is taken as the robot's position, and the coordinate position information (X, Y, Z) of the target point is used as the navigation destination for robot navigation.
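For illustration, equations (1) and (4) together with the similar-triangle relations encoded in the Q matrix can be sketched in a few lines of Python; the calibration numbers below are hypothetical placeholders, not values from the patent:

```python
import numpy as np

# Hypothetical calibration values; the patent does not publish its own.
f = 700.0              # focal length, in pixels
Tx = 60.0              # center distance between the two cameras, e.g. in mm
cx, cy = 320.0, 240.0  # principal-point offsets of the left image plane

def locate_target(x, y, x_l, x_r):
    """Recover (X, Y, Z) of the target point from its pixel (x, y) in the
    left view and its projections x_l, x_r measured from the left edge of
    the respective image planes."""
    d = x_l - x_r          # disparity, equation (1)
    Z = f * Tx / d         # distance to the imaging plane, equation (4)
    X = (x - cx) * Z / f   # lateral offsets by similar triangles,
    Y = (y - cy) * Z / f   # as encoded in the Q matrix of equation (2)
    return X, Y, Z         # used as the robot's navigation destination

print(locate_target(350.0, 250.0, 350.0, 315.0))  # -> (~51.4, ~17.1, 1200.0)
```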
As a further limitation of the invention, the specific steps in Step 2 for performing dimensionality-reducing feature extraction on the photograph with the pre-trained DDPG-based deep reinforcement learning network are:
Step 2.1: since the target grasping process conforms to reinforcement learning and satisfies the Markov property, the set of observations and actions up to time t reduces to:

$$s_t = (x_1, a_1, \ldots, a_{t-1}, x_t) = x_t \qquad (5)$$

In equation (5), x_t and a_t are, respectively, the observation at time t and the action taken at time t;
Step 2.2: describe the expected return of the grasping process with the policy value function:

$$Q^{\pi}(s_t, a_t) = \mathbb{E}\left[R_t \mid s_t, a_t\right] \qquad (6)$$

In equation (6), $R_t = \sum_{i=t}^{T} \gamma^{\,i-t}\, r(s_i, a_i)$ is the discounted sum of future rewards obtained from time t, γ ∈ [0, 1] is the discount factor, r(s_t, a_t) is the reward function at time t, T is the time at which grasping ends, and π is the grasping policy;
Since the target grasping policy π is predetermined and deterministic, it can be written as a function μ: S → A, where S is the state space and A is the N-dimensional action space. Applying the Bellman equation to equation (6) gives:

$$Q^{\mu}(s_t, a_t) = \mathbb{E}_{s_{t+1} \sim E}\left[\, r(s_t, a_t) + \gamma\, Q^{\mu}\!\big(s_{t+1}, \mu(s_{t+1})\big) \right] \qquad (7)$$

In equation (7), s_{t+1} ∼ E indicates that the observation at time t+1 is obtained from the environment E, and μ(s_{t+1}) is the action to which the observation at time t+1 is mapped by the function μ;
Step 2.3: following the principle of maximum likelihood estimation, update the policy evaluation network Q(s, a | θ^Q), whose network weight parameters are θ^Q, by minimizing the loss function:

$$L(\theta^{Q}) = \mathbb{E}_{\mu'}\left[\big(Q(s_t, a_t \mid \theta^{Q}) - y_t\big)^{2}\right] \qquad (8)$$

In equation (8), $y_t = r(s_t, a_t) + \gamma\, Q\big(s_{t+1}, \mu(s_{t+1}) \mid \theta^{Q}\big)$ is given by the target policy evaluation network, and μ' is the target policy;
Step 2.4: for the actual policy function μ(s | θ^μ) with parameters θ^μ, the gradient obtained by the chain rule is:

$$\nabla_{\theta^{\mu}} J \approx \mathbb{E}_{\mu'}\left[\, \nabla_{a} Q(s, a \mid \theta^{Q})\big|_{s=s_t,\, a=\mu(s_t)} \; \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\big|_{s=s_t} \right] \qquad (9)$$

The gradient computed by equation (9) is the policy gradient, which is then used to update the policy function μ(s | θ^μ);
Step 2.5: train the network with an off-policy algorithm. The sample data used in training are drawn from a single sample buffer to minimize the correlation between samples, and a target Q-value network is used to train the neural network; that is, the experience replay mechanism and the target Q-value network method are adopted. The target networks are updated slowly:

$$\theta^{Q'} \leftarrow \tau\, \theta^{Q} + (1 - \tau)\, \theta^{Q'} \qquad (10)$$

$$\theta^{\mu'} \leftarrow \tau\, \theta^{\mu} + (1 - \tau)\, \theta^{\mu'} \qquad (11)$$

In equations (10) and (11), τ is the update rate, with τ ≪ 1. This constructs a DDPG-based deep reinforcement learning network, and one that converges;
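As a concrete illustration, one training update implementing equations (8) through (11) might look as follows in Python with PyTorch; the framework choice and the hyperparameter values are assumptions, since the patent specifies neither:

```python
import torch
import torch.nn.functional as F

GAMMA, TAU = 0.99, 0.001   # discount factor; update rate tau << 1 (assumed values)

def ddpg_update(batch, actor, critic, actor_target, critic_target,
                actor_opt, critic_opt):
    """One off-policy update on a batch drawn from the sample buffer."""
    s, a, r, s_next = batch   # states, actions, rewards, next states (tensors)

    # Policy evaluation network: minimize the loss of equation (8), with
    # y_t computed from the slowly-updated target networks of step 2.5.
    with torch.no_grad():
        y = r + GAMMA * critic_target(s_next, actor_target(s_next))
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Policy function: follow the chain-rule policy gradient of equation (9).
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Target networks: the slow updates of equations (10) and (11).
    for net, tgt in ((critic, critic_target), (actor, actor_target)):
        for p, p_tgt in zip(net.parameters(), tgt.parameters()):
            p_tgt.data.mul_(1.0 - TAU).add_(TAU * p.data)
```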
Step 2.6: use the constructed deep reinforcement learning network to perform dimensionality-reducing feature extraction on the photograph and obtain the robot's control strategy.
As a further limitation of the invention, the deep reinforcement learning network of step 2.6 consists of one image input layer, two convolutional layers, two fully connected layers, and one output layer. The image input layer receives the image containing the object to be grasped; the convolutional layers extract features, i.e., a deep representation of the image; the fully connected layers and the output layer form a deep network which, after training, maps input feature information to control commands, namely the servo angles of the robot's mechanical arm and the rotational speed of the DC motor driving the base car. Choosing two convolutional layers and two fully connected layers both extracts image features effectively and keeps the neural network easy to converge during training.
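A minimal sketch of this layer arrangement in PyTorch follows; the patent fixes the layer counts but not the image resolution, kernel sizes, or channel widths, so those are assumptions here (grayscale 84×84 input, five outputs: four servo angles plus one motor speed):

```python
import torch
import torch.nn as nn

class GraspPolicyNet(nn.Module):
    """Image input layer -> two conv layers -> two fully connected layers
    -> output layer, as described in step 2.6."""
    def __init__(self, n_actions=5):
        super().__init__()
        self.conv = nn.Sequential(   # convolutional layers: feature extraction
            nn.Conv2d(1, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
        )
        self.fc = nn.Sequential(     # deep network mapping features to commands
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 200), nn.ReLU(),
            nn.Linear(200, n_actions), nn.Tanh(),   # bounded control output
        )

    def forward(self, img):          # img: (batch, 1, 84, 84) grayscale photo
        return self.fc(self.conv(img))

policy = GraspPolicyNet()
print(policy(torch.zeros(1, 1, 84, 84)).shape)   # torch.Size([1, 5])
```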
The beneficial effects of the invention are: (1) using the experience replay mechanism and random sampling to determine the input images during pre-training effectively resolves the problem that consecutive photos are strongly correlated and thus fail the neural network's requirement that input data be mutually independent; (2) dimensionality reduction is achieved through deep learning, and the target Q-value network method continuously adjusts the neural network's weight matrices, ensuring as far as possible that the trained network converges; (3) the trained DDPG-based deep reinforcement learning neural network performs dimensionality reduction and object feature extraction and directly yields the robot's motion control strategy, effectively solving the "curse of dimensionality" problem.
Brief Description of the Drawings
Figure 1 is a schematic diagram of the system structure of the invention;
Figure 2 is a flowchart of the method of the invention;
Figure 3 is a plan view of the binocular ranging method of the invention;
Figure 4 is a perspective view of the binocular ranging technique of the invention;
Figure 5 is a schematic diagram of the composition of the DDPG-based deep reinforcement learning network of the invention.
Detailed Description
As shown in Figure 1, the robot adaptive grasping system based on the deep reinforcement learning method of the invention comprises an image processing system, a wireless communication system, and a robot motion system.
The image processing system consists mainly of cameras mounted on the front of the robot together with matlab software; the wireless communication system consists mainly of a WIFI module; the robot motion system consists mainly of a base car and a robotic arm. First, a deep reinforcement learning network based on DDPG (deep deterministic policy gradient) must be pre-trained on a dynamics simulation platform; during this process the experience replay mechanism and the target Q-value network are typically used to ensure that the DDPG-based network converges during pre-training. The image processing system then acquires an image of the target object and transmits the image information to the computer through the wireless communication system. While the robot is still far from the object to be grasped, binocular ranging is used to obtain the position information of the target object for use in robot navigation.
When the robot has moved to where the robotic arm can reach the object, another photograph of the object is taken, and the trained DDPG-based deep reinforcement learning network performs dimensionality reduction and feature extraction and outputs the robot's control strategy. Finally, the control strategy is transmitted through the wireless communication system to the robot motion system to control the robot's motion state and grasp the target object accurately.
During pre-training, matlab software first converts the RGB image of the target object to a grayscale image; the experience replay mechanism is then used to make the correlation between consecutive photos as small as possible, meeting the neural network's requirement that input data be mutually independent; finally, the images fed to the neural network are obtained by random sampling. Dimensionality reduction is achieved through deep learning, and the target Q-value network method continuously adjusts the network's weight matrices, ultimately producing a convergent neural network.
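A rough sketch of this pre-training data path (grayscale conversion, a shared sample buffer, random sampling) in Python; the buffer capacity and batch size are illustrative assumptions:

```python
import random
from collections import deque
import numpy as np

buffer = deque(maxlen=100_000)   # shared sample buffer for experience replay

def to_gray(rgb):
    """Convert an RGB photo of shape (H, W, 3) to a grayscale image."""
    return rgb @ np.array([0.299, 0.587, 0.114])

def store(obs_rgb, action, reward, next_obs_rgb):
    buffer.append((to_gray(obs_rgb), action, reward, to_gray(next_obs_rgb)))

def sample_batch(batch_size=64):
    """Random sampling keeps consecutive, highly correlated photos apart,
    so the network's inputs are closer to mutually independent."""
    return random.sample(list(buffer), batch_size)
```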
The robot is controlled by an Arduino board with a built-in WIFI module; the robotic arm consists of four servos, giving four degrees of freedom in total, and the base car is driven by a DC motor. The image processing system is based mainly on the cameras, their image transmission software, and matlab; photos of the target object taken by the cameras are transmitted to the computer by the WIFI module on the Arduino board and processed in matlab.
In operation, the system proceeds through the following steps:
Step 1: pre-train the DDPG (deep deterministic policy gradient)-based deep reinforcement learning network on a dynamics simulation platform, typically using the experience replay mechanism and the target Q-value network to ensure that the network converges during pre-training;
Step 2: acquire an image of the target object with the cameras mounted on the front of the robot, and transmit the image information to the computer via the WIFI module;
Step 3: while the robot is still far from the object to be grasped, use binocular ranging to obtain the position information of the target object and use it for robot navigation;
Step 4: when the robot has moved to where the robotic arm can reach the object, photograph the object again and use the trained DDPG-based deep reinforcement learning network to perform dimensionality reduction and feature extraction and output the robot's control strategy;
Step 5: transmit the control information to the robot motion system via the WIFI module to grasp the target object accurately.
As shown in Figures 3 and 4, binocular ranging exploits the fact that the distance from the target point to the imaging plane is inversely proportional to the difference between the horizontal coordinates at which the target point is imaged in the left and right views (i.e., the disparity). In general, the focal length is measured in pixels; the unit of the camera center distance is determined by the actual size of the calibration-board checkerboard and the value we enter, usually millimeters (set at the 0.1 mm level to improve precision); and the disparity is also measured in pixels. The pixel units in the numerator and denominator therefore cancel, and the distance from the target point to the imaging plane has the same unit as the camera center distance.
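As a worked example with illustrative numbers only (the patent publishes no calibration values): with a focal length of 700 pixels, a camera center distance of 60 mm, and a disparity of 35 pixels, equation (4) gives Z = 700 × 60 / 35 = 1200 mm; the pixel units cancel, and the distance carries the unit of the center distance.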
As shown in Figure 5, the DDPG-based deep reinforcement learning network consists mainly of one image input layer, two convolutional layers, two fully connected layers, and one output layer. The deep network architecture performs dimensionality reduction, the convolutional layers extract features, and the output layer outputs control information.
As shown in Figure 2, the invention provides a robot adaptive grasping method based on deep reinforcement learning, comprising the following steps:
Step 1: while at a certain distance from the target to be grasped, the robot captures photographs of the target with its front-mounted cameras, computes the target's position information from the photographs by the binocular ranging method, and uses the computed position information for robot navigation;
Step 2: the robot moves according to the navigation; when the target comes within the grasping range of the robotic arm, the robot photographs the target again with the front cameras and uses the pre-trained DDPG-based deep reinforcement learning network to perform dimensionality-reducing feature extraction on the photograph;
Step 3: the robot's control strategy is derived from the feature extraction result, and the robot uses the control strategy to control its motion path and the pose of the robotic arm, thereby achieving adaptive grasping of the target.
The specific steps in Step 1 for computing the target's position information from the photographs by the binocular ranging method are:
Step 1.1: obtain the focal length f of the cameras, the center distance T_x between the left and right cameras, and the physical distances x_l and x_r from the projections of the target point on the left and right image planes to the left edge of the respective plane. The left and right image planes are rectangular and lie in the same imaging plane, and the optical-center projections of the two cameras lie at the centers of their respective image planes, i.e., at the projections of O_l and O_r on the imaging plane. The disparity d is then:

$$d = x_l - x_r \qquad (1)$$
Step 1.2: use the principle of similar triangles to build the Q matrix:

$$Q = \begin{bmatrix} 1 & 0 & 0 & -c_x \\ 0 & 1 & 0 & -c_y \\ 0 & 0 & 0 & f \\ 0 & 0 & -1/T_x & (c_x - c_x')/T_x \end{bmatrix} \qquad (2)$$

$$Q \begin{bmatrix} x \\ y \\ d \\ 1 \end{bmatrix} = \begin{bmatrix} X \\ Y \\ Z \\ W \end{bmatrix} \qquad (3)$$

In equations (2) and (3), (X, Y, Z) are the coordinates of the target point in the stereo coordinate system whose origin is the optical center of the left camera, W is the rotation-translation scale factor, (x, y) are the coordinates of the target point in the left image plane, c_x and c_y are the offsets of the coordinate systems of the left and right image planes from the origin of the stereo coordinate system, and c_x' is the corrected value of c_x (the two values generally differ little, and in the invention they may be regarded as approximately equal);
Step 1.3: the spatial distance from the target point to the imaging plane is computed as:

$$Z = \frac{f \, T_x}{d} \qquad (4)$$

The position of the optical center of the left camera is taken as the robot's position, and the coordinate position information (X, Y, Z) of the target point is used as the navigation destination for robot navigation.
The specific steps in Step 2 for performing dimensionality-reducing feature extraction on the photograph with the pre-trained DDPG-based deep reinforcement learning network are:
Step 2.1: since the target grasping process conforms to reinforcement learning and satisfies the Markov property, the set of observations and actions up to time t reduces to:

$$s_t = (x_1, a_1, \ldots, a_{t-1}, x_t) = x_t \qquad (5)$$

In equation (5), x_t and a_t are, respectively, the observation at time t and the action taken at time t;
Step 2.2: describe the expected return of the grasping process with the policy value function:

$$Q^{\pi}(s_t, a_t) = \mathbb{E}\left[R_t \mid s_t, a_t\right] \qquad (6)$$

In equation (6), $R_t = \sum_{i=t}^{T} \gamma^{\,i-t}\, r(s_i, a_i)$ is the discounted sum of future rewards obtained from time t, γ ∈ [0, 1] is the discount factor, r(s_t, a_t) is the reward function at time t, T is the time at which grasping ends, and π is the grasping policy;
Since the target grasping policy π is predetermined and deterministic, it can be written as a function μ: S → A, where S is the state space and A is the N-dimensional action space. Applying the Bellman equation to equation (6) gives:

$$Q^{\mu}(s_t, a_t) = \mathbb{E}_{s_{t+1} \sim E}\left[\, r(s_t, a_t) + \gamma\, Q^{\mu}\!\big(s_{t+1}, \mu(s_{t+1})\big) \right] \qquad (7)$$

In equation (7), s_{t+1} ∼ E indicates that the observation at time t+1 is obtained from the environment E, and μ(s_{t+1}) is the action to which the observation at time t+1 is mapped by the function μ;
Step 2.3: following the principle of maximum likelihood estimation, update the policy evaluation network Q(s, a | θ^Q), whose network weight parameters are θ^Q, by minimizing the loss function:

$$L(\theta^{Q}) = \mathbb{E}_{\mu'}\left[\big(Q(s_t, a_t \mid \theta^{Q}) - y_t\big)^{2}\right] \qquad (8)$$

In equation (8), $y_t = r(s_t, a_t) + \gamma\, Q\big(s_{t+1}, \mu(s_{t+1}) \mid \theta^{Q}\big)$ is given by the target policy evaluation network, and μ' is the target policy;
Step 2.4: for the actual policy function μ(s | θ^μ) with parameters θ^μ, the gradient obtained by the chain rule is:

$$\nabla_{\theta^{\mu}} J \approx \mathbb{E}_{\mu'}\left[\, \nabla_{a} Q(s, a \mid \theta^{Q})\big|_{s=s_t,\, a=\mu(s_t)} \; \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\big|_{s=s_t} \right] \qquad (9)$$

The gradient computed by equation (9) is the policy gradient, which is then used to update the policy function μ(s | θ^μ);
Step 2.5: train the network with an off-policy algorithm. The sample data used in training are drawn from a single sample buffer to minimize the correlation between samples, and a target Q-value network is used to train the neural network; that is, the experience replay mechanism and the target Q-value network method are adopted. The target networks are updated slowly:

$$\theta^{Q'} \leftarrow \tau\, \theta^{Q} + (1 - \tau)\, \theta^{Q'} \qquad (10)$$

$$\theta^{\mu'} \leftarrow \tau\, \theta^{\mu} + (1 - \tau)\, \theta^{\mu'} \qquad (11)$$

In equations (10) and (11), τ is the update rate, with τ ≪ 1. This constructs a DDPG-based deep reinforcement learning network, and one that converges;
Step 2.6: use the constructed deep reinforcement learning network to perform dimensionality-reducing feature extraction on the photograph and obtain the robot's control strategy. The deep reinforcement learning network consists of one image input layer, two convolutional layers, two fully connected layers, and one output layer; two convolutional layers and two fully connected layers are chosen so that image features can be extracted effectively while keeping the neural network easy to converge during training. The image input layer receives the image containing the object to be grasped; the convolutional layers extract features, i.e., a deep representation of the image, such as lines, edges, and arcs; the fully connected layers and the output layer form a deep network which, after training, maps input feature information to control commands, namely the servo angles of the robot's mechanical arm and the rotational speed of the DC motor driving the base car.
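For illustration, the mapping from the network's bounded output to the four servo angles and the DC-motor speed could look as follows; the ranges and the five-dimensional action layout are hypothetical, since the patent only states that the commands are sent to the Arduino board over WIFI:

```python
import numpy as np

SERVO_RANGE = (0.0, 180.0)     # assumed servo angle range, in degrees
MOTOR_MAX = 255.0              # assumed DC-motor PWM magnitude

def action_to_command(action):
    """Scale a tanh output in [-1, 1]^5 to 4 servo angles and 1 motor speed."""
    a = np.clip(np.asarray(action, dtype=float), -1.0, 1.0)
    lo, hi = SERVO_RANGE
    servo_angles = (a[:4] + 1.0) / 2.0 * (hi - lo) + lo
    motor_speed = a[4] * MOTOR_MAX
    return servo_angles, motor_speed   # serialized and sent over WIFI
```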
When pre-training the neural network, the invention uses the experience replay mechanism and random sampling to determine the input images, which effectively resolves the problem that consecutive photos are strongly correlated and thus fail the neural network's requirement that input data be mutually independent. Dimensionality reduction is achieved through deep learning, and the target Q-value network method continuously adjusts the neural network's weight matrices, ensuring as far as possible that the trained network converges. The trained DDPG-based deep reinforcement learning neural network performs dimensionality reduction and object feature extraction and directly yields the robot's motion control strategy, effectively solving the "curse of dimensionality" problem.
Claims (4)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610402319.6A CN106094516A (en) | 2016-06-08 | 2016-06-08 | Robot adaptive grasping method based on deep reinforcement learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610402319.6A CN106094516A (en) | 2016-06-08 | 2016-06-08 | Robot adaptive grasping method based on deep reinforcement learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106094516A (en) | 2016-11-09 |
Family
ID=57228280
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610402319.6A Pending CN106094516A (en) | 2016-06-08 | 2016-06-08 | A kind of robot self-adapting grasping method based on deeply study |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106094516A (en) |
Patent Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080133053A1 (en) * | 2006-11-29 | 2008-06-05 | Honda Motor Co., Ltd. | Determination of Foot Placement for Humanoid Push Recovery |
CN102521205A (en) * | 2011-11-23 | 2012-06-27 | 河海大学常州校区 | Multi-Agent based robot combined search system by reinforcement learning |
CN102902271A (en) * | 2012-10-23 | 2013-01-30 | 上海大学 | Binocular vision-based robot target identifying and gripping system and method |
CN203390936U (en) * | 2013-04-26 | 2014-01-15 | 上海锡明光电科技有限公司 | Self-adaption automatic robotic system realizing dynamic and real-time capture function |
CN105637540A (en) * | 2013-10-08 | 2016-06-01 | 谷歌公司 | Methods and apparatus for reinforcement learning |
CN104778721A (en) * | 2015-05-08 | 2015-07-15 | 哈尔滨工业大学 | Distance measuring method of significant target in binocular image |
CN105137967A (en) * | 2015-07-16 | 2015-12-09 | 北京工业大学 | Mobile robot path planning method with combination of depth automatic encoder and Q-learning algorithm |
CN105115497A (en) * | 2015-09-17 | 2015-12-02 | 南京大学 | A reliable indoor mobile robot precise navigation and positioning system and method |
CN105425828A (en) * | 2015-11-11 | 2016-03-23 | 山东建筑大学 | Robot anti-impact double-arm coordination control system based on sensor fusion technology |
CN105459136A (en) * | 2015-12-29 | 2016-04-06 | 上海帆声图像科技有限公司 | Robot vision grasping method |
Non-Patent Citations (3)
Title |
---|
Timothy P. Lillicrap et al.: "Continuous Control with Deep Reinforcement Learning", Google DeepMind, ICLR 2016 *
Shi Zhongzhi: "Mind Computation", 31 August 2015, Tsinghua University Press *
Chen Qiang: "Three-Dimensional Reconstruction Based on Binocular Stereo Vision", Graphics and Image *
Cited By (91)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107168110A (en) * | 2016-12-09 | 2017-09-15 | 陈胜辉 | A kind of material grasping means and system |
CN106600650A (en) * | 2016-12-12 | 2017-04-26 | 杭州蓝芯科技有限公司 | Binocular visual sense depth information obtaining method based on deep learning |
CN106780605A (en) * | 2016-12-20 | 2017-05-31 | 芜湖哈特机器人产业技术研究院有限公司 | A kind of detection method of the object crawl position based on deep learning robot |
CN106737673A (en) * | 2016-12-23 | 2017-05-31 | 浙江大学 | A kind of method of the control of mechanical arm end to end based on deep learning |
CN106737673B (en) * | 2016-12-23 | 2019-06-18 | 浙江大学 | A method of the control of mechanical arm end to end based on deep learning |
CN106873585A (en) * | 2017-01-18 | 2017-06-20 | 无锡辰星机器人科技有限公司 | One kind navigation method for searching, robot and system |
CN107186708B (en) * | 2017-04-25 | 2020-05-12 | 珠海智卓投资管理有限公司 | Hand-eye servo robot grabbing system and method based on deep learning image segmentation technology |
CN107186708A (en) * | 2017-04-25 | 2017-09-22 | 江苏安格尔机器人有限公司 | Trick servo robot grasping system and method based on deep learning image Segmentation Technology |
CN107092254B (en) * | 2017-04-27 | 2019-11-29 | 北京航空航天大学 | A kind of design method of the Household floor-sweeping machine device people based on depth enhancing study |
CN107092254A (en) * | 2017-04-27 | 2017-08-25 | 北京航空航天大学 | A kind of design method for the Household floor-sweeping machine device people for strengthening study based on depth |
CN106970594A (en) * | 2017-05-09 | 2017-07-21 | 京东方科技集团股份有限公司 | A kind of method for planning track of flexible mechanical arm |
CN106970594B (en) * | 2017-05-09 | 2019-02-12 | 京东方科技集团股份有限公司 | A kind of method for planning track of flexible mechanical arm |
CN107139179B (en) * | 2017-05-26 | 2020-05-29 | 西安电子科技大学 | A kind of intelligent service robot and working method |
CN107139179A (en) * | 2017-05-26 | 2017-09-08 | 西安电子科技大学 | A kind of intellect service robot and method of work |
US11554483B2 (en) | 2017-06-19 | 2023-01-17 | Google Llc | Robotic grasping prediction using neural networks and geometry aware object representation |
CN110691676A (en) * | 2017-06-19 | 2020-01-14 | 谷歌有限责任公司 | Robot crawling prediction using neural networks and geometrically-aware object representations |
US11150655B2 (en) | 2017-06-30 | 2021-10-19 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Method and system for training unmanned aerial vehicle control model based on artificial intelligence |
CN107479368A (en) * | 2017-06-30 | 2017-12-15 | 北京百度网讯科技有限公司 | A kind of method and system of the training unmanned aerial vehicle (UAV) control model based on artificial intelligence |
CN107367929A (en) * | 2017-07-19 | 2017-11-21 | 北京上格云技术有限公司 | Update method, storage medium and the terminal device of Q value matrixs |
CN109407603B (en) * | 2017-08-16 | 2020-03-06 | 北京猎户星空科技有限公司 | Method and device for controlling mechanical arm to grab object |
CN109407603A (en) * | 2017-08-16 | 2019-03-01 | 北京猎户星空科技有限公司 | A kind of method and device of control mechanical arm crawl object |
CN108305275A (en) * | 2017-08-25 | 2018-07-20 | 深圳市腾讯计算机系统有限公司 | Active tracking method, apparatus and system |
CN107562052A (en) * | 2017-08-30 | 2018-01-09 | 唐开强 | A kind of Hexapod Robot gait planning method based on deeply study |
CN107450593B (en) * | 2017-08-30 | 2020-06-12 | 清华大学 | Unmanned aerial vehicle autonomous navigation method and system |
CN107450593A (en) * | 2017-08-30 | 2017-12-08 | 清华大学 | A kind of unmanned plane autonomous navigation method and system |
CN107450555A (en) * | 2017-08-30 | 2017-12-08 | 唐开强 | A kind of Hexapod Robot real-time gait planing method based on deeply study |
CN107748566B (en) * | 2017-09-20 | 2020-04-24 | 清华大学 | Underwater autonomous robot fixed depth control method based on reinforcement learning |
CN107748566A (en) * | 2017-09-20 | 2018-03-02 | 清华大学 | A kind of underwater autonomous robot constant depth control method based on intensified learning |
CN107479501A (en) * | 2017-09-28 | 2017-12-15 | 广州智能装备研究院有限公司 | 3D parts suction methods based on deep learning |
CN108051999A (en) * | 2017-10-31 | 2018-05-18 | 中国科学技术大学 | Accelerator beam path control method and system based on deeply study |
CN109807882B (en) * | 2017-11-20 | 2022-09-16 | 株式会社安川电机 | Gripping system, learning device, and gripping method |
US11338435B2 (en) | 2017-11-20 | 2022-05-24 | Kabushiki Kaisha Yaskawa Denki | Gripping system with machine learning |
CN109807882A (en) * | 2017-11-20 | 2019-05-28 | 株式会社安川电机 | Holding system, learning device and holding method |
CN108052004B (en) * | 2017-12-06 | 2020-11-10 | 湖北工业大学 | Automatic control method of industrial robotic arm based on deep reinforcement learning |
CN108052004A (en) * | 2017-12-06 | 2018-05-18 | 湖北工业大学 | Industrial machinery arm autocontrol method based on depth enhancing study |
CN109909998A (en) * | 2017-12-12 | 2019-06-21 | 北京猎户星空科技有限公司 | A kind of method and device controlling manipulator motion |
CN109909998B (en) * | 2017-12-12 | 2020-10-02 | 北京猎户星空科技有限公司 | Method and device for controlling movement of mechanical arm |
CN108321795A (en) * | 2018-01-19 | 2018-07-24 | 上海交通大学 | Start-stop of generator set configuration method based on depth deterministic policy algorithm and system |
CN108321795B (en) * | 2018-01-19 | 2021-01-22 | 上海交通大学 | Start-stop configuration method and system for generator set based on deep deterministic strategy algorithm |
US11887000B2 (en) | 2018-02-09 | 2024-01-30 | Deepmind Technologies Limited | Distributional reinforcement learning using quantile function neural networks |
EP3701432A1 (en) * | 2018-02-09 | 2020-09-02 | DeepMind Technologies Limited | Distributional reinforcement learning using quantile function neural networks |
WO2019155061A1 (en) * | 2018-02-09 | 2019-08-15 | Deepmind Technologies Limited | Distributional reinforcement learning using quantile function neural networks |
US11610118B2 (en) | 2018-02-09 | 2023-03-21 | Deepmind Technologies Limited | Distributional reinforcement learning using quantile function neural networks |
US12205032B2 (en) | 2018-02-09 | 2025-01-21 | Deepmind Technologies Limited | Distributional reinforcement learning using quantile function neural networks |
CN108415254B (en) * | 2018-03-12 | 2020-12-11 | 苏州大学 | Control method of waste recycling robot based on deep Q network |
CN108594804B (en) * | 2018-03-12 | 2021-06-18 | 苏州大学 | Automatic driving control method of delivery car based on deep Q network |
CN108594804A (en) * | 2018-03-12 | 2018-09-28 | 苏州大学 | Automatic driving control method for distribution trolley based on deep Q network |
CN108415254A (en) * | 2018-03-12 | 2018-08-17 | 苏州大学 | Waste recycling robot control method and device based on deep Q network |
CN108536011A (en) * | 2018-03-19 | 2018-09-14 | 中山大学 | A kind of Hexapod Robot complicated landform adaptive motion control method based on deeply study |
CN110293549A (en) * | 2018-03-21 | 2019-10-01 | 北京猎户星空科技有限公司 | Mechanical arm control method, device and neural network model training method, device |
CN110293549B (en) * | 2018-03-21 | 2021-06-22 | 北京猎户星空科技有限公司 | Mechanical arm control method and device and neural network model training method and device |
CN110427021A (en) * | 2018-05-01 | 2019-11-08 | 本田技研工业株式会社 | System and method for generating automatic driving vehicle intersection navigation instruction |
CN110427021B (en) * | 2018-05-01 | 2024-04-12 | 本田技研工业株式会社 | System and method for generating navigation instructions for an autonomous vehicle intersection |
CN108873687A (en) * | 2018-07-11 | 2018-11-23 | 哈尔滨工程大学 | A kind of Intelligent Underwater Robot behavior system knot planing method based on depth Q study |
CN109344877A (en) * | 2018-08-31 | 2019-02-15 | 深圳先进技术研究院 | A sample data processing method, sample data processing device and electronic equipment |
CN109344877B (en) * | 2018-08-31 | 2020-12-11 | 深圳先进技术研究院 | A sample data processing method, sample data processing device and electronic equipment |
CN109116854A (en) * | 2018-09-16 | 2019-01-01 | 南京大学 | A kind of robot cooperated control method of multiple groups based on intensified learning and control system |
CN109523029B (en) * | 2018-09-28 | 2020-11-03 | 清华大学深圳研究生院 | Self-adaptive double-self-driven depth certainty strategy gradient reinforcement learning method |
CN109523029A (en) * | 2018-09-28 | 2019-03-26 | 清华大学深圳研究生院 | For the adaptive double from driving depth deterministic policy Gradient Reinforcement Learning method of training smart body |
CN109063827B (en) * | 2018-10-25 | 2022-03-04 | 电子科技大学 | Method, system, storage medium and terminal for automatically taking specific luggage in limited space |
CN109063827A (en) * | 2018-10-25 | 2018-12-21 | 电子科技大学 | It takes automatically in the confined space method, system, storage medium and the terminal of specific luggage |
CN109358628A (en) * | 2018-11-06 | 2019-02-19 | 江苏木盟智能科技有限公司 | A kind of container alignment method and robot |
CN109483534A (en) * | 2018-11-08 | 2019-03-19 | 腾讯科技(深圳)有限公司 | A kind of grasping body methods, devices and systems |
US10926416B2 (en) | 2018-11-21 | 2021-02-23 | Ford Global Technologies, Llc | Robotic manipulation using an independently actuated vision system, an adversarial control scheme, and a multi-tasking deep learning architecture |
CN111347411A (en) * | 2018-12-20 | 2020-06-30 | 中国科学院沈阳自动化研究所 | 3D visual recognition and grasping method of dual-arm collaborative robot based on deep learning |
CN111347411B (en) * | 2018-12-20 | 2023-01-24 | 中国科学院沈阳自动化研究所 | Three-dimensional visual recognition and grasping method of dual-arm collaborative robot based on deep learning |
CN109760046A (en) * | 2018-12-27 | 2019-05-17 | 西北工业大学 | Motion planning method for capturing rolling target of space robot based on reinforcement learning |
WO2020134254A1 (en) * | 2018-12-27 | 2020-07-02 | 南京芊玥机器人科技有限公司 | Method employing reinforcement learning to optimize trajectory of spray painting robot |
CN110323981A (en) * | 2019-05-14 | 2019-10-11 | 广东省智能制造研究所 | A kind of method and system controlling permanent magnetic linear synchronous motor |
CN110202583A (en) * | 2019-07-09 | 2019-09-06 | 华南理工大学 | A kind of Apery manipulator control system and its control method based on deep learning |
CN110400345B (en) * | 2019-07-24 | 2021-06-15 | 西南科技大学 | A push-grab collaborative sorting method for radioactive waste based on deep reinforcement learning |
CN110400345A (en) * | 2019-07-24 | 2019-11-01 | 西南科技大学 | Push and Grab Collaborative Sorting Method for Radioactive Waste Based on Deep Reinforcement Learning |
CN110328668B (en) * | 2019-07-27 | 2022-03-22 | 南京理工大学 | Path Planning Method of Robot Arm Based on Velocity Smooth Deterministic Policy Gradient |
CN110328668A (en) * | 2019-07-27 | 2019-10-15 | 南京理工大学 | Robotic arm path planing method based on rate smoothing deterministic policy gradient |
CN110394804B (en) * | 2019-08-26 | 2022-08-12 | 山东大学 | A robot control method, controller and system based on layered thread framework |
CN110394804A (en) * | 2019-08-26 | 2019-11-01 | 山东大学 | A robot control method, controller and system based on layered thread framework |
CN110722556A (en) * | 2019-10-17 | 2020-01-24 | 苏州恒辉科技有限公司 | Movable mechanical arm control system and method based on reinforcement learning |
CN112757284B (en) * | 2019-10-21 | 2024-03-22 | 佳能株式会社 | Robot control device, method, and storage medium |
CN112757284A (en) * | 2019-10-21 | 2021-05-07 | 佳能株式会社 | Robot control apparatus, method and storage medium |
CN111618847B (en) * | 2020-04-22 | 2022-06-21 | 南通大学 | Autonomous grasping method of robotic arm based on deep reinforcement learning and dynamic motion primitives |
CN111618847A (en) * | 2020-04-22 | 2020-09-04 | 南通大学 | Mechanical arm autonomous grabbing method based on deep reinforcement learning and dynamic motion elements |
CN112347900B (en) * | 2020-11-04 | 2022-10-14 | 中国海洋大学 | An automatic grasping method of monocular vision underwater target based on distance estimation |
CN112347900A (en) * | 2020-11-04 | 2021-02-09 | 中国海洋大学 | Monocular vision underwater target automatic grabbing method based on distance estimation |
CN112734759A (en) * | 2021-03-30 | 2021-04-30 | 常州微亿智造科技有限公司 | Method and device for determining trigger point of flying shooting |
CN113836788A (en) * | 2021-08-24 | 2021-12-24 | 浙江大学 | Acceleration method for flow industry reinforcement learning control based on local data enhancement |
CN113836788B (en) * | 2021-08-24 | 2023-10-27 | 浙江大学 | Acceleration method for flow industrial reinforcement learning control based on local data enhancement |
CN114454160A (en) * | 2021-12-31 | 2022-05-10 | 中国人民解放军国防科技大学 | Mechanical arm grabbing control method and system based on kernel least square soft Bellman residual reinforcement learning |
CN114454160B (en) * | 2021-12-31 | 2024-04-16 | 中国人民解放军国防科技大学 | Mechanical arm grabbing control method and system based on kernel least square soft Belman residual error reinforcement learning |
CN115890726A (en) * | 2022-11-10 | 2023-04-04 | 大连理工大学 | A Method for Generating Parallel Fixture Shapes Based on DDPG Reinforcement Learning Algorithm |
CN115890726B (en) * | 2022-11-10 | 2025-05-20 | 大连理工大学 | Parallel clamp shape generation method based on DDPG reinforcement learning algorithm |
CN117516530A (en) * | 2023-09-28 | 2024-02-06 | 中国科学院自动化研究所 | Robot target navigation method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106094516A (en) | Robot adaptive grasping method based on deep reinforcement learning | |
EP3880413B1 (en) | Method and system for trajectory optimization for vehicles with geometric constraints | |
US11745355B2 (en) | Control device, control method, and non-transitory computer-readable storage medium | |
EP3825903B1 (en) | Method, apparatus and storage medium for detecting small obstacles | |
CN113671994B (en) | Multi-unmanned aerial vehicle and multi-unmanned ship inspection control system based on reinforcement learning | |
CN105225269B (en) | Object modelling system based on motion | |
CN109108942B (en) | Mechanical arm motion control method and system based on visual real-time teaching and adaptive DMPS | |
CN104777839B (en) | Robot autonomous barrier-avoiding method based on BP neural network and range information | |
CN105014667B (en) | A Relative Pose Calibration Method of Camera and Robot Based on Pixel Space Optimization | |
CN103895042A (en) | Industrial robot workpiece positioning grabbing method and system based on visual guidance | |
CN114851201B (en) | A six-degree-of-freedom visual closed-loop grasping method for robotic arm based on TSDF 3D reconstruction | |
CN105425828A (en) | Robot anti-impact double-arm coordination control system based on sensor fusion technology | |
CN110744541A (en) | Vision-guided underwater mechanical arm control method | |
Taryudi et al. | Eye to hand calibration using ANFIS for stereo vision-based object manipulation system | |
CN114770461B (en) | Mobile robot based on monocular vision and automatic grabbing method thereof | |
CN103991077B (en) | A shared control method for robot hand controllers based on force fusion | |
CN103759716A (en) | Dynamic target position and attitude measurement method based on monocular vision at tail end of mechanical arm | |
Hsieh et al. | Robotic arm assistance system based on simple stereo matching and Q-learning optimization | |
US11769269B2 (en) | Fusing multiple depth sensing modalities | |
CN104476544A (en) | Self-adaptive dead zone inverse model generating device of visual servo mechanical arm system | |
CN112347900B (en) | An automatic grasping method of monocular vision underwater target based on distance estimation | |
Zhou et al. | Adaptive leader-follower formation control and obstacle avoidance via deep reinforcement learning | |
CN108151713A (en) | A kind of quick position and orientation estimation methods of monocular VO | |
CN115194774B (en) | A dual-arm grasping system control method based on multi-vision | |
CN102736626A (en) | Vision-based pose stabilization control method of moving trolley |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20161109 |