Planar grasp detection method based on computer vision and deep learning
Technical Field
The invention belongs to the technical field of robotic arm grasping, and particularly relates to a planar grasp detection method based on computer vision and deep learning.
Background
Robotic grasping still falls far short of human performance and remains an unsolved problem in the robotics field. When people see novel objects, they instinctively grasp any unknown object quickly and easily based on their own experience. A great deal of work on robotic grasping and manipulation has appeared in recent years, but real-time grasp detection remains a challenge. The robotic grasping problem can be divided into three successive stages: grasp detection, trajectory planning and execution. Grasp detection is a visual recognition problem in which a robot uses its sensors to detect graspable objects in its environment. The sensor used to perceive the robot's environment is typically a 3D vision system or an RGB-D camera. The key task is to predict candidate grasps from the sensor information and map pixel values to real-world coordinates. This is a critical step in performing the grasp, since the subsequent steps depend on the coordinates calculated here. The calculated real-world coordinates are then converted into the position and orientation of the robotic arm's end-effector tool. An optimal trajectory of the robotic arm is then planned to reach the target grasping location. Finally, the motion of the robotic arm is executed using either an open-loop or a closed-loop controller.
As robots become increasingly intelligent with ongoing research, there is a growing need for a general technique that can quickly and robustly detect grasps for any object the robot encounters. One of the most important problems is how to accurately transfer the knowledge learned by the robot to novel real-world objects, which requires not only a real-time and accurate algorithm but also the ability to generalize to new requirements.
Grasp detection methods fall mainly into two categories: analytical methods and empirical methods. Analytical methods constrain the grasp pose by designing force-closure constraints that satisfy conditions such as stability and flexibility according to the various parameters of the manipulator. Such methods can be understood as solving and optimizing a constraint problem based on dynamics and geometry. When the grasp pose satisfies the force-closure condition, the object is held by the gripper and does not translate or rotate under static friction, so the grasp remains stable. Grasp poses generated by analytical methods can guarantee successful grasping of the target object, but these methods only apply to simple, idealized models. The variability of real scenes, the randomness of object placement and the noise of image sensors increase the computational complexity on the one hand and make the accuracy of the computation impossible to guarantee on the other. Empirical methods detect grasp poses and judge their reasonableness using information in a knowledge base: based on object features, similarity is used for classification and pose estimation, thereby achieving the goal of grasping. Unlike analytical methods, parameters such as the friction coefficient of the target object need not be known in advance, so robustness is better; however, empirical methods usually cannot reconcile accuracy with real-time performance.
Disclosure of Invention
The invention aims to provide a planar grasp detection method based on computer vision and deep learning, characterized by comprising the following steps:
step 1: collecting or constructing a grasp data set, wherein the grasp data set comprises RGB images with corresponding annotation information and depth information; applying data augmentation (scaling, translation, flipping and rotation) to the data set to expand it;
step 2: preparing and dividing the training data from the expanded data set obtained in step 1; completing the depth map information with a depth completion algorithm and fusing the RGB image with the depth information; cropping and scaling the fused image to match the input format of the grasp detection model, and randomly dividing it into a training set and a validation set at a ratio of 9:1, used respectively for training and validating the grasp detection model;
step 3: training the proposed grasp detection model with the training data, using a back propagation algorithm and a standard gradient-based optimization algorithm to optimize the objective function so as to minimize the difference between the detected grasp rectangle and the ground truth; meanwhile, evaluating the grasp detection model on the validation set to adjust the learning rate during training and, to a certain extent, avoid overfitting of the grasp detection model; wherein the objective function is defined as:
L_total = L_boxes + L_Q + L_angle
wherein L_boxes is the bounding-box loss, L_Q is the grasp quality score loss and L_angle is the angle prediction loss;
step 4: with the trained grasp detection model, taking real image data as network input and producing the grasp quality score and the five-dimensional grasp representation as the model output; selecting the optimal grasp by ranking, converting it into the four vertices of the grasp rectangle for visualization, and finally mapping it to real-world coordinates;
the five-dimensional grasping representation is widely applied to related work in recent years; five-dimensional grabbing is represented as describing the grab box as:
g = {x, y, θ, h, w}
the output of the grabbing detection model is as follows:
g = {x, y, θ, h, w, Q}
where (x, y) is the center of the grasp rectangle, h and w are the height and width of the grasp rectangle respectively, θ is its orientation relative to the horizontal axis of the image, and Q is the grasp quality score, a value between 0 and 1 that evaluates the probability of a successful grasp; the larger Q is, the more feasible the grasp rectangle.
The grasp detection model (network model) is designed with a feature extractor at the front end and a grasp predictor at the back end; the feature extractor is formed by connecting and combining convolution modules, attention residual modules, cross-stage partial modules and other modules.
The fusion of the RGB image and the depth information comprises depth information extraction, depth map completion, and fusion of the RGB image with the completed depth map into an RGD image, wherein the RGD image is used as the model training data.
For the fusion of the RGB image and the depth information, two training modes are provided for the data: a single-modal mode trained directly on RGB data and a multi-modal mode trained on RGD data. The RGD data is obtained by replacing the B channel of the RGB image with the depth image, a design that realizes multi-modal input, provides more usable information and shows good results in experiments. For the extraction of depth information from the data set, the following formula is designed:
d = (z − Min) / (Max − Min), Min ≤ z ≤ Max
where (x, y, z) is a coordinate in the point cloud, Max is the upper depth limit set according to the scene and Min is the lower depth limit set according to the scene. Limiting this threshold range filters out invalid information to a certain extent and also realizes global normalization rather than per-image normalization, making the data more standardized; the normalized value is then multiplied by 255, matching the scale of the RGB channel values so that the RGD fusion condition is satisfied.
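As an illustrative sketch only (not the invention's actual implementation), the depth extraction and normalization described above may be written as follows; the function name, the use of NumPy, and the assumption that z is an organized HxW array taken from the point cloud are assumptions of the example.

```python
import numpy as np

def normalize_depth(z, d_min, d_max):
    """z: HxW array of point-cloud z values; d_min/d_max: scene-dependent depth limits."""
    valid = (z >= d_min) & (z <= d_max)               # limit the threshold range, filter invalid points
    d = np.zeros_like(z, dtype=np.float32)
    d[valid] = (z[valid] - d_min) / (d_max - d_min)   # global normalization, not per single image
    return (d * 255.0).astype(np.uint8)               # enlarge by 255 to match the RGB channel scale
```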
The grasp detection model is evaluated with the validation set. Channel replacement between the depth image and the RGB image is carried out, and data augmentation is then applied to each picture with the following strategy: random translation of 0-50 pixels up and down and random rotation of 0-360 degrees; finally, a fixed-size square region is cropped around the center and used as the training image. The grasp rectangle is predicted using the convolution module and convolution operations: the position and size of the grasp rectangle are regressed directly; the grasp quality score is also obtained by direct regression, with a sigmoid applied at the final output so that the predicted score lies in the range 0-1 and represents the grasp confidence well; the angle is predicted by classification.
The invention has the beneficial effects that:
1. it solves the problem that the traditional analytical method can only be applied to simple ideal models and cannot generalize;
2. it solves the problem that empirical grasp detection methods find it difficult to maintain accuracy while achieving generalization;
3. it solves the problem that the real-time performance of grasp detection methods is difficult to guarantee in real scenes.
Drawings
Fig. 1 is a schematic view of the applicable robotic arm grasping scene.
Fig. 2 is the five-dimensional representation of a grasp rectangle.
Fig. 3 is a schematic view of the grasp detection model.
Fig. 4 is a diagram showing the grasping results.
Detailed Description
The invention provides a planar grasp detection method based on computer vision and deep learning. A source data set is collected and organized, or a data set is made in-house for the grasp targets, and the data set is augmented by scaling, translation, flipping and rotation. The depth map information is completed with a depth completion algorithm, and the RGB image is fused with the depth information. The fused image is cropped and scaled to match the model input format and randomly divided into a training set and a validation set at a ratio of 9:1. A back propagation algorithm and a standard gradient-based optimization algorithm are used to optimize the objective function so that the difference between the detected grasp rectangle and the ground truth is minimized. Real image data is fed into the grasp detection model for grasp detection, and the results are visualized.
The overall robotic arm grasping scene is shown in Fig. 1. It mainly comprises a robotic arm, a parallel two-finger gripper, a depth camera, a computer, a controller, the object to be grasped and a platform. The depth camera photographs the object to be grasped on the platform to obtain an RGB image and depth information. The computer reads and processes the RGB image and depth information, and the implemented grasp detection algorithm detects a feasible grasp rectangle from the image information. The grasp rectangle is mapped to the robotic arm coordinate system and transmitted to the controller, which performs grasp trajectory planning and execution for the robotic arm.
The invention mainly addresses the visual part of the robotic arm grasping problem. In previous work, a five-dimensional grasp representation was proposed and has been widely used in recent years; it describes the grasp rectangle as:
g = {x, y, θ, h, w}
the five-dimensional grab is (x, y) as the center point of the grab frame, h and w are the height and width of the grab frame, respectively, and θ is its direction relative to the horizontal axis of the image.
The output of the grasp detection model is as follows:
g = {x, y, θ, h, w, Q}
where (x, y) is the center of the grasp rectangle, h and w are the height and width of the grasp rectangle respectively, θ is its orientation relative to the horizontal axis of the image, and Q is the grasp quality score, a value between 0 and 1 that evaluates the probability of a successful grasp; the larger Q is, the more feasible the grasp rectangle.
The five-dimensional representation has a low dimensionality and a small computational cost. Its feasibility has been demonstrated in recent work: the grasp can be well represented in the image coordinate system (as shown in Fig. 2), with h and w being fixed and defined by the shape of the grasp, respectively.
The five-dimensional representation can represent the grasp rectangle of a planar scene well, but only under the premise that reasonable grasp points actually exist in the scene. When there is no object in the scene, or no feasible grasp point on an object, a grasp prediction is still produced, which is unreasonable. The five-dimensional representation is therefore extended: Q is introduced as a grasp quality score, a value between 0 and 1 that evaluates the possibility of a successful grasp; the larger Q is, the more feasible the grasp rectangle. By setting an appropriate threshold, grasps with poor feasibility can be filtered out.
The invention provides two training modes: one trains directly on RGB data, i.e. a single-modal training mode; the other trains on RGD data, i.e. a multi-modal training mode. The RGD data is obtained by replacing the B channel of the RGB image with the depth image; this design realizes multi-modal input, provides more usable information and shows good results in experiments. For the extraction of depth information from the data set, the following formula is designed:
d = (z − Min) / (Max − Min), Min ≤ z ≤ Max
the design of limiting the threshold range can filter invalid information to a certain extent, and in addition, global normalization can be realized, and the normalization is not performed aiming at a single image, so that the data is more standardized. The normalized value is enlarged by 255 times, and the scale of the RGB channel value is adjusted to meet the RGD fusion condition. Due to the limitation of equipment, partial data of the point cloud is often lost, so that a complete depth map cannot be obtained. Aiming at the problem, the invention uses a deep completion method to make a corresponding mask file and uses an NS method in OpenCV to repair the mask file.
Channel replacement between the depth image and the RGB image is carried out, and data augmentation is then applied to each picture with the following strategy: random translation of 0-50 pixels up and down and random rotation of 0-360 degrees. Finally, a fixed-size square region is cropped around the center and used as the training image. The training images are randomly divided into a training set and a validation set at a ratio of 9:1, used for training the model and testing the model respectively.
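A minimal sketch of the translation, rotation and center-crop part of this augmentation strategy is shown below, assuming OpenCV and NumPy; the handling of the grasp annotations (in particular the sign of the angle update) depends on the angle convention and is illustrative only.

```python
import cv2
import numpy as np

def augment_sample(image, grasp_boxes, max_shift=50):
    """image: HxW(xC) array; grasp_boxes: Nx5 array of (x, y, theta, h, w), theta in radians."""
    rows, cols = image.shape[:2]
    boxes = grasp_boxes.astype(np.float32).copy()

    # random translation of up to 50 pixels
    tx, ty = np.random.randint(-max_shift, max_shift + 1, size=2)
    M = np.float32([[1, 0, tx], [0, 1, ty]])
    image = cv2.warpAffine(image, M, (cols, rows))
    boxes[:, 0] += tx
    boxes[:, 1] += ty

    # random rotation of 0-360 degrees about the image centre
    angle = np.random.uniform(0.0, 360.0)
    R = cv2.getRotationMatrix2D((cols / 2.0, rows / 2.0), angle, 1.0)
    image = cv2.warpAffine(image, R, (cols, rows))
    centres = np.hstack([boxes[:, :2], np.ones((len(boxes), 1), np.float32)])
    boxes[:, :2] = centres @ R.T            # rotate the grasp centres with the image
    boxes[:, 2] -= np.deg2rad(angle)        # adjust orientation (sign depends on the angle convention)

    # fixed-size square region cropped around the centre, used as the training image
    side = min(rows, cols)
    top, left = (rows - side) // 2, (cols - side) // 2
    image = image[top:top + side, left:left + side]
    boxes[:, 0] -= left
    boxes[:, 1] -= top
    return image, boxes
```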
The grasp detection model designed by the invention comprises a front-end feature extractor and a back-end grasp predictor, as shown in Fig. 3. Deep convolutional networks have strong feature extraction capability in fields such as image classification and object detection, so a deep convolutional network is designed as the backbone for feature extraction. It is mainly built by combining and connecting convolution modules, attention residual modules, cross-stage partial modules and the like. The network has sufficient depth, contains cross-stage connections, and offers strong feature extraction capability and efficiency.
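For illustration, the convolution module and the attention residual module named above might be sketched in PyTorch as follows; the channel widths, the squeeze-and-excitation style channel attention and the omission of the cross-stage partial module are assumptions of the example, not the invention's actual backbone.

```python
import torch
import torch.nn as nn

class ConvModule(nn.Module):
    """Convolution module: 3x3 convolution + batch norm + ReLU."""
    def __init__(self, c_in, c_out, stride=1):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, stride, 1, bias=False),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class AttentionResidualBlock(nn.Module):
    """Residual block with a simple channel-attention gate (assumes channels >= reduction)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.body = nn.Sequential(ConvModule(channels, channels), ConvModule(channels, channels))
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        y = self.body(x)
        return x + y * self.attn(y)   # residual connection with channel-wise attention weighting
```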
The grasp predictor part predicts the grasp rectangle using the convolution module and convolution operations: the position and size of the grasp rectangle are regressed directly; the grasp quality score is also obtained by direct regression, with a sigmoid applied at the final output so that the predicted score lies in the range 0-1 and represents the grasp confidence well; the angle is predicted by classification.
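An illustrative sketch of such a grasp predictor head is given below; whether the outputs are per-pixel maps or a single vector, and the number of angle bins, are not specified by the invention and are assumed here.

```python
import torch
import torch.nn as nn

class GraspPredictor(nn.Module):
    def __init__(self, in_channels, angle_bins=18):
        super().__init__()
        self.box_head = nn.Conv2d(in_channels, 4, 1)             # direct regression of x, y, h, w
        self.quality_head = nn.Conv2d(in_channels, 1, 1)         # grasp quality logit
        self.angle_head = nn.Conv2d(in_channels, angle_bins, 1)  # angle class logits

    def forward(self, feat):
        boxes = self.box_head(feat)
        quality = torch.sigmoid(self.quality_head(feat))         # confidence constrained to [0, 1]
        angle_logits = self.angle_head(feat)
        return boxes, quality, angle_logits
```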
The loss function of the model is divided into three parts, where L_boxes is the bounding-box loss, L_Q is the grasp quality score loss and L_angle is the angle prediction loss, corresponding to the three outputs of the network respectively. A back propagation algorithm and a standard gradient-based optimization algorithm are used to optimize the objective function so that the difference between the detected grasp rectangle and the ground truth is minimized.
L_total = L_boxes + L_Q + L_angle
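One possible concrete form of this three-part loss is sketched below; the invention does not fix the individual loss terms, so the smooth-L1 box loss, binary cross-entropy quality loss and cross-entropy angle-classification loss are illustrative choices.

```python
import torch
import torch.nn.functional as F

def total_loss(pred_boxes, pred_quality, pred_angle_logits,
               gt_boxes, gt_quality, gt_angle_bins):
    l_boxes = F.smooth_l1_loss(pred_boxes, gt_boxes)               # box position/size regression loss
    l_q = F.binary_cross_entropy(pred_quality, gt_quality)         # quality score loss (inputs in [0, 1])
    l_angle = F.cross_entropy(pred_angle_logits, gt_angle_bins)    # angle classification loss
    return l_boxes + l_q + l_angle                                 # L_total = L_boxes + L_Q + L_angle
```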
With the grasp detection model obtained from training, real image data is used as the network input, the {x, y, θ, h, w} five-dimensional representation of the grasp rectangle is produced as output, and the grasp rectangle is converted into its four vertices. Meanwhile, the quality scores of the grasp rectangles are ranked, the rectangles whose quality scores exceed the set threshold are kept, and the rectangle with the highest quality score is output and visualized, as shown in Fig. 4. Then the camera intrinsics and extrinsics are calibrated with Zhang Zhengyou's calibration method, and the pixel points in the image are mapped to three-dimensional coordinates in the real world.
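The post-processing can be sketched as follows; the metric depth map, the intrinsic matrix K obtained from calibration and the threshold value are assumptions of the example, and the final transform into the robotic arm coordinate system (using the extrinsics) is omitted.

```python
import numpy as np

def select_and_project(grasps, depth_m, K, quality_thresh=0.5):
    """grasps: list of (x, y, theta, h, w, Q); depth_m: metric depth map; K: 3x3 camera intrinsics."""
    kept = [g for g in grasps if g[5] >= quality_thresh]   # keep grasps above the quality threshold
    if not kept:
        return None
    best = max(kept, key=lambda g: g[5])                   # grasp rectangle with the highest quality score
    u, v = int(round(best[0])), int(round(best[1]))
    z = depth_m[v, u]                                      # depth at the grasp centre pixel
    x_cam = (u - K[0, 2]) * z / K[0, 0]                    # (u - cx) * z / fx
    y_cam = (v - K[1, 2]) * z / K[1, 1]                    # (v - cy) * z / fy
    return best, np.array([x_cam, y_cam, z])               # camera-frame point; extrinsics applied afterwards
```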
Compared with other algorithms, the algorithm provided by the invention achieves higher accuracy and efficiency and performs well in real scenes.
The present invention is not limited to the above embodiments, and any changes or substitutions that can be easily made by those skilled in the art within the technical scope of the present invention are also within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.