Planar grasp detection method based on computer vision and deep learning
Technical Field
The invention belongs to the technical field of robotic arm grasping, and particularly relates to a planar grasp detection method based on computer vision and deep learning.
Background
Robotic grasping still falls far short of human performance and remains an unsolved problem in the robotics field. When people see novel objects, they instinctively grasp any unknown object quickly and easily based on their own experience. A great deal of work on robotic grasping and manipulation has appeared in recent years, but real-time grasp detection remains a challenge. The robotic grasping problem can be divided into three successive stages: grasp detection, trajectory planning and execution. Grasp detection is a visual recognition problem in which a robot uses its sensors to detect graspable objects in its environment. The sensor used to perceive the robot's environment is typically a 3D vision system or an RGB-D camera. The key task is to predict candidate grasps from the sensor information and map pixel values to real-world coordinates. This is a critical step in performing the grasp, since the subsequent steps depend on the coordinates calculated here. The calculated real-world coordinates are then converted into the position and orientation of the robotic arm's end-effector tool. An optimal trajectory of the robotic arm is then planned to reach the target grasping location. Finally, the motion of the robotic arm is executed using either an open-loop or a closed-loop controller.
As robots become increasingly intelligent with ongoing research, there is a growing need for a general technique that can quickly and robustly detect grasps for any object the robot encounters. One of the most important problems is how to accurately transfer the knowledge learned by the robot to novel real-world objects, which requires not only a real-time and accurate algorithm but also the ability to generalize to new requirements.
Grasp detection methods fall mainly into two categories: analytical methods and empirical methods. Analytical methods constrain the grasp pose by designing force-closure constraints that satisfy conditions such as stability and flexibility according to the various parameters of the manipulator. Such methods can be understood as solving and optimizing a constraint problem based on dynamics and geometry. When the grasp pose satisfies the force-closure condition, the object is held by the gripper and does not translate or rotate under static friction, so the grasp remains stable. Grasp poses generated by analytical methods can guarantee successful grasping of the target object, but these methods only apply to simple, idealized models. The variability of real scenes, the randomness of object placement and the noise of image sensors increase the computational complexity on the one hand and make the accuracy of the computation impossible to guarantee on the other. Empirical methods detect grasp poses and judge their reasonableness using information in a knowledge base: based on object features, similarity is used for classification and pose estimation, thereby achieving the goal of grasping. Unlike analytical methods, parameters such as the friction coefficient of the target object need not be known in advance, so robustness is better; however, empirical methods usually cannot reconcile accuracy with real-time performance.
Disclosure of Invention
The invention aims to provide a planar grasp detection method based on computer vision and deep learning, characterized by comprising the following steps:
step 1: collecting or constructing a grasp data set, wherein the grasp data set comprises RGB images with corresponding annotation information and depth information; applying data augmentation (scaling, translation, flipping and rotation) to the data set to expand it;
step 2: preparing and dividing the training data from the expanded data set obtained in step 1; completing the depth map information with a depth completion algorithm and fusing the RGB image with the depth information; cropping and scaling the fused image to match the input format of the grasp detection model, and randomly dividing it into a training set and a validation set at a ratio of 9:1, used respectively for training and validating the grasp detection model;
step 3: training the proposed grasp detection model with the training data, using a back propagation algorithm and a standard gradient-based optimization algorithm to optimize the objective function so as to minimize the difference between the detected grasp rectangle and the ground truth; meanwhile, evaluating the grasp detection model on the validation set to adjust the learning rate during training and, to a certain extent, avoid overfitting of the grasp detection model; wherein the objective function is defined as:
L_total = L_boxes + L_Q + L_angle
wherein L_boxes is the bounding-box loss, L_Q is the grasp quality score loss and L_angle is the angle prediction loss;
step 4: with the trained grasp detection model, taking real image data as network input and producing the grasp quality score and the five-dimensional grasp representation as the model output; selecting the optimal grasp by ranking, converting it into the four vertices of the grasp rectangle for visualization, and finally mapping it to real-world coordinates;
the five-dimensional grasping representation is widely applied to related work in recent years; five-dimensional grabbing is represented as describing the grab box as:
g = {x, y, θ, h, w}
the output of the grabbing detection model is as follows:
g = {x, y, θ, h, w, Q}
where (x, y) is the center of the grasp rectangle, h and w are the height and width of the grasp rectangle respectively, θ is its orientation relative to the horizontal axis of the image, and Q is the grasp quality score, a value between 0 and 1 that evaluates the probability of a successful grasp; the larger Q is, the more feasible the grasp rectangle.
The grasp detection model (network model) is designed with a feature extractor at the front end and a grasp predictor at the back end; the feature extractor is formed by connecting and combining convolution modules, attention residual modules, cross-stage partial modules and other modules.
The fusion of the RGB image and the depth information comprises depth information extraction, depth map completion, and fusion of the RGB image with the completed depth map into an RGD image, wherein the RGD image is used as the model training data.
For the fusion of the RGB image and the depth information, two training modes are provided for the data: a single-modal mode trained directly on RGB data and a multi-modal mode trained on RGD data. The RGD data is obtained by replacing the B channel of the RGB image with the depth image, a design that realizes multi-modal input, provides more usable information and shows good results in experiments. For the extraction of depth information from the data set, the following formula is designed:
d = (z − Min) / (Max − Min), Min ≤ z ≤ Max
where (x, y, z) is a coordinate in the point cloud, Max is the upper depth limit set according to the scene and Min is the lower depth limit set according to the scene. Limiting this threshold range filters out invalid information to a certain extent and also realizes global normalization rather than per-image normalization, making the data more standardized; the normalized value is then multiplied by 255, matching the scale of the RGB channel values so that the RGD fusion condition is satisfied.
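As an illustrative sketch only (not the invention's actual implementation), the depth extraction and normalization described above may be written as follows; the function name, the use of NumPy, and the assumption that z is an organized HxW array taken from the point cloud are assumptions of the example.

```python
import numpy as np

def normalize_depth(z, d_min, d_max):
    """z: HxW array of point-cloud z values; d_min/d_max: scene-dependent depth limits."""
    valid = (z >= d_min) & (z <= d_max)               # limit the threshold range, filter invalid points
    d = np.zeros_like(z, dtype=np.float32)
    d[valid] = (z[valid] - d_min) / (d_max - d_min)   # global normalization, not per single image
    return (d * 255.0).astype(np.uint8)               # enlarge by 255 to match the RGB channel scale
```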
The grasp detection model is evaluated with the validation set. Channel replacement between the depth image and the RGB image is carried out, and data augmentation is then applied to each picture with the following strategy: random translation of 0-50 pixels up and down and random rotation of 0-360 degrees; finally, a fixed-size square region is cropped around the center and used as the training image. The grasp rectangle is predicted using the convolution module and convolution operations: the position and size of the grasp rectangle are regressed directly; the grasp quality score is also obtained by direct regression, with a sigmoid applied at the final output so that the predicted score lies in the range 0-1 and represents the grasp confidence well; the angle is predicted by classification.
The invention has the beneficial effects that:
1. it solves the problem that the traditional analytical method can only be applied to simple ideal models and cannot generalize;
2. it solves the problem that empirical grasp detection methods find it difficult to maintain accuracy while achieving generalization;
3. it solves the problem that the real-time performance of grasp detection methods is difficult to guarantee in real scenes.
Drawings
Fig. 1 is a schematic view of the applicable robotic arm grasping scene.
Fig. 2 is the five-dimensional representation of a grasp rectangle.
Fig. 3 is a schematic view of the grasp detection model.
Fig. 4 is a diagram showing the grasping results.
Detailed Description
The invention provides a planar grasp detection method based on computer vision and deep learning. A source data set is collected and organized, or a data set is made in-house for the grasp targets, and the data set is augmented by scaling, translation, flipping and rotation. The depth map information is completed with a depth completion algorithm, and the RGB image is fused with the depth information. The fused image is cropped and scaled to match the model input format and randomly divided into a training set and a validation set at a ratio of 9:1. A back propagation algorithm and a standard gradient-based optimization algorithm are used to optimize the objective function so that the difference between the detected grasp rectangle and the ground truth is minimized. Real image data is fed into the grasp detection model for grasp detection, and the results are visualized.
The overall robotic arm grasping scene is shown in Fig. 1. It mainly comprises a robotic arm, a parallel two-finger gripper, a depth camera, a computer, a controller, the object to be grasped and a platform. The depth camera photographs the object to be grasped on the platform to obtain an RGB image and depth information. The computer reads and processes the RGB image and depth information, and the implemented grasp detection algorithm detects a feasible grasp rectangle from the image information. The grasp rectangle is mapped to the robotic arm coordinate system and transmitted to the controller, which performs grasp trajectory planning and execution for the robotic arm.
The invention mainly addresses the visual part of the robotic arm grasping problem. In previous work, a five-dimensional grasp representation was proposed and has been widely used in recent years; it describes the grasp rectangle as:
g = {x, y, θ, h, w}
the five-dimensional grab is (x, y) as the center point of the grab frame, h and w are the height and width of the grab frame, respectively, and θ is its direction relative to the horizontal axis of the image.
The output of the grasp detection model is as follows:
g = {x, y, θ, h, w, Q}
where (x, y) is the center of the grasp rectangle, h and w are the height and width of the grasp rectangle respectively, θ is its orientation relative to the horizontal axis of the image, and Q is the grasp quality score, a value between 0 and 1 that evaluates the probability of a successful grasp; the larger Q is, the more feasible the grasp rectangle.
The five-dimensional representation has a low dimensionality and a small computational cost. Its feasibility has been demonstrated in recent work: the grasp can be well represented in the image coordinate system (as shown in Fig. 2), with h and w being fixed and defined by the shape of the grasp, respectively.
The five-dimensional representation can represent the grasp rectangle of a planar scene well, but only under the premise that reasonable grasp points actually exist in the scene. When there is no object in the scene, or no feasible grasp point on an object, a grasp prediction is still produced, which is unreasonable. The five-dimensional representation is therefore extended: Q is introduced as a grasp quality score, a value between 0 and 1 that evaluates the possibility of a successful grasp; the larger Q is, the more feasible the grasp rectangle. By setting an appropriate threshold, grasps with poor feasibility can be filtered out.
The invention provides two training modes: one trains directly on RGB data, i.e. a single-modal training mode; the other trains on RGD data, i.e. a multi-modal training mode. The RGD data is obtained by replacing the B channel of the RGB image with the depth image; this design realizes multi-modal input, provides more usable information and shows good results in experiments. For the extraction of depth information from the data set, the following formula is designed:
d = (z − Min) / (Max − Min), Min ≤ z ≤ Max
the design of limiting the threshold range can filter invalid information to a certain extent, and in addition, global normalization can be realized, and the normalization is not performed aiming at a single image, so that the data is more standardized. The normalized value is enlarged by 255 times, and the scale of the RGB channel value is adjusted to meet the RGD fusion condition. Due to the limitation of equipment, partial data of the point cloud is often lost, so that a complete depth map cannot be obtained. Aiming at the problem, the invention uses a deep completion method to make a corresponding mask file and uses an NS method in OpenCV to repair the mask file.
Channel replacement between the depth image and the RGB image is carried out, and data augmentation is then applied to each picture with the following strategy: random translation of 0-50 pixels up and down and random rotation of 0-360 degrees. Finally, a fixed-size square region is cropped around the center and used as the training image. The training images are randomly divided into a training set and a validation set at a ratio of 9:1, used for training the model and testing the model respectively.
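A minimal sketch of the translation, rotation and center-crop part of this augmentation strategy is shown below, assuming OpenCV and NumPy; the handling of the grasp annotations (in particular the sign of the angle update) depends on the angle convention and is illustrative only.

```python
import cv2
import numpy as np

def augment_sample(image, grasp_boxes, max_shift=50):
    """image: HxW(xC) array; grasp_boxes: Nx5 array of (x, y, theta, h, w), theta in radians."""
    rows, cols = image.shape[:2]
    boxes = grasp_boxes.astype(np.float32).copy()

    # random translation of up to 50 pixels
    tx, ty = np.random.randint(-max_shift, max_shift + 1, size=2)
    M = np.float32([[1, 0, tx], [0, 1, ty]])
    image = cv2.warpAffine(image, M, (cols, rows))
    boxes[:, 0] += tx
    boxes[:, 1] += ty

    # random rotation of 0-360 degrees about the image centre
    angle = np.random.uniform(0.0, 360.0)
    R = cv2.getRotationMatrix2D((cols / 2.0, rows / 2.0), angle, 1.0)
    image = cv2.warpAffine(image, R, (cols, rows))
    centres = np.hstack([boxes[:, :2], np.ones((len(boxes), 1), np.float32)])
    boxes[:, :2] = centres @ R.T            # rotate the grasp centres with the image
    boxes[:, 2] -= np.deg2rad(angle)        # adjust orientation (sign depends on the angle convention)

    # fixed-size square region cropped around the centre, used as the training image
    side = min(rows, cols)
    top, left = (rows - side) // 2, (cols - side) // 2
    image = image[top:top + side, left:left + side]
    boxes[:, 0] -= left
    boxes[:, 1] -= top
    return image, boxes
```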
The grasp detection model designed by the invention comprises a front-end feature extractor and a back-end grasp predictor, as shown in Fig. 3. Deep convolutional networks have strong feature extraction capability in fields such as image classification and object detection, so a deep convolutional network is designed as the backbone for feature extraction. It is mainly built by combining and connecting convolution modules, attention residual modules, cross-stage partial modules and the like. The network has sufficient depth, contains cross-stage connections, and offers strong feature extraction capability and efficiency.
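For illustration, the convolution module and the attention residual module named above might be sketched in PyTorch as follows; the channel widths, the squeeze-and-excitation style channel attention and the omission of the cross-stage partial module are assumptions of the example, not the invention's actual backbone.

```python
import torch
import torch.nn as nn

class ConvModule(nn.Module):
    """Convolution module: 3x3 convolution + batch norm + ReLU."""
    def __init__(self, c_in, c_out, stride=1):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, stride, 1, bias=False),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class AttentionResidualBlock(nn.Module):
    """Residual block with a simple channel-attention gate (assumes channels >= reduction)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.body = nn.Sequential(ConvModule(channels, channels), ConvModule(channels, channels))
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        y = self.body(x)
        return x + y * self.attn(y)   # residual connection with channel-wise attention weighting
```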
The grasp predictor part predicts the grasp rectangle using the convolution module and convolution operations: the position and size of the grasp rectangle are regressed directly; the grasp quality score is also obtained by direct regression, with a sigmoid applied at the final output so that the predicted score lies in the range 0-1 and represents the grasp confidence well; the angle is predicted by classification.
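An illustrative sketch of such a grasp predictor head is given below; whether the outputs are per-pixel maps or a single vector, and the number of angle bins, are not specified by the invention and are assumed here.

```python
import torch
import torch.nn as nn

class GraspPredictor(nn.Module):
    def __init__(self, in_channels, angle_bins=18):
        super().__init__()
        self.box_head = nn.Conv2d(in_channels, 4, 1)             # direct regression of x, y, h, w
        self.quality_head = nn.Conv2d(in_channels, 1, 1)         # grasp quality logit
        self.angle_head = nn.Conv2d(in_channels, angle_bins, 1)  # angle class logits

    def forward(self, feat):
        boxes = self.box_head(feat)
        quality = torch.sigmoid(self.quality_head(feat))         # confidence constrained to [0, 1]
        angle_logits = self.angle_head(feat)
        return boxes, quality, angle_logits
```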
The loss function of the model is divided into three parts, where L_boxes is the bounding-box loss, L_Q is the grasp quality score loss and L_angle is the angle prediction loss, corresponding to the three outputs of the network respectively. A back propagation algorithm and a standard gradient-based optimization algorithm are used to optimize the objective function so that the difference between the detected grasp rectangle and the ground truth is minimized.
L_total = L_boxes + L_Q + L_angle
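One possible concrete form of this three-part loss is sketched below; the invention does not fix the individual loss terms, so the smooth-L1 box loss, binary cross-entropy quality loss and cross-entropy angle-classification loss are illustrative choices.

```python
import torch
import torch.nn.functional as F

def total_loss(pred_boxes, pred_quality, pred_angle_logits,
               gt_boxes, gt_quality, gt_angle_bins):
    l_boxes = F.smooth_l1_loss(pred_boxes, gt_boxes)               # box position/size regression loss
    l_q = F.binary_cross_entropy(pred_quality, gt_quality)         # quality score loss (inputs in [0, 1])
    l_angle = F.cross_entropy(pred_angle_logits, gt_angle_bins)    # angle classification loss
    return l_boxes + l_q + l_angle                                 # L_total = L_boxes + L_Q + L_angle
```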
With the grasp detection model obtained from training, real image data is used as the network input, the {x, y, θ, h, w} five-dimensional representation of the grasp rectangle is produced as output, and the grasp rectangle is converted into its four vertices. Meanwhile, the quality scores of the grasp rectangles are ranked, the rectangles whose quality scores exceed the set threshold are kept, and the rectangle with the highest quality score is output and visualized, as shown in Fig. 4. Then the camera intrinsics and extrinsics are calibrated with Zhang Zhengyou's calibration method, and the pixel points in the image are mapped to three-dimensional coordinates in the real world.
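The post-processing can be sketched as follows; the metric depth map, the intrinsic matrix K obtained from calibration and the threshold value are assumptions of the example, and the final transform into the robotic arm coordinate system (using the extrinsics) is omitted.

```python
import numpy as np

def select_and_project(grasps, depth_m, K, quality_thresh=0.5):
    """grasps: list of (x, y, theta, h, w, Q); depth_m: metric depth map; K: 3x3 camera intrinsics."""
    kept = [g for g in grasps if g[5] >= quality_thresh]   # keep grasps above the quality threshold
    if not kept:
        return None
    best = max(kept, key=lambda g: g[5])                   # grasp rectangle with the highest quality score
    u, v = int(round(best[0])), int(round(best[1]))
    z = depth_m[v, u]                                      # depth at the grasp centre pixel
    x_cam = (u - K[0, 2]) * z / K[0, 0]                    # (u - cx) * z / fx
    y_cam = (v - K[1, 2]) * z / K[1, 1]                    # (v - cy) * z / fy
    return best, np.array([x_cam, y_cam, z])               # camera-frame point; extrinsics applied afterwards
```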
Compared with other algorithms, the algorithm provided by the invention achieves higher accuracy and efficiency and performs well in real scenes.
The present invention is not limited to the above embodiments, and any changes or substitutions that can be easily made by those skilled in the art within the technical scope of the present invention are also within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.