Disclosure of Invention
The invention provides a vision-based mechanical arm autonomous grabbing method that overcomes at least one defect of the prior art; the grabbing data collected by the method in a simulation platform is favorable for model learning.
In order to solve the above technical problems, the invention adopts the following technical scheme: a vision-based mechanical arm autonomous grabbing method, comprising the following steps:
S1, in a simulation environment, building an environment similar to the real scene, and collecting a global image;
S2, preprocessing the data, wherein the preprocessed data comprise: a global image containing the information of the whole working space, an object mask, and a label graph with the same scale as the global image; the preprocessing comprises: firstly, generating an object mask according to the position set of the pixels where the object is located in the image, then generating a label mask according to the object mask, the grabbing pixel position and the grabbing label, and generating a label graph by using the grabbing position and the grabbing label; and then discretizing the grabbing angle according to the grabbing problem definition;
S3, training a deep neural network:
(1) normalizing the input RGB images, and then synthesizing a batch;
(2) transmitting the batch of data into a fully convolutional neural network to obtain an output value;
(3) calculating the error between the predicted value and the label according to the cross-entropy error combined with the label mask, using the following loss function:

L(Y, F) = −(1/(H·W)) · Σ_{i=1..H} Σ_{j=1..W} Σ_{k=1..3} M_ij · Y_ijk · log( exp(F_ijk) / Σ_{l=1..3} exp(F_ijl) )

wherein Y is the label graph, M is the label mask, H and W are respectively the length and width of the label graph, i, j and k are index subscripts of a position in the 3-channel image, l is the index over the channels, F ∈ ℝ^{H×W×3} is the output feature map of the last convolutional layer, and ℝ represents the real number domain, the corresponding superscript representing the dimension of the tensor;
and S4, applying the trained model to a real grabbing environment.
The invention provides a vision-based mechanical arm autonomous grabbing method which, by acquiring a small amount of grabbing data in a simulation environment, trains an end-to-end deep neural network capable of pixel-level grabbing prediction; the learned model can be directly applied to a real grabbing scene. The whole process requires neither domain adaptation nor domain randomization, and does not need any data collected in a real environment.
Further, the step S1 specifically includes:
S11, placing a background texture, a mechanical arm with a gripper, a camera and an object to be grabbed in the working space of the simulation environment;
S12, placing the object in the working space, selecting a position where the object exists by using the camera, recording the image information, the pixel position corresponding to the grabbing point, the mask of the object in the image and the grabbing angle, and then randomly selecting an angle for the mechanical arm to perform trial-and-error grabbing;
S13, judging whether the grabbing is successful; if the grabbing fails, directly storing the image I, the position set C of the pixels where the object is located in the image, the pixel position p corresponding to the grabbing point, the grabbing angle ψ and the grabbing-failure label l; if the grabbing succeeds, recording again the global image I′ and the corresponding position set C′ of the pixels where the object is located in the image, and then storing the image I′, the position set C′, the pixel position p corresponding to the grabbing point, the grabbing angle ψ and the grabbing-success label l.
Further, the definition of the grabbing problem comprises: defining the vertical planar grasp as g = (p, ω, η), where p = (x, y, z) denotes the position of the grasp point in Cartesian coordinates, ω ∈ [0, 2π) denotes the rotation angle of the end effector, and η is a 3-dimensional one-hot code used for representing the grabbing function; the grabbing function is divided into three categories, namely graspable, non-graspable and background; when projected into image space, a grasp in image I may be represented as g̃ = (p̃, φ_i, η), wherein p̃ = (h, w) indicates the position of the grasp in the image and φ_i represents a discretized grabbing angle; each pixel in the image may define a grabbing function, so the entire grabbing function graph may be represented as the set {C_i, i = 1, …, N}, wherein C_i ∈ ℝ^{H×W×3} is the grabbing function graph of the image at the given i-th angle; in the graph, the 3 channels respectively represent the three categories of graspable, non-graspable and background; from each grabbing function graph C_i the first channel is taken, and the channels are combined together to form G ∈ ℝ^{N×H×W}, wherein ℝ represents the real number domain and the corresponding superscript represents the dimension of the tensor.
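For illustration only, the definitions above can be written down as a small data structure; the names below (Grasp, ImageGrasp, the category constants) are hypothetical and merely make the notation g = (p, ω, η) and its image-space projection concrete.

```python
from dataclasses import dataclass
from typing import Tuple
import numpy as np

# The three grabbing-function categories encoded by the 3-dimensional one-hot code eta.
GRASPABLE, NON_GRASPABLE, BACKGROUND = 0, 1, 2

def one_hot(category: int, num_classes: int = 3) -> np.ndarray:
    """Return the 3-dimensional one-hot code eta for a grabbing-function category."""
    eta = np.zeros(num_classes, dtype=np.float32)
    eta[category] = 1.0
    return eta

@dataclass
class Grasp:
    """Vertical planar grasp g = (p, omega, eta) in Cartesian space."""
    p: Tuple[float, float, float]   # grasp point (x, y, z)
    omega: float                    # end-effector rotation angle in [0, 2*pi)
    eta: np.ndarray                 # 3-dimensional one-hot grabbing-function code

@dataclass
class ImageGrasp:
    """Grasp projected into image space: pixel position (h, w) and discretized angle index i."""
    h: int
    w: int
    angle_index: int                # index i of the discretized angle phi_i
    eta: np.ndarray

# Example: a graspable point at pixel (120, 96) under the 3rd discretized angle.
g_img = ImageGrasp(h=120, w=96, angle_index=3, eta=one_hot(GRASPABLE))
```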
Further, the most robust grabbing point is obtained by solving the following formula:

i*, h*, w* = argmax_{i, h, w} G(i, h, w)

where G(i, h, w) represents the confidence of the graspable function at rotation-angle index i and image position (h, w); (h*, w*) is the position to be reached by the mechanical arm end effector in image space, and i* indicates the angle φ_{i*} by which the end effector rotates before grabbing is performed.
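A minimal NumPy sketch of this argmax, assuming the graspable channels have already been stacked into a tensor G of shape (N, H, W), one slice per discretized angle; the variable names are illustrative only.

```python
import numpy as np

def most_robust_grasp(G: np.ndarray):
    """Return (i*, h*, w*) maximizing the graspable confidence G(i, h, w).

    G: array of shape (N, H, W) holding the graspable channel of each
       grabbing function graph C_i, one slice per discretized angle.
    """
    flat_index = np.argmax(G)                        # index of the global maximum
    i_star, h_star, w_star = np.unravel_index(flat_index, G.shape)
    return int(i_star), int(h_star), int(w_star)

# Usage with a dummy tensor of 16 angles on a 224x224 image:
G = np.random.rand(16, 224, 224).astype(np.float32)
i_star, h_star, w_star = most_robust_grasp(G)
phi_star = i_star * np.pi / 16                       # e.g. 16 angles spanning [0, pi)
```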
Further, during the training process, a parameterized function f_θ is defined to realize the pixel-level mapping from the image to the grabbing function graph, which can be expressed as:

C_i = f_θ(I_i)

where I_i is the image I rotated by the angle φ_i and C_i is the grabbing function graph corresponding to I_i; f_θ is implemented with a deep neural network; in conjunction with the loss function, the overall training objective may be defined by the following equation:

θ* = argmin_θ Σ_i L(Y_i, f_θ(I_i))

wherein Y_i denotes the label graph corresponding to I_i.
Further, considering a scene in which only one object is placed in the working space, c1 and c2 are defined as the contact points of the two fingers of the gripper with the object, n1 and n2 are their corresponding surface normal vectors, and g is defined as the grabbing direction of the gripper in image space, where c1, c2, n1, n2, g ∈ ℝ². By the above definitions it is possible to obtain:

ω1 = arccos( (n1 · g) / (||n1|| ||g||) ),  ω2 = arccos( (n2 · g) / (||n2|| ||g||) )

wherein ||·|| represents the norm operation; a grabbing operation is defined as an antipodal grasp when it satisfies the following condition:

ω1 ≤ θ1 and ω2 ≥ θ2

wherein θ1 and θ2 are non-negative thresholds tending to 0 and π respectively, representing the thresholds of the angles between the grabbing direction and the surface normal vectors at the two contact points with the object, and ω1 and ω2 are the angles between the grabbing direction and the surface normal vectors at the two contact points; when the gripper grabbing direction is parallel to the normal vectors of the contact points, the grasp is defined as a stable antipodal grasp.
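The condition above can be checked numerically as in the following sketch; the threshold values and the 2-D example vectors are illustrative assumptions, not values fixed by the invention.

```python
import numpy as np

def is_antipodal(g, n1, n2, theta1=np.deg2rad(15), theta2=np.deg2rad(165)):
    """Check the antipodal-grasp condition in image space.

    g      : 2-D grabbing direction of the gripper.
    n1, n2 : surface normal vectors at the two finger contact points c1 and c2.
    theta1, theta2 : angle thresholds close to 0 and pi respectively.
    """
    def angle(u, v):
        # omega = arccos( (u . v) / (||u|| ||v||) )
        cosine = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
        return np.arccos(np.clip(cosine, -1.0, 1.0))

    omega1 = angle(n1, g)
    omega2 = angle(n2, g)
    return omega1 <= theta1 and omega2 >= theta2

# Example: grabbing direction aligned with n1 and opposite to n2 -> antipodal grasp.
print(is_antipodal(g=[1.0, 0.0], n1=[1.0, 0.0], n2=[-1.0, 0.0]))  # True
```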
All data are collected in the simulation platform without any real data, which avoids the problems that may arise from collecting data in a real environment. In the simulation platform, the antipodal grasping rule is applied, so that the acquired data effectively reflect the corresponding grabbing mode, and only a very small amount of grabbing data is needed for the trained model to be directly applied to a real grabbing scene. The invention realizes end-to-end pixel-level grabbing function prediction with a fully convolutional neural network. Each output pixel can capture the global information of the input image, which enables the model to learn more efficiently and make more accurate predictions.
Further, the step S4 includes:
S41, acquiring an RGB (red, green and blue) image and a depth image of the working space by using a camera;
S42, normalizing the RGB image, rotating it to 16 angles, and transmitting the rotated images into the model to obtain 16 grabbing function graphs;
S43, according to the definition of the grabbing problem, taking the first channel of each function graph and combining them, then obtaining the position corresponding to the maximum value, so that the optimal grabbing position and grabbing angle in image space are obtained;
S44, mapping the obtained image position to 3-dimensional space, solving the mechanical arm control command according to inverse kinematics, rotating the end effector according to the grabbing angle after reaching the position right above the object, and determining the descending height of the mechanical arm according to the collected depth map to avoid collision.
Further, the step S42 specifically comprises: the image input into the fully convolutional neural network model is a global image of the whole working space; first, ResNet50 is used as an encoder to extract features, then a four-layer up-sampling module of bilinear interpolation and convolution is used, and finally a 5×5 convolution is used to obtain a grabbing function graph with the same scale as the input.
Compared with the prior art, the beneficial effects are:
1. The invention provides a corrective grabbing strategy based on the antipodal grasping rule; by utilizing this strategy, trial-and-error grabbing can be carried out on the simulation platform to obtain grabbing samples that conform to the rule. The samples acquired in this way clearly express the grabbing mode of the antipodal grasping rule, which is beneficial to the learning of the model. The whole data acquisition process needs neither manual intervention nor any real data, avoiding the problems that real data acquisition may bring.
2. Only a small amount of simulation data acquired by the method is needed, and the trained model can be directly applied to different real grabbing scenes. The whole training process needs neither domain adaptation nor domain randomization, and the accuracy and robustness are high.
3. A fully convolutional deep neural network is designed; the network takes as input an image containing the information of the whole working space and outputs a prediction of the grabbing function of each pixel point. This network structure of global input and pixel-level prediction can learn the corresponding grabbing modes faster and better.
Detailed Description
The drawings are for illustration purposes only and are not to be construed as limiting the invention; for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted. The positional relationships depicted in the drawings are for illustrative purposes only and are not to be construed as limiting the invention.
Example 1:
Defining the grabbing problem: the vertical planar grasp is defined as g = (p, ω, η), where p = (x, y, z) denotes the position of the grasp point in Cartesian coordinates, ω ∈ [0, 2π) denotes the rotation angle of the end effector, and η is a 3-dimensional one-hot code used to represent the grabbing function. The grabbing function is divided into three categories, i.e. graspable, non-graspable and background. When projected into image space, a grasp in image I may be represented as g̃ = (p̃, φ_i, η), where p̃ = (h, w) indicates the position of the grasp in the image and φ_i represents a discretized grabbing angle. Discretization can reduce the complexity of the learning process. Thus, each pixel in the image may define a grabbing function, so the entire grabbing function graph may be represented as the set {C_i, i = 1, …, N}, where C_i ∈ ℝ^{H×W×3} is the grabbing function graph of the image at the given i-th angle. In the graph, the 3 channels respectively represent the three categories of graspable, non-graspable and background. From each grabbing function graph C_i the first channel (i.e. the graspable function channel) is taken, and the channels are combined together to form G ∈ ℝ^{N×H×W}.
Thus, the most robust grabbing point can be obtained by solving the following equation:

i*, h*, w* = argmax_{i, h, w} G(i, h, w)

where G(i, h, w) represents the confidence of the graspable function at rotation-angle index i and image position (h, w). (h*, w*) is the position to be reached by the mechanical arm end effector in image space, and i* indicates the angle φ_{i*} by which the end effector rotates before grabbing is performed.
During the training process, a parameterized function f_θ is defined to realize the pixel-level mapping from the image to the grabbing function graph, which can be expressed as:

C_i = f_θ(I_i)

where I_i is the image I rotated by the angle φ_i, and C_i is the grabbing function graph corresponding to I_i.
f_θ may be implemented with a deep neural network. For example, learning is performed by gradient descent to obtain an expression of the function: data are input into the neural network to obtain a predicted output; the predicted output is compared with the real label to obtain an error; the error is back-propagated to obtain the gradient of each parameter in the neural network; and finally the parameters are updated with these gradients so that the output of the neural network moves closer to the real label. In this way, a specific expression of the function is obtained by learning.
In conjunction with the loss function, the overall training objective may be defined by the following equation:

θ* = argmin_θ Σ_i L(Y_i, f_θ(I_i))

wherein Y_i denotes the label graph corresponding to I_i.
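A compact PyTorch training-loop sketch of this objective; the network fcn, the data loader and the masked loss function are placeholders, and the optimizer choice and hyperparameters are illustrative assumptions rather than values given by the invention.

```python
import torch

def train(fcn, loader, masked_cross_entropy, epochs=50, lr=1e-4, device="cuda"):
    """Gradient-descent training of the parameterized mapping f_theta.

    fcn                  : fully convolutional network predicting C_i = f_theta(I_i)
    loader               : yields (image, label_graph, label_mask) batches of rotated samples
    masked_cross_entropy : loss L(Y, F) combined with the label mask M
    """
    fcn = fcn.to(device).train()
    optimizer = torch.optim.Adam(fcn.parameters(), lr=lr)

    for _ in range(epochs):
        for image, label, mask in loader:
            image, label, mask = image.to(device), label.to(device), mask.to(device)
            output = fcn(image)                           # predicted grabbing function graph
            loss = masked_cross_entropy(output, label, mask)
            optimizer.zero_grad()
            loss.backward()                               # back-propagate the error
            optimizer.step()                              # update parameters with the gradients
    return fcn
```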
Collecting simulation data: the invention defines the antipodal grasping rule for objects in image space. Consider a scene in which only one object is placed in the working space. c1 and c2 are defined as the contact points of the two fingers with the object, n1 and n2 are their corresponding surface normal vectors, and g is the grabbing direction of the gripper in image space, with c1, c2, n1, n2, g ∈ ℝ², as shown in Fig. 1. By the above definitions it is possible to obtain:

ω1 = arccos( (n1 · g) / (||n1|| ||g||) ),  ω2 = arccos( (n2 · g) / (||n2|| ||g||) )

where ||·|| represents the norm operation. The invention defines a grabbing operation as an antipodal grasp when it satisfies the following condition:

ω1 ≤ θ1 and ω2 ≥ θ2

where θ1 and θ2 are non-negative thresholds tending to 0 and π respectively, representing the thresholds of the angles between the grabbing direction and the surface normal vectors at the two contact points with the object, and ω1 and ω2 are the angles between the grabbing direction and the surface normal vectors at the two contact points. In general, when the gripper grabbing direction is parallel to the normal vectors of the contact points, the grasp is defined as a stable antipodal grasp.
In a practical implementation, the invention uses a corrective grabbing strategy to collect samples that satisfy the antipodal grasping rule. First, a grabbing angle and a pixel position containing the object are randomly selected, and the camera records the information of the whole working space. The mechanical arm is then controlled to perform trial-and-error grabbing; if the grabbing fails, the working-space image I, the grabbing pixel position p, the set C of all pixel positions occupied by the object in the image, the grabbing angle ψ and the label l are stored. If the grabbing succeeds, the position of the object has been changed by the contact between the gripper and the object, and this corrective change makes the grabbing direction of the gripper approximately parallel to the normal vectors of the contact points, thereby meeting the antipodal grasping rule; at this moment, the camera records the corrected image I′ again, the pixel position set C′ of the object is obtained again, and then the image, the pixel position of the grabbing point, the set of all pixel positions occupied by the object in the image, the grabbing angle and the label are stored.
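A pseudocode-style sketch of the corrective collection loop described above; the simulation interface sim and its methods are entirely hypothetical and only mirror the quantities I, I′, p, C, C′, ψ and l named in the text.

```python
import random
import numpy as np

def collect_sample(sim):
    """One trial-and-error grabbing attempt following the corrective strategy."""
    # Record the workspace and randomly pick a grabbing angle and an object pixel.
    image, object_pixels = sim.capture_workspace()        # I and C
    p = random.choice(list(object_pixels))                # grabbing pixel position
    psi = random.uniform(0.0, np.pi)                      # random grabbing angle

    success = sim.try_grasp(p, psi)                       # trial-and-error grabbing

    if not success:
        # Failure: directly store I, p, C, psi and the failure label.
        return dict(image=image, grasp_pixel=p, object_pixels=object_pixels,
                    angle=psi, label=0)

    # Success: the contact has corrected the object pose so that the grabbing
    # direction is roughly parallel to the contact normals (antipodal grasp);
    # re-record the corrected image I' and object pixel set C'.
    image_corr, object_pixels_corr = sim.capture_workspace()
    return dict(image=image_corr, grasp_pixel=p, object_pixels=object_pixels_corr,
                angle=psi, label=1)
```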
Defining a network structure:
the network structure is shown in fig. 2. The method adopts a full-volume machine neural network, inputs the global image containing the whole working space, firstly uses Resnet50 as an encoder to extract features, then uses an up-sampling module with four layers of bilinear interpolation and convolution, and optimally uses a 5x5 convolution to obtain a grabbing function graph with the input through scale.
Defining a loss function:
because most pixels in an image belong to the background class and the graspable and non-graspable labels are very sparse, training directly with such data can be very inefficient. The invention therefore proposes to calculate the loss function in combination with a label mask. For pixels belonging to the object but not subjected to trial-and-error capture, the value of the pixel at the position corresponding to the label mask is set as
For other pixels, the value of the position corresponding to the label mask is set as
Is provided with
The output characteristic diagram of the last convolutional layer is shown. The corresponding loss function is therefore:
indicating the label graph corresponding to the sample, H and W are the length and width of the label graph respectively, i, j and k are index subscripts of the position in the 3-channel image respectively, l is the index of the channel number,
an output characteristic diagram representing the last convolutional layer;
representing the real domain, the corresponding superscript represents the dimension of the tensor.
In order to reduce the influence of label sparsity, the invention increases the loss weights of the graspable and non-graspable classes and reduces the loss weight of the background: for positions carrying graspable or non-graspable labels, the mask value is multiplied by 120, while the background area is multiplied by 0.1.
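A PyTorch sketch of this masked, re-weighted cross-entropy, assuming the label graph Y is one-hot along the channel dimension and the mask M already carries the values 0 (object pixels never tried), 120 (graspable/non-graspable labels) and 0.1 (background); the tensor layout is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def masked_cross_entropy(output, label, mask):
    """Pixel-wise cross entropy between the network output and the label graph,
    weighted at every pixel by the label mask.

    output : (B, 3, H, W) raw feature map F of the last convolutional layer
    label  : (B, 3, H, W) one-hot label graph Y (graspable / non-graspable / background)
    mask   : (B, H, W)    label mask M (0 for untried object pixels, 120 for
                          graspable/non-graspable labels, 0.1 for background)
    """
    log_prob = F.log_softmax(output, dim=1)          # log softmax over the 3 channels
    pixel_ce = -(label * log_prob).sum(dim=1)        # (B, H, W) cross entropy per pixel
    h, w = pixel_ce.shape[-2:]
    return (mask * pixel_ce).sum() / (h * w)         # masked sum, averaged over H*W positions

# Usage with dummy tensors:
out = torch.randn(2, 3, 224, 224)
lbl = F.one_hot(torch.randint(0, 3, (2, 224, 224)), 3).permute(0, 3, 1, 2).float()
msk = torch.full((2, 224, 224), 0.1)
print(masked_cross_entropy(out, lbl, msk).item())
```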
The method comprises the following specific implementation steps:
step 1: in the simulation environment, an environment similar to a real scene is built.
Step 1.1: a background texture, a mechanical arm with a gripper, a camera and an object to be grabbed are placed in a working space of the simulation environment.
Step 1.2: placing the object in the working space, selecting a position where the object exists by using the camera, recording the image information, the pixel position corresponding to the grabbing point, the mask of the object in the image and the grabbing angle, and then randomly selecting an angle for the mechanical arm to perform trial-and-error grabbing.
Step 1.3: judging whether the grabbing is successful. If the grabbing fails, directly storing the image I, the position set C of the pixels where the object is located in the image, the pixel position p corresponding to the grabbing point, the grabbing angle ψ and the grabbing-failure label l. If the grabbing succeeds, recording again the global image I′ and the corresponding position set C′ of the pixels where the object is located in the image, and then storing the image I′, the position set C′, the pixel position p corresponding to the grabbing point, the grabbing angle ψ and the grabbing-success label l. The acquired global image is the global image defined in the grabbing problem in the Disclosure of Invention, and the grabbing angle and grabbing position are likewise defined in image space.
Step 2: the data is pre-processed.
Step 2.1: generating an object mask according to the position set of the pixels where the object is located in the image, generating a label mask according to the object mask, the grabbing pixel position and the grabbing label, and generating a label graph by using the grabbing position and the grabbing label. For the label mask, the weights of the graspable and non-graspable regions are increased, and the weight of the background is decreased.
Step 2.2: discretizing the grabbing angle according to the problem definition. In this step, the image is rotated to 16 discrete angles, and the corresponding label graph and label mask are rotated accordingly; because only horizontal grabbing is considered, only the data in which the grabbing direction is parallel to the horizontal direction after rotation are retained (see the sketch after Step 2.3).
Step 2.3: the preprocessed data include: a global image containing the information of the whole working space, an object mask, and a label graph with the same scale as the global image.
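A sketch of the discretization in Step 2.2: each sample is rotated to the 16 discrete angles and only the rotations that make the recorded grabbing angle ψ horizontal are kept. The use of scipy.ndimage.rotate and the sign convention of the rotation are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import rotate

NUM_ANGLES = 16
ANGLE_STEP = np.pi / NUM_ANGLES          # 16 discrete angles over [0, pi)

def discretize_sample(image, label, mask, psi):
    """Rotate a sample to the 16 discrete angles and keep only the rotations
    whose grabbing direction becomes (approximately) horizontal.

    image : (H, W, 3) global RGB image
    label : (H, W, 3) label graph
    mask  : (H, W)    label mask
    psi   : recorded grabbing angle in radians
    """
    kept = []
    for i in range(NUM_ANGLES):
        angle = i * ANGLE_STEP
        # After rotating by `angle`, the grabbing direction is roughly psi - angle
        # (sign convention assumed); keep the copy where this is a multiple of pi.
        residual = abs(((psi - angle) + np.pi / 2) % np.pi - np.pi / 2)
        if residual > ANGLE_STEP / 2:
            continue
        deg = np.degrees(angle)
        kept.append((i,
                     rotate(image, deg, axes=(0, 1), reshape=False, order=1),
                     rotate(label, deg, axes=(0, 1), reshape=False, order=0),
                     rotate(mask, deg, axes=(0, 1), reshape=False, order=0)))
    return kept
```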
Step 3: training the deep neural network.
Step 3.1: The input RGB images are normalized and then synthesized into a batch.
Step 3.2: The batch of data is transmitted into the fully convolutional neural network defined in the Disclosure of Invention to obtain an output value.
Step 3.3: The error between the predicted value and the label is calculated according to the cross-entropy error combined with the label mask, with the following calculation formula:

L(Y, F) = −(1/(H·W)) · Σ_{i=1..H} Σ_{j=1..W} Σ_{k=1..3} M_ij · Y_ijk · log( exp(F_ijk) / Σ_{l=1..3} exp(F_ijl) )

wherein Y is the label graph, M is the label mask, H and W are respectively the length and width of the label graph, i, j and k are index subscripts of a position in the 3-channel image, l is the index over the channels, F ∈ ℝ^{H×W×3} is the output feature map of the last convolutional layer, and ℝ represents the real number domain, the corresponding superscript representing the dimension of the tensor.
Step 4: applying the trained model to a real grabbing environment.
Step 4.1: An RGB (red, green and blue) map and a depth map of the working space are acquired by using the camera.
Step 4.2: The RGB image is normalized and rotated to 16 angles, and the rotated images are transmitted into the fully convolutional neural network model to obtain 16 grabbing function graphs.
Step 4.3: According to the definition of the grabbing problem, the first channel of each function graph is taken and combined, and the position corresponding to the maximum value is obtained, so that the optimal grabbing position and grabbing angle in image space are obtained.
Step 4.4: The obtained image position is mapped to 3-dimensional space, the mechanical arm control command is solved according to inverse kinematics, the end effector is rotated according to the grabbing angle after reaching the position right above the object, and the descending height of the mechanical arm is determined according to the acquired depth map to avoid collision.
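Steps 4.2 and 4.3 can be strung together roughly as follows; the normalization constants, the use of torchvision for rotation, and the angle convention are illustrative assumptions, and the mapping of (h*, w*) to 3-D in Step 4.4 (depth map plus camera intrinsics and inverse kinematics) is only indicated in a comment.

```python
import numpy as np
import torch
import torch.nn.functional as F
from torchvision.transforms import functional as TF

NUM_ANGLES = 16
MEAN = np.array([0.485, 0.456, 0.406])   # illustrative ImageNet-style normalization
STD = np.array([0.229, 0.224, 0.225])

def predict_grasp(model, rgb):
    """Steps 4.2-4.3: normalize the (H, W, 3) uint8 RGB image, rotate it to 16 angles,
    run the FCN, stack the graspable channels and locate the maximum-confidence grasp."""
    model.eval()
    x = (rgb / 255.0 - MEAN) / STD
    x = torch.from_numpy(x).permute(2, 0, 1).float().unsqueeze(0)    # (1, 3, H, W)

    graspable = []
    with torch.no_grad():
        for i in range(NUM_ANGLES):
            angle = 180.0 * i / NUM_ANGLES
            rotated = TF.rotate(x, angle)                            # rotate the input image
            scores = F.softmax(model(rotated), dim=1)[:, :1]         # graspable channel
            graspable.append(TF.rotate(scores, -angle)[0, 0])        # rotate back to the input frame

    G = torch.stack(graspable)                                       # (N, H, W)
    i_star, h_star, w_star = np.unravel_index(torch.argmax(G).item(), tuple(G.shape))
    psi = np.pi * i_star / NUM_ANGLES                                # grabbing angle
    # Step 4.4 (not shown): back-project (h*, w*) to 3-D with the depth map and camera
    # intrinsics, solve inverse kinematics, rotate the end effector by psi and descend.
    return int(h_star), int(w_star), psi
```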
It should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention, and are not intended to limit the embodiments of the present invention. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to exhaust all embodiments here. Any modification, equivalent replacement and improvement made within the spirit and principle of the present invention shall be included in the protection scope of the claims of the present invention.