Disclosure of Invention
The invention provides a vision-based mechanical arm autonomous grabbing method that overcomes at least one defect of the prior art; the grabbing data collected by the method in a simulation platform is favorable for model learning.
In order to solve the above technical problems, the invention adopts the following technical scheme: a vision-based mechanical arm autonomous grabbing method, comprising the following steps:
S1, in a simulation environment, building an environment similar to the real scene, and collecting a global image;
S2, preprocessing the data, wherein the preprocessed data comprise: a global image containing the information of the whole working space, an object mask, and a label graph with the same scale as the global image; the preprocessing comprises: firstly, generating an object mask according to the position set of the pixels where the object is located in the image, then generating a label mask according to the object mask, the grabbing pixel position and the grabbing label, and generating a label graph by using the grabbing position and the grabbing label; and then discretizing the grabbing angle according to the grabbing problem definition;
S3, training a deep neural network:
(1) normalizing the input RGB images, and then synthesizing a batch;
(2) transmitting the batch of data into a fully convolutional neural network to obtain an output value;
(3) calculating the error between the predicted value and the label according to the cross-entropy error combined with the label mask, using the following loss function:

L(Y, F) = −(1/(H·W)) · Σ_{i=1..H} Σ_{j=1..W} Σ_{k=1..3} M_ij · Y_ijk · log( exp(F_ijk) / Σ_{l=1..3} exp(F_ijl) )

wherein Y is the label graph, M is the label mask, H and W are respectively the length and width of the label graph, i, j and k are index subscripts of a position in the 3-channel image, l is the index over the channels, F ∈ ℝ^{H×W×3} is the output feature map of the last convolutional layer, and ℝ represents the real number domain, the corresponding superscript representing the dimension of the tensor;
and S4, applying the trained model to a real grabbing environment.
The invention provides a vision-based mechanical arm autonomous grabbing method which, by acquiring a small amount of grabbing data in a simulation environment, trains an end-to-end deep neural network capable of pixel-level grabbing prediction; the learned model can be directly applied to a real grabbing scene. The whole process requires neither domain adaptation nor domain randomization, and does not need any data collected in a real environment.
Further, the step S1 specifically includes:
S11, placing a background texture, a mechanical arm with a gripper, a camera and an object to be grabbed in the working space of the simulation environment;
S12, placing the object in the working space, selecting a position where the object exists by using the camera, recording the image information, the pixel position corresponding to the grabbing point, the mask of the object in the image and the grabbing angle, and then randomly selecting an angle for the mechanical arm to perform trial-and-error grabbing;
S13, judging whether the grabbing is successful; if the grabbing fails, directly storing the image I, the position set C of the pixels where the object is located in the image, the pixel position p corresponding to the grabbing point, the grabbing angle ψ and the grabbing-failure label l; if the grabbing succeeds, recording again the global image I′ and the corresponding position set C′ of the pixels where the object is located in the image, and then storing the image I′, the position set C′, the pixel position p corresponding to the grabbing point, the grabbing angle ψ and the grabbing-success label l.
Further, the definition of the grabbing problem comprises: defining the vertical planar grasp as g = (p, ω, η), where p = (x, y, z) denotes the position of the grasp point in Cartesian coordinates, ω ∈ [0, 2π) denotes the rotation angle of the end effector, and η is a 3-dimensional one-hot code used for representing the grabbing function; the grabbing function is divided into three categories, namely graspable, non-graspable and background; when projected into image space, a grasp in image I may be represented as g̃ = (p̃, φ_i, η), wherein p̃ = (h, w) indicates the position of the grasp in the image and φ_i represents a discretized grabbing angle; each pixel in the image may define a grabbing function, so the entire grabbing function graph may be represented as the set {C_i, i = 1, …, N}, wherein C_i ∈ ℝ^{H×W×3} is the grabbing function graph of the image at the given i-th angle; in the graph, the 3 channels respectively represent the three categories of graspable, non-graspable and background; from each grabbing function graph C_i the first channel is taken, and the channels are combined together to form G ∈ ℝ^{N×H×W}, wherein ℝ represents the real number domain and the corresponding superscript represents the dimension of the tensor.
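For illustration only, the definitions above can be written down as a small data structure; the names below (Grasp, ImageGrasp, the category constants) are hypothetical and merely make the notation g = (p, ω, η) and its image-space projection concrete.

```python
from dataclasses import dataclass
from typing import Tuple
import numpy as np

# The three grabbing-function categories encoded by the 3-dimensional one-hot code eta.
GRASPABLE, NON_GRASPABLE, BACKGROUND = 0, 1, 2

def one_hot(category: int, num_classes: int = 3) -> np.ndarray:
    """Return the 3-dimensional one-hot code eta for a grabbing-function category."""
    eta = np.zeros(num_classes, dtype=np.float32)
    eta[category] = 1.0
    return eta

@dataclass
class Grasp:
    """Vertical planar grasp g = (p, omega, eta) in Cartesian space."""
    p: Tuple[float, float, float]   # grasp point (x, y, z)
    omega: float                    # end-effector rotation angle in [0, 2*pi)
    eta: np.ndarray                 # 3-dimensional one-hot grabbing-function code

@dataclass
class ImageGrasp:
    """Grasp projected into image space: pixel position (h, w) and discretized angle index i."""
    h: int
    w: int
    angle_index: int                # index i of the discretized angle phi_i
    eta: np.ndarray

# Example: a graspable point at pixel (120, 96) under the 3rd discretized angle.
g_img = ImageGrasp(h=120, w=96, angle_index=3, eta=one_hot(GRASPABLE))
```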
Further, the most robust grabbing point is obtained by solving the following formula:

i*, h*, w* = argmax_{i, h, w} G(i, h, w)

where G(i, h, w) represents the confidence of the graspable function at rotation-angle index i and image position (h, w); (h*, w*) is the position to be reached by the mechanical arm end effector in image space, and i* indicates the angle φ_{i*} by which the end effector rotates before grabbing is performed.
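A minimal NumPy sketch of this argmax, assuming the graspable channels have already been stacked into a tensor G of shape (N, H, W), one slice per discretized angle; the variable names are illustrative only.

```python
import numpy as np

def most_robust_grasp(G: np.ndarray):
    """Return (i*, h*, w*) maximizing the graspable confidence G(i, h, w).

    G: array of shape (N, H, W) holding the graspable channel of each
       grabbing function graph C_i, one slice per discretized angle.
    """
    flat_index = np.argmax(G)                        # index of the global maximum
    i_star, h_star, w_star = np.unravel_index(flat_index, G.shape)
    return int(i_star), int(h_star), int(w_star)

# Usage with a dummy tensor of 16 angles on a 224x224 image:
G = np.random.rand(16, 224, 224).astype(np.float32)
i_star, h_star, w_star = most_robust_grasp(G)
phi_star = i_star * np.pi / 16                       # e.g. 16 angles spanning [0, pi)
```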
Further, during the training process, a parameterized function f_θ is defined to realize the pixel-level mapping from the image to the grabbing function graph, which can be expressed as:

C_i = f_θ(I_i)

where I_i is the image I rotated by the angle φ_i and C_i is the grabbing function graph corresponding to I_i; f_θ is implemented with a deep neural network; in conjunction with the loss function, the overall training objective may be defined by the following equation:

θ* = argmin_θ Σ_i L(Y_i, f_θ(I_i))

wherein Y_i denotes the label graph corresponding to I_i.
Further, considering a scene in which only one object is placed in the working space, c1 and c2 are defined as the contact points of the two fingers of the gripper with the object, n1 and n2 are their corresponding surface normal vectors, and g is defined as the grabbing direction of the gripper in image space, where c1, c2, n1, n2, g ∈ ℝ². By the above definitions it is possible to obtain:

ω1 = arccos( (n1 · g) / (||n1|| ||g||) ),  ω2 = arccos( (n2 · g) / (||n2|| ||g||) )

wherein ||·|| represents the norm operation; a grabbing operation is defined as an antipodal grasp when it satisfies the following condition:

ω1 ≤ θ1 and ω2 ≥ θ2

wherein θ1 and θ2 are non-negative thresholds tending to 0 and π respectively, representing the thresholds of the angles between the grabbing direction and the surface normal vectors at the two contact points with the object, and ω1 and ω2 are the angles between the grabbing direction and the surface normal vectors at the two contact points; when the gripper grabbing direction is parallel to the normal vectors of the contact points, the grasp is defined as a stable antipodal grasp.
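The condition above can be checked numerically as in the following sketch; the threshold values and the 2-D example vectors are illustrative assumptions, not values fixed by the invention.

```python
import numpy as np

def is_antipodal(g, n1, n2, theta1=np.deg2rad(15), theta2=np.deg2rad(165)):
    """Check the antipodal-grasp condition in image space.

    g      : 2-D grabbing direction of the gripper.
    n1, n2 : surface normal vectors at the two finger contact points c1 and c2.
    theta1, theta2 : angle thresholds close to 0 and pi respectively.
    """
    def angle(u, v):
        # omega = arccos( (u . v) / (||u|| ||v||) )
        cosine = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
        return np.arccos(np.clip(cosine, -1.0, 1.0))

    omega1 = angle(n1, g)
    omega2 = angle(n2, g)
    return omega1 <= theta1 and omega2 >= theta2

# Example: grabbing direction aligned with n1 and opposite to n2 -> antipodal grasp.
print(is_antipodal(g=[1.0, 0.0], n1=[1.0, 0.0], n2=[-1.0, 0.0]))  # True
```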
All data are collected in the simulation platform without any real data, which avoids the problems that may arise from collecting data in a real environment. In the simulation platform, the antipodal grasping rule is applied, so that the acquired data effectively reflect the corresponding grabbing mode, and only a very small amount of grabbing data is needed for the trained model to be directly applied to a real grabbing scene. The invention realizes end-to-end pixel-level grabbing function prediction with a fully convolutional neural network. Each output pixel can capture the global information of the input image, which enables the model to learn more efficiently and make more accurate predictions.
Further, the step S4 includes:
S41, acquiring an RGB (red, green and blue) image and a depth image of the working space by using a camera;
S42, normalizing the RGB image, rotating it to 16 angles, and transmitting the rotated images into the model to obtain 16 grabbing function graphs;
S43, according to the definition of the grabbing problem, taking the first channel of each function graph and combining them, then obtaining the position corresponding to the maximum value, so that the optimal grabbing position and grabbing angle in image space are obtained;
S44, mapping the obtained image position to 3-dimensional space, solving the mechanical arm control command according to inverse kinematics, rotating the end effector according to the grabbing angle after reaching the position right above the object, and determining the descending height of the mechanical arm according to the collected depth map to avoid collision.
Further, the step S42 specifically comprises: the image input into the fully convolutional neural network model is a global image of the whole working space; first, ResNet50 is used as an encoder to extract features, then a four-layer up-sampling module of bilinear interpolation and convolution is used, and finally a 5×5 convolution is used to obtain a grabbing function graph with the same scale as the input.
Compared with the prior art, the beneficial effects are:
1. The invention provides a corrective grabbing strategy based on the antipodal grasping rule; by utilizing this strategy, trial-and-error grabbing can be carried out on the simulation platform to obtain grabbing samples that conform to the rule. The samples acquired in this way clearly express the grabbing mode of the antipodal grasping rule, which is beneficial to the learning of the model. The whole data acquisition process needs neither manual intervention nor any real data, avoiding the problems that real data acquisition may bring.
2. Only a small amount of simulation data acquired by the method is needed, and the trained model can be directly applied to different real grabbing scenes. The whole training process needs neither domain adaptation nor domain randomization, and the accuracy and robustness are high.
3. A fully convolutional deep neural network is designed; the network takes as input an image containing the information of the whole working space and outputs a prediction of the grabbing function of each pixel point. This network structure of global input and pixel-level prediction can learn the corresponding grabbing modes faster and better.
Detailed Description
The drawings are for illustration purposes only and are not to be construed as limiting the invention; for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted. The positional relationships depicted in the drawings are for illustrative purposes only and are not to be construed as limiting the invention.
Example 1:
Defining the grabbing problem: the vertical planar grasp is defined as g = (p, ω, η), where p = (x, y, z) denotes the position of the grasp point in Cartesian coordinates, ω ∈ [0, 2π) denotes the rotation angle of the end effector, and η is a 3-dimensional one-hot code used to represent the grabbing function. The grabbing function is divided into three categories, i.e. graspable, non-graspable and background. When projected into image space, a grasp in image I may be represented as g̃ = (p̃, φ_i, η), where p̃ = (h, w) indicates the position of the grasp in the image and φ_i represents a discretized grabbing angle. Discretization can reduce the complexity of the learning process. Thus, each pixel in the image may define a grabbing function, so the entire grabbing function graph may be represented as the set {C_i, i = 1, …, N}, where C_i ∈ ℝ^{H×W×3} is the grabbing function graph of the image at the given i-th angle. In the graph, the 3 channels respectively represent the three categories of graspable, non-graspable and background. From each grabbing function graph C_i the first channel (i.e. the graspable function channel) is taken, and the channels are combined together to form G ∈ ℝ^{N×H×W}.
Thus, the most robust grabbing point can be obtained by solving the following equation:

i*, h*, w* = argmax_{i, h, w} G(i, h, w)

where G(i, h, w) represents the confidence of the graspable function at rotation-angle index i and image position (h, w). (h*, w*) is the position to be reached by the mechanical arm end effector in image space, and i* indicates the angle φ_{i*} by which the end effector rotates before grabbing is performed.
During the training process, a parameterized function f_θ is defined to realize the pixel-level mapping from the image to the grabbing function graph, which can be expressed as:

C_i = f_θ(I_i)

where I_i is the image I rotated by the angle φ_i, and C_i is the grabbing function graph corresponding to I_i.
f_θ may be implemented with a deep neural network. For example, learning is performed by gradient descent to obtain an expression of the function: data are input into the neural network to obtain a predicted output; the predicted output is compared with the real label to obtain an error; the error is back-propagated to obtain the gradient of each parameter in the neural network; and finally the parameters are updated with these gradients so that the output of the neural network moves closer to the real label. In this way, a specific expression of the function is obtained by learning.
In conjunction with the loss function, the overall training objective may be defined by the following equation:

θ* = argmin_θ Σ_i L(Y_i, f_θ(I_i))

wherein Y_i denotes the label graph corresponding to I_i.
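A compact PyTorch training-loop sketch of this objective; the network fcn, the data loader and the masked loss function are placeholders, and the optimizer choice and hyperparameters are illustrative assumptions rather than values given by the invention.

```python
import torch

def train(fcn, loader, masked_cross_entropy, epochs=50, lr=1e-4, device="cuda"):
    """Gradient-descent training of the parameterized mapping f_theta.

    fcn                  : fully convolutional network predicting C_i = f_theta(I_i)
    loader               : yields (image, label_graph, label_mask) batches of rotated samples
    masked_cross_entropy : loss L(Y, F) combined with the label mask M
    """
    fcn = fcn.to(device).train()
    optimizer = torch.optim.Adam(fcn.parameters(), lr=lr)

    for _ in range(epochs):
        for image, label, mask in loader:
            image, label, mask = image.to(device), label.to(device), mask.to(device)
            output = fcn(image)                           # predicted grabbing function graph
            loss = masked_cross_entropy(output, label, mask)
            optimizer.zero_grad()
            loss.backward()                               # back-propagate the error
            optimizer.step()                              # update parameters with the gradients
    return fcn
```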
Collecting simulation data: the invention defines the antipodal grasping rule for objects in image space. Consider a scene in which only one object is placed in the working space. c1 and c2 are defined as the contact points of the two fingers with the object, n1 and n2 are their corresponding surface normal vectors, and g is the grabbing direction of the gripper in image space, with c1, c2, n1, n2, g ∈ ℝ², as shown in Fig. 1. By the above definitions it is possible to obtain:

ω1 = arccos( (n1 · g) / (||n1|| ||g||) ),  ω2 = arccos( (n2 · g) / (||n2|| ||g||) )

where ||·|| represents the norm operation. The invention defines a grabbing operation as an antipodal grasp when it satisfies the following condition:

ω1 ≤ θ1 and ω2 ≥ θ2

where θ1 and θ2 are non-negative thresholds tending to 0 and π respectively, representing the thresholds of the angles between the grabbing direction and the surface normal vectors at the two contact points with the object, and ω1 and ω2 are the angles between the grabbing direction and the surface normal vectors at the two contact points. In general, when the gripper grabbing direction is parallel to the normal vectors of the contact points, the grasp is defined as a stable antipodal grasp.
In a practical implementation, the invention uses a corrective grabbing strategy to collect samples that satisfy the antipodal grasping rule. First, a grabbing angle and a pixel position containing the object are randomly selected, and the camera records the information of the whole working space. The mechanical arm is then controlled to perform trial-and-error grabbing; if the grabbing fails, the working-space image I, the grabbing pixel position p, the set C of all pixel positions occupied by the object in the image, the grabbing angle ψ and the label l are stored. If the grabbing succeeds, the position of the object has been changed by the contact between the gripper and the object, and this corrective change makes the grabbing direction of the gripper approximately parallel to the normal vectors of the contact points, thereby meeting the antipodal grasping rule; at this moment, the camera records the corrected image I′ again, the pixel position set C′ of the object is obtained again, and then the image, the pixel position of the grabbing point, the set of all pixel positions occupied by the object in the image, the grabbing angle and the label are stored.
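A pseudocode-style sketch of the corrective collection loop described above; the simulation interface sim and its methods are entirely hypothetical and only mirror the quantities I, I′, p, C, C′, ψ and l named in the text.

```python
import random
import numpy as np

def collect_sample(sim):
    """One trial-and-error grabbing attempt following the corrective strategy."""
    # Record the workspace and randomly pick a grabbing angle and an object pixel.
    image, object_pixels = sim.capture_workspace()        # I and C
    p = random.choice(list(object_pixels))                # grabbing pixel position
    psi = random.uniform(0.0, np.pi)                      # random grabbing angle

    success = sim.try_grasp(p, psi)                       # trial-and-error grabbing

    if not success:
        # Failure: directly store I, p, C, psi and the failure label.
        return dict(image=image, grasp_pixel=p, object_pixels=object_pixels,
                    angle=psi, label=0)

    # Success: the contact has corrected the object pose so that the grabbing
    # direction is roughly parallel to the contact normals (antipodal grasp);
    # re-record the corrected image I' and object pixel set C'.
    image_corr, object_pixels_corr = sim.capture_workspace()
    return dict(image=image_corr, grasp_pixel=p, object_pixels=object_pixels_corr,
                angle=psi, label=1)
```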
Defining a network structure:
the network structure is shown in fig. 2. The method adopts a full-volume machine neural network, inputs the global image containing the whole working space, firstly uses Resnet50 as an encoder to extract features, then uses an up-sampling module with four layers of bilinear interpolation and convolution, and optimally uses a 5x5 convolution to obtain a grabbing function graph with the input through scale.
Defining a loss function:
because most pixels in an image belong to the background class and the graspable and non-graspable labels are very sparse, training directly with such data can be very inefficient. The invention therefore proposes to calculate the loss function in combination with a label mask. For pixels belonging to the object but not subjected to trial-and-error capture, the value of the pixel at the position corresponding to the label mask is set as
For other pixels, the value of the position corresponding to the label mask is set as
Is provided with
The output characteristic diagram of the last convolutional layer is shown. The corresponding loss function is therefore:
indicating the label graph corresponding to the sample, H and W are the length and width of the label graph respectively, i, j and k are index subscripts of the position in the 3-channel image respectively, l is the index of the channel number,
an output characteristic diagram representing the last convolutional layer;
representing the real domain, the corresponding superscript represents the dimension of the tensor.
In order to reduce the influence of label sparsity, the invention increases the loss weights of the graspable and non-graspable classes and reduces the loss weight of the background: for positions carrying graspable or non-graspable labels, the mask value is multiplied by 120, while the background area is multiplied by 0.1.
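A PyTorch sketch of this masked, re-weighted cross-entropy, assuming the label graph Y is one-hot along the channel dimension and the mask M already carries the values 0 (object pixels never tried), 120 (graspable/non-graspable labels) and 0.1 (background); the tensor layout is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def masked_cross_entropy(output, label, mask):
    """Pixel-wise cross entropy between the network output and the label graph,
    weighted at every pixel by the label mask.

    output : (B, 3, H, W) raw feature map F of the last convolutional layer
    label  : (B, 3, H, W) one-hot label graph Y (graspable / non-graspable / background)
    mask   : (B, H, W)    label mask M (0 for untried object pixels, 120 for
                          graspable/non-graspable labels, 0.1 for background)
    """
    log_prob = F.log_softmax(output, dim=1)          # log softmax over the 3 channels
    pixel_ce = -(label * log_prob).sum(dim=1)        # (B, H, W) cross entropy per pixel
    h, w = pixel_ce.shape[-2:]
    return (mask * pixel_ce).sum() / (h * w)         # masked sum, averaged over H*W positions

# Usage with dummy tensors:
out = torch.randn(2, 3, 224, 224)
lbl = F.one_hot(torch.randint(0, 3, (2, 224, 224)), 3).permute(0, 3, 1, 2).float()
msk = torch.full((2, 224, 224), 0.1)
print(masked_cross_entropy(out, lbl, msk).item())
```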
The method comprises the following specific implementation steps:
step 1: in the simulation environment, an environment similar to a real scene is built.
Step 1.1: a background texture, a mechanical arm with a gripper, a camera and an object to be grabbed are placed in a working space of the simulation environment.
Step 1.2: placing the object in the working space, selecting a position where the object exists by using the camera, recording the image information, the pixel position corresponding to the grabbing point, the mask of the object in the image and the grabbing angle, and then randomly selecting an angle for the mechanical arm to perform trial-and-error grabbing.
Step 1.3: judging whether the grabbing is successful. If the grabbing fails, directly storing the image I, the position set C of the pixels where the object is located in the image, the pixel position p corresponding to the grabbing point, the grabbing angle ψ and the grabbing-failure label l. If the grabbing succeeds, recording again the global image I′ and the corresponding position set C′ of the pixels where the object is located in the image, and then storing the image I′, the position set C′, the pixel position p corresponding to the grabbing point, the grabbing angle ψ and the grabbing-success label l. The acquired global image is the global image defined in the grabbing problem in the Disclosure of Invention, and the grabbing angle and grabbing position are likewise defined in image space.
Step 2: the data is pre-processed.
Step 2.1: generating an object mask according to the position set of the pixels where the object is located in the image, generating a label mask according to the object mask, the grabbing pixel position and the grabbing label, and generating a label graph by using the grabbing position and the grabbing label. For the label mask, the weights of the graspable and non-graspable regions are increased, and the weight of the background is decreased.
Step 2.2: discretizing the grabbing angle according to the problem definition. In this step, the image is rotated to 16 discrete angles, and the corresponding label graph and label mask are rotated accordingly; because only horizontal grabbing is considered, only the data in which the grabbing direction is parallel to the horizontal direction after rotation are retained (see the sketch after Step 2.3).
Step 2.3: the preprocessed data include: a global image containing the information of the whole working space, an object mask, and a label graph with the same scale as the global image.
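A sketch of the discretization in Step 2.2: each sample is rotated to the 16 discrete angles and only the rotations that make the recorded grabbing angle ψ horizontal are kept. The use of scipy.ndimage.rotate and the sign convention of the rotation are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import rotate

NUM_ANGLES = 16
ANGLE_STEP = np.pi / NUM_ANGLES          # 16 discrete angles over [0, pi)

def discretize_sample(image, label, mask, psi):
    """Rotate a sample to the 16 discrete angles and keep only the rotations
    whose grabbing direction becomes (approximately) horizontal.

    image : (H, W, 3) global RGB image
    label : (H, W, 3) label graph
    mask  : (H, W)    label mask
    psi   : recorded grabbing angle in radians
    """
    kept = []
    for i in range(NUM_ANGLES):
        angle = i * ANGLE_STEP
        # After rotating by `angle`, the grabbing direction is roughly psi - angle
        # (sign convention assumed); keep the copy where this is a multiple of pi.
        residual = abs(((psi - angle) + np.pi / 2) % np.pi - np.pi / 2)
        if residual > ANGLE_STEP / 2:
            continue
        deg = np.degrees(angle)
        kept.append((i,
                     rotate(image, deg, axes=(0, 1), reshape=False, order=1),
                     rotate(label, deg, axes=(0, 1), reshape=False, order=0),
                     rotate(mask, deg, axes=(0, 1), reshape=False, order=0)))
    return kept
```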
Step 3: training the deep neural network.
Step 3.1: The input RGB images are normalized and then synthesized into a batch.
Step 3.2: The batch of data is transmitted into the fully convolutional neural network defined in the Disclosure of Invention to obtain an output value.
Step 3.3: The error between the predicted value and the label is calculated according to the cross-entropy error combined with the label mask, with the following calculation formula:

L(Y, F) = −(1/(H·W)) · Σ_{i=1..H} Σ_{j=1..W} Σ_{k=1..3} M_ij · Y_ijk · log( exp(F_ijk) / Σ_{l=1..3} exp(F_ijl) )

wherein Y is the label graph, M is the label mask, H and W are respectively the length and width of the label graph, i, j and k are index subscripts of a position in the 3-channel image, l is the index over the channels, F ∈ ℝ^{H×W×3} is the output feature map of the last convolutional layer, and ℝ represents the real number domain, the corresponding superscript representing the dimension of the tensor.
Step 4: applying the trained model to a real grabbing environment.
Step 4.1: An RGB (red, green and blue) map and a depth map of the working space are acquired by using the camera.
Step 4.2: The RGB image is normalized and rotated to 16 angles, and the rotated images are transmitted into the fully convolutional neural network model to obtain 16 grabbing function graphs.
Step 4.3: According to the definition of the grabbing problem, the first channel of each function graph is taken and combined, and the position corresponding to the maximum value is obtained, so that the optimal grabbing position and grabbing angle in image space are obtained.
Step 4.4: The obtained image position is mapped to 3-dimensional space, the mechanical arm control command is solved according to inverse kinematics, the end effector is rotated according to the grabbing angle after reaching the position right above the object, and the descending height of the mechanical arm is determined according to the acquired depth map to avoid collision.
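Steps 4.2 and 4.3 can be strung together roughly as follows; the normalization constants, the use of torchvision for rotation, and the angle convention are illustrative assumptions, and the mapping of (h*, w*) to 3-D in Step 4.4 (depth map plus camera intrinsics and inverse kinematics) is only indicated in a comment.

```python
import numpy as np
import torch
import torch.nn.functional as F
from torchvision.transforms import functional as TF

NUM_ANGLES = 16
MEAN = np.array([0.485, 0.456, 0.406])   # illustrative ImageNet-style normalization
STD = np.array([0.229, 0.224, 0.225])

def predict_grasp(model, rgb):
    """Steps 4.2-4.3: normalize the (H, W, 3) uint8 RGB image, rotate it to 16 angles,
    run the FCN, stack the graspable channels and locate the maximum-confidence grasp."""
    model.eval()
    x = (rgb / 255.0 - MEAN) / STD
    x = torch.from_numpy(x).permute(2, 0, 1).float().unsqueeze(0)    # (1, 3, H, W)

    graspable = []
    with torch.no_grad():
        for i in range(NUM_ANGLES):
            angle = 180.0 * i / NUM_ANGLES
            rotated = TF.rotate(x, angle)                            # rotate the input image
            scores = F.softmax(model(rotated), dim=1)[:, :1]         # graspable channel
            graspable.append(TF.rotate(scores, -angle)[0, 0])        # rotate back to the input frame

    G = torch.stack(graspable)                                       # (N, H, W)
    i_star, h_star, w_star = np.unravel_index(torch.argmax(G).item(), tuple(G.shape))
    psi = np.pi * i_star / NUM_ANGLES                                # grabbing angle
    # Step 4.4 (not shown): back-project (h*, w*) to 3-D with the depth map and camera
    # intrinsics, solve inverse kinematics, rotate the end effector by psi and descend.
    return int(h_star), int(w_star), psi
```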
It should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention, and are not intended to limit the embodiments of the present invention. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to exhaust all embodiments here. Any modification, equivalent replacement and improvement made within the spirit and principle of the present invention shall be included in the protection scope of the claims of the present invention.