CN114004971A - A 3D Object Detection Method Based on Monocular Image and Prior Information - Google Patents
A 3D Object Detection Method Based on Monocular Image and Prior Information

Info
- Publication number
- CN114004971A (application CN202111359773.5A)
- Authority
- CN
- China
- Prior art keywords
- target
- detected
- information
- network
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/70—Determining position or orientation of objects or cameras
- G06T7/73—Determining position or orientation of objects or cameras using feature-based methods
- G06T7/74—Determining position or orientation of objects or cameras using feature-based methods involving reference images or patches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/70—Determining position or orientation of objects or cameras
- G06T7/73—Determining position or orientation of objects or cameras using feature-based methods
- G06T7/75—Determining position or orientation of objects or cameras using feature-based methods involving models
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Biomedical Technology (AREA)
- Health & Medical Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Image Analysis (AREA)
Abstract
The embodiment of the invention discloses a 3D target detection method based on a monocular image and prior information, belonging to the technical field of computer image processing and comprising the following steps. S1: annotating a dataset of the targets to be detected according to the requirements of the operation task; S2: performing model training with the annotated dataset to generate a 3D target detection model; S3: inputting a monocular image into the trained 3D target detection model to perform 3D target detection, predicting the position and attitude information of the 3D target, and labeling the category and 3D bounding box of the target object. Because the method first uses deep learning to identify the target object in a cluttered environment before estimating its attitude, interference from a complex environment on attitude estimation is effectively reduced and the accuracy of attitude estimation is improved. In addition, detection and attitude estimation of all targets are completed by a single network while the targets are detected, which effectively reduces the computational burden on the computer and improves the real-time performance of the algorithm.
Description
Technical Field
The embodiment of the invention relates to the technical field of computer image processing, in particular to a 3D target detection method based on monocular images and prior information.
Background
With the continuous development of science and technology, artificial intelligence has made great progress and is being applied to many aspects of industrial production, social life, and scientific research; intelligent mobile robots in particular are now widely used. With artificial intelligence, robots can fully or partially take over heavy and repetitive labor tasks, freeing humans from dangerous, low-value physical labor so that they can devote their intelligence to creating and exploring more meaningful and valuable unknown fields. At the same time, intelligent robots can complete tasks that humans cannot, such as space exploration, deep-sea exploration, and exploration of nuclear environments. When replacing humans in such tasks, an intelligent robot must manipulate the work target as precisely as a human would, to avoid danger or damage to the target. Accurately knowing the relative position and attitude of the target before operating on it is therefore a prerequisite for completing the task correctly. How to accurately estimate the pose and position of a target object has consequently long been an important research topic in the field of intelligent robots, yet conventional research methods are generally limited to pose estimation and motion planning for planar, regular objects. In recent years, continuous breakthroughs in deep learning have brought great achievements in target detection and have also driven progress in target attitude estimation. Extracting features and judging the category of the target object with deep learning can effectively improve the accuracy of target attitude estimation and makes it possible to intelligently identify the precise pose of a target object; this is particularly important for correctly estimating the pose of a target that is occluded during grasping.
Target attitude estimation as currently used in industrial production is generally aimed at simple objects moving in a common plane or at stationary objects. In production practice, however, most target objects that a robot needs to manipulate are complex, have cluttered backgrounds, or are irregularly placed, which greatly increases the difficulty of accurately estimating and manipulating their attitude. Existing common target attitude estimation methods are suitable only for a single object or a few classes of objects; the number and variety of applicable objects are small and cannot meet the requirements of most objects. They also handle only isolated objects or objects that are partially occluded, whereas in practice occlusion between objects is one of the most common phenomena, and the problem of mutual occlusion between target objects remains unsolved. Moreover, target attitude estimation usually relies on joint reasoning over multi-source information such as images, lidar, and depth sensors, which imposes a heavy load on the robotic arm and a high cost.
Therefore, how to provide a novel 3D target detection method that can effectively reduce the interference of a complex environment on attitude estimation and improve its accuracy is a technical problem to be urgently solved by those skilled in the art.
Disclosure of Invention
Therefore, the embodiment of the invention provides a 3D target detection method based on monocular images and prior information, so as to solve the problem that the robot cannot accurately estimate the posture of a target object due to the complexity of a detection environment in the prior art.
In order to achieve the above object, the embodiments of the present invention provide the following technical solutions:
A 3D target detection method based on monocular images and prior information comprises the following steps:
S1: annotating a dataset of the targets to be detected according to the requirements of the operation task;
S2: performing model training with the annotated dataset to generate a 3D target detection model;
S3: inputting a monocular image into the trained 3D target detection model to perform 3D target detection, predicting the position and attitude information of the 3D target, and labeling the category and 3D bounding box of the target object.
Further, the step S3 specifically includes:
S301: acquiring monocular image information through a camera and inputting it into the 3D target detection model;
S302: extracting feature information about the target to be detected from the monocular image through the ResNet-34 feature extraction network of the 3D target detection model, and inputting the feature information into a feature fusion network for feature fusion;
S303: performing further feature extraction through a convolutional network based on the fused feature information of the target to be detected from step S302, inputting the resulting features into a regression network and a classification network respectively, obtaining the center-point position and the frame height and width of the target to be detected through the regression network, retrieving the category of the target to be detected through the classification network, and obtaining the 3D size information of the target to be detected from a prior size library;
S304: inputting the fused feature information of the target to be detected from step S302 into an up-sampling network for up-sampling, performing a convolution operation on the up-sampled feature map through a pose convolutional network to output a pose feature map, and extracting the position information of 9 key points of the target to be detected according to the center-point position and the frame height and width of the target to be detected obtained in step S303.
Further, in step S303, the regression network predicts the deviation between the real target detection frame and a preset frame, where the real frame is B = (x_b, y_b, w_b, h_b), the preset frame is G = (x_g, y_g, w_g, h_g), and the predicted value of the regression network P = (x_p, y_p, w_p, h_p) is expressed as follows:
the real bounding box can be expressed as:
where x and y are the horizontal and vertical coordinates on the feature map, in pixel units, w and h are the width and height of a frame in pixel units, and s is the down-sampling ratio of the feature extraction network.
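The formulas (1) and (2) referenced here are not reproduced in the text. A conventional offset parameterization that is consistent with the variables defined above, stated as an assumption rather than as the patent's exact equations, is:

x_p = (x_b - x_g) / w_g,  y_p = (y_b - y_g) / h_g,  w_p = log(w_b / w_g),  h_p = log(h_b / h_g)    (1)

so that the real frame, expressed in input-image pixels, would be recovered as

x_b' = s(x_g + x_p·w_g),  y_b' = s(y_g + y_p·h_g),  w_b' = s·w_g·exp(w_p),  h_b' = s·h_g·exp(h_p)    (2)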
Further, in step S304, the key point is a local maximum point in the pose feature map, and the extraction conditions are as follows:
where p(u, v) is a pixel of the feature map and u and v are the corresponding pixel coordinates.
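Formula (3) is likewise not reproduced in the text. A plausible form of the local-maximum condition over the 3 × 3 neighborhood, stated here as an assumption, is:

p(u, v) > p(u + i, v + j)  for all (i, j) ∈ {-1, 0, 1} × {-1, 0, 1}, (i, j) ≠ (0, 0)    (3)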
Further, after step S304, the method further includes the following steps:
S305: matching the position information of the 9 key points of the target to be detected extracted in step S304 with the 3D size information of the target to be detected from step S303, and solving the three-dimensional pose information of the target to be detected from the 9 matched groups of key points with the PnP algorithm, namely the attitude R = (r_x, r_y, r_z, r_w) and the position T = (X_c, Y_c, Z_c) of the target to be detected relative to the camera;
S306: performing visual labeling in the monocular image based on the three-dimensional pose information of the target to be detected solved in step S305.
Further, assuming that the pose transformation matrix of the camera relative to the world coordinate system is A, the three-dimensional pose of the object to be detected in the world coordinate system can be expressed as:
p(X, Y, Z) = R·p_i + T    (4)
where R is the attitude of the target to be detected relative to the camera, T is the position of the target to be detected relative to the camera, and p_i is the coordinate of the target point in the camera coordinate system.
Further, the feature fusion network comprises four layers of convolutional neural networks, and the four layers of convolutional neural networks are sequentially connected in series.
Further, the 9 key points are 8 vertexes and 1 central point of the target to be detected.
Further, the 3D target detection model is a convolutional neural network model built with PyTorch.
The embodiment of the invention has the following advantages:
By adopting a deep learning method, the invention first identifies the target object in a cluttered environment before performing attitude estimation, which effectively reduces the interference of a complex environment on attitude estimation and improves its accuracy. Through the constructed deep learning network, a target detection model is trained with a method based on a prior database, and detection and attitude estimation of all targets are completed by a single network according to the number of target categories in the database while the targets are detected, which effectively reduces the computational burden on the computer and improves the real-time performance of the algorithm. For the problem of mutual occlusion between target objects, accurate attitude information is inferred by coupling the features extracted by the deep learning network with the prior size information of the targets. Because the deep learning model infers the position information of the object from the monocular image and the model information alone, the use of additional sensors can be effectively reduced, lowering the hardware cost of the system and the load at the end of the robotic arm and enhancing the flexibility of robot operation.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It should be apparent that the drawings in the following description are merely exemplary, and that other embodiments can be derived from the drawings provided by those of ordinary skill in the art without inventive effort.
The structures, proportions, and sizes shown in this specification are used only to match the content disclosed in the specification so that it can be understood and read by those skilled in the art; they do not limit the conditions under which the invention can be implemented and therefore have no essential technical significance. Any structural modification, change of proportion, or adjustment of size that does not affect the effects achievable by the invention shall still fall within the scope covered by the technical content disclosed herein.
FIG. 1 is a technical roadmap for the present invention;
FIG. 2 is a detailed block diagram of a feature fusion network provided by the present invention;
FIG. 3 is a schematic diagram of a coordinate system of an object to be detected according to the present invention;
FIG. 4 is a feature diagram of key points provided by the present invention;
Detailed Description
The present invention is described below in terms of particular embodiments; other advantages and features of the invention will become apparent to those skilled in the art from the following disclosure. It is to be understood that the described embodiments are merely exemplary of the invention and are not intended to limit the invention to the particular embodiments disclosed. All other embodiments derived by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
In order to solve the related technical problems in the prior art, the embodiment of the present application provides a 3D target detection method based on a monocular image and prior information, which aims to solve problems such as the inaccurate attitude estimation of existing target objects, and achieves the effects of reducing the interference of a complex environment on attitude estimation and improving its accuracy. As shown in FIGS. 1-4, the method specifically comprises the following steps:
S1: marking the dataset of the target to be detected according to the requirements of the operation task.
S2: model training is performed using the annotated dataset, resulting in a 3D target detection model.
The 3D target detection model is a convolutional neural network model built with PyTorch. First, a 3D target detection network based on a ResNet34 feature extraction network is built in PyTorch and pre-trained on the ImageNet dataset; the 3D target detection network is then trained preliminarily on the YCB dataset, and finally fine-tuned for 300 epochs on a self-labelled dataset. An epoch is defined as a single training pass through all batches, with forward and backward propagation.
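Purely as an illustration of the staged training just described, a minimal PyTorch sketch is given below; the optimizer, learning rate, and the names Detector3D, compute_loss, ycb_loader, and own_loader are assumptions and are not taken from the patent.

```python
import torch
import torchvision

# ResNet-34 backbone pre-trained on ImageNet, used as the feature extractor.
backbone = torchvision.models.resnet34(pretrained=True)

# "Detector3D" is a hypothetical wrapper adding the fusion, regression,
# classification and pose branches described below; it is not defined here.
# model = Detector3D(backbone)

def train(model, loader, epochs, lr=1e-4):
    """Generic supervised loop: forward pass, loss computation, backpropagation."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # optimizer choice assumed
    model.train()
    for _ in range(epochs):
        for images, targets in loader:
            optimizer.zero_grad()
            loss = model.compute_loss(images, targets)  # assumed model helper
            loss.backward()
            optimizer.step()

# Stage 1: preliminary training on the YCB dataset (duration not stated in the text).
# train(model, ycb_loader, epochs=...)
# Stage 2: 300-epoch fine-tuning on the self-labelled dataset.
# train(model, own_loader, epochs=300)
```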
S3: inputting a monocular image into the trained 3D target detection model to perform 3D target detection, predicting the position and attitude information of the 3D target, and labeling the category and 3D bounding box of the target object.
Step S3 specifically includes:
S301: monocular image information is acquired through a camera and input into the 3D target detection model.
S302: the feature information about the target to be detected in the monocular image is extracted through the ResNet-34 feature extraction network of the 3D target detection model and input into the feature fusion network for feature fusion.
As shown in FIG. 1, a monocular image serves as the data input; the ResNet34 feature extraction network performs 8-fold down-sampling on the input image to extract high-dimensional feature information of the object to be detected, and this information is fed into the feature fusion network for further high-dimensional feature extraction and fusion.
As shown in the detailed structure of the feature fusion network in FIG. 2, the feature fusion network includes four convolutional neural network layers connected in series. The high-dimensional feature information extracted by the ResNet34 feature extraction network is fed into the four serially connected convolutional layers (layer1-layer4). layer1 takes the high-dimensional feature information extracted by the ResNet34 feature extraction network as input and outputs 9 feature maps, namely heatmaps of the 8 vertices and 1 center point, 16 vector maps, namely the position vectors between the 8 vertices and the center point, and a high-dimensional feature map. Each of layer2-layer4 takes the output of the preceding layer together with the high-dimensional feature information extracted by the ResNet34 feature extraction network as input, fuses the input features through a convolutional network, and outputs the same items as layer1. Finally, the feature maps of the 9 key points and the 16 vector maps output by layer1-layer4 are concatenated, together with the high-dimensional features output by layer4, to form the final fused feature output.
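A minimal sketch of a single fusion stage under the above description follows; the channel sizes and the internal layer composition are assumptions, and only the split of the outputs (9 key-point heatmaps, 16 vector-map channels, and a high-dimensional feature map) follows the text.

```python
import torch
import torch.nn as nn

class FusionStage(nn.Module):
    """One of the four serially connected fusion layers (channel sizes assumed)."""

    def __init__(self, in_ch=512, feat_ch=128):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.heatmaps = nn.Conv2d(feat_ch, 9, kernel_size=1)   # 8 vertices + 1 center point
        self.vectors = nn.Conv2d(feat_ch, 16, kernel_size=1)   # (x, y) vectors for 8 vertices
        self.features = nn.Conv2d(feat_ch, feat_ch, kernel_size=1)

    def forward(self, x):
        # For layer1, x is the backbone feature map; for layer2-layer4, x is the
        # concatenation of the backbone features with the previous stage's feature
        # output (in_ch must be chosen accordingly when building the stage).
        f = self.fuse(x)
        return self.heatmaps(f), self.vectors(f), self.features(f)
```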
S303: based on the fused feature information of the target to be detected from step S302, further feature extraction is performed through a convolutional network, and the resulting features are input into a regression network and a classification network respectively; the regression network yields the center-point position and the frame height and width of the target to be detected, the classification network retrieves the category to which the target belongs, and the 3D size information of the target is obtained from the prior size library.
S304: the fused feature information of the target to be detected from step S302 is input into an up-sampling network for up-sampling, the up-sampled feature map is convolved by the pose convolutional network to output a pose feature map, and the position information of the 9 key points of the target is extracted according to the center-point position and the frame height and width of the target obtained in step S303. The 9 key points are the 8 vertices and 1 center point of the target to be detected.
The feature maps produced by the feature fusion network shown in FIG. 2 are input into the up-sampling network and the convolutional network, respectively. In the up-sampling network, three successive 2x Upsample operations (8x in total) restore the feature map to the size of the input image, and the up-sampled feature map is finally convolved by the pose convolutional network to output a pose feature map. In the convolutional network, the fused features from FIG. 2 undergo target feature extraction through convolution operations and are input into the classification network and the regression network respectively, for target classification detection and target frame regression.
In target classification detection, the features are passed through the classification network to output c + 1 category predictions, where c is the number of target categories and 1 accounts for the background. The c + 1 predicted values are numbers between 0 and 1 representing the confidence of each category, and the category with the highest confidence is taken as the category to which the target belongs.
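For illustration only, a minimal sketch of such a classification head is given below; the feature width, the number of categories, and the use of a sigmoid to obtain per-category scores in [0, 1] are assumptions.

```python
import torch
import torch.nn as nn

feat_ch, num_classes = 128, 20                                  # c = 20 categories (values assumed)
cls_head = nn.Conv2d(feat_ch, num_classes + 1, kernel_size=1)   # +1 output for the background

def classify(features):
    """Return per-location confidences in [0, 1] and the most confident category."""
    scores = torch.sigmoid(cls_head(features))   # c + 1 confidence values per location
    conf, cls = scores.max(dim=1)                # highest confidence gives the category
    return conf, cls
```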
In target frame regression, the regression network predicts the deviation between the real target detection frame and a preset frame, where the real frame is B = (x_b, y_b, w_b, h_b), the preset frame is G = (x_g, y_g, w_g, h_g), and the predicted value of the regression network P = (x_p, y_p, w_p, h_p) is expressed as follows:
the real bounding box can be expressed as:
where x and y are the horizontal and vertical coordinates on the feature map, in pixel units, w and h are the width and height of a frame in pixel units, and s is the down-sampling ratio of the feature extraction network.
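A sketch of decoding a predicted frame from these quantities is shown below; it uses the conventional offset parameterization assumed for formulas (1) and (2) above and is not taken verbatim from the patent.

```python
import math

def decode_box(pred, anchor, s=8):
    """Recover the center (x, y) and size (w, h) of the real frame, in input-image
    pixels, from a regression prediction and its preset frame on the feature map.
    pred = (x_p, y_p, w_p, h_p); anchor = (x_g, y_g, w_g, h_g); s is the
    down-sampling ratio of the feature extraction network (8 in this method)."""
    x_p, y_p, w_p, h_p = pred
    x_g, y_g, w_g, h_g = anchor
    x_b = s * (x_g + x_p * w_g)
    y_b = s * (y_g + y_p * h_g)
    w_b = s * w_g * math.exp(w_p)
    h_b = s * h_g * math.exp(h_p)
    return x_b, y_b, w_b, h_b
```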
The pixel coordinates and the category of the target object to be detected in the image can be determined through the steps.
As shown in FIG. 1, the pose feature map output by the pose convolutional network is sent to a 1 × 1 convolutional network for feature fusion, and the 9 key points of the target object are extracted according to the fused feature information; the key points are local maximum points in the pose feature map, and the extraction condition is as follows:
where p(u, v) is a pixel of the feature map and u and v are the corresponding pixel coordinates.
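The condition above can be implemented, for example, with a max-pooling comparison; the sketch below is one simple way to read out the key-point locations and is an assumption, not the patent's exact procedure.

```python
import torch
import torch.nn.functional as F

def extract_keypoints(heatmaps):
    """Keep only pixels equal to the maximum of their 3x3 neighborhood (local maxima),
    then take the strongest response per channel as that key point's (u, v) location.
    heatmaps: tensor of shape (1, 9, H, W), one channel per key point."""
    local_max = F.max_pool2d(heatmaps, kernel_size=3, stride=1, padding=1)
    peaks = heatmaps * (heatmaps == local_max)      # suppress non-maxima
    n, c, h, w = peaks.shape
    idx = peaks.view(n, c, -1).argmax(dim=-1)       # best location per channel
    u, v = idx % w, idx // w                        # column and row pixel coordinates
    return torch.stack([u, v], dim=-1)              # (1, 9, 2)
```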
S305: the length l, width w, and height h in the 3D size information of the corresponding target are retrieved according to the category of the target to be detected, and a coordinate system is established with the geometric center point of the target object to be detected as the origin of coordinates. The coordinates (x_i, y_i, z_i), i = 1, ..., 9, of the 8 vertices and the center point are then computed and matched one-to-one with the 9 pixel points extracted by the pose convolutional network.
Finally, the 9 matched groups of key points are solved with the PnP algorithm to obtain the attitude R = (r_x, r_y, r_z, r_w) and the position T = (X_c, Y_c, Z_c) of the target object to be detected relative to the camera. Assuming that the pose transformation matrix of the camera with respect to the world coordinate system is A, the pose of the target object in the world coordinate system can be expressed as:
p(X, Y, Z) = R·p_i + T    (4)
where R is the attitude of the target to be detected relative to the camera, T is the position of the target to be detected relative to the camera, and p_i is the coordinate of the target point in the camera coordinate system.
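A sketch of this solving step using OpenCV's PnP interface is given below; the key-point ordering, the half-extent construction of the 8 vertices from (l, w, h), and the use of cv2.solvePnP are assumptions consistent with the description rather than the patent's exact implementation. The resulting rotation matrix can be converted to the quaternion (r_x, r_y, r_z, r_w) mentioned above and, together with the camera-to-world transform A, used to express the pose in the world frame.

```python
import numpy as np
import cv2

def solve_pose(points_2d, dims, K):
    """Recover the object pose from the 9 detected key points via PnP.
    points_2d: (9, 2) array of pixel coordinates (8 vertices + center, fixed order);
    dims = (l, w, h) retrieved from the prior size library; K: 3x3 camera intrinsics."""
    l, w, h = dims
    xs, ys, zs = l / 2, w / 2, h / 2
    # 3D model points in the object frame, origin at the geometric center.
    points_3d = np.array(
        [[sx * xs, sy * ys, sz * zs]
         for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)]
        + [[0.0, 0.0, 0.0]], dtype=np.float64)        # 8 vertices + center point
    _, rvec, tvec = cv2.solvePnP(points_3d, np.asarray(points_2d, dtype=np.float64), K, None)
    R, _ = cv2.Rodrigues(rvec)                        # rotation of the object w.r.t. the camera
    return R, tvec                                    # attitude R and position T
```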
S306: visual labeling is performed in the monocular image based on the three-dimensional pose information of the target to be detected solved in step S305.
Although the invention has been described in detail above with reference to a general description and specific examples, it will be apparent to one skilled in the art that modifications or improvements may be made thereto based on the invention. Accordingly, such modifications and improvements are intended to be within the scope of the invention as claimed.
Claims (9)
1. A 3D target detection method based on monocular images and prior information, characterized by comprising the following steps:
S1: annotating a dataset of the targets to be detected according to the requirements of the operation task;
S2: performing model training with the annotated dataset to generate a 3D target detection model;
S3: inputting a monocular image into the trained 3D target detection model to perform 3D target detection, predicting the position and attitude information of the 3D target, and labeling the category and 3D bounding box of the target object.
2. The method for detecting a 3D object based on a monocular image and prior information as set forth in claim 1, wherein the step S3 specifically includes:
S301: acquiring monocular image information through a camera and inputting it into the 3D target detection model;
S302: extracting feature information about the target to be detected from the monocular image through the ResNet-34 feature extraction network of the 3D target detection model, and inputting the feature information into a feature fusion network for feature fusion;
S303: performing further feature extraction through a convolutional network based on the fused feature information of the target to be detected from step S302, inputting the resulting features into a regression network and a classification network respectively, obtaining the center-point position and the frame height and width of the target to be detected through the regression network, retrieving the category of the target to be detected through the classification network, and obtaining the 3D size information of the target to be detected from a prior size library;
S304: inputting the fused feature information of the target to be detected from step S302 into an up-sampling network for up-sampling, performing a convolution operation on the up-sampled feature map through a pose convolutional network to output a pose feature map, and extracting the position information of 9 key points of the target to be detected according to the center-point position and the frame height and width of the target to be detected obtained in step S303.
3. The method as claimed in claim 2, wherein in step S303, the deviation between the real target detection frame and the preset frame is predicted by the regression network, wherein the real frame is B = (x_b, y_b, w_b, h_b), the preset frame is G = (x_g, y_g, w_g, h_g), and the predicted value of the regression network P = (x_p, y_p, w_p, h_p) is expressed as follows:
the real bounding box can be expressed as:
where x and y are the horizontal and vertical coordinates on the feature map, in pixel units, w and h are the width and height of a frame in pixel units, and s is the down-sampling ratio of the feature extraction network.
4. The 3D object detection method based on the monocular image and the prior information as set forth in claim 2, wherein in step S304, the key point is a local maximum point in the pose feature map, and the extraction conditions are as follows:
wherein p(u, v) is a pixel of the feature map, and u and v are the corresponding pixel coordinates.
5. The monocular image and prior information based 3D object detecting method of claim 4, further comprising the following steps after step S304:
S305: matching the position information of the 9 key points of the target to be detected extracted in step S304 with the 3D size information of the target to be detected from step S303, and solving the three-dimensional pose information of the target to be detected from the 9 matched groups of key points with the PnP algorithm, namely the attitude R = (r_x, r_y, r_z, r_w) and the position T = (X_c, Y_c, Z_c) of the target to be detected relative to the camera;
S306: performing visual labeling in the monocular image based on the three-dimensional pose information of the target to be detected solved in step S305.
6. The monocular image and prior information based 3D object detecting method of claim 5, wherein assuming that the pose transformation matrix of the camera with respect to the world coordinate system is A, the three-dimensional pose of the object to be detected in the world coordinate system can be represented as:
p(X, Y, Z) = R·p_i + T    (4)
wherein R is the attitude of the target to be detected relative to the camera, T is the position of the target to be detected relative to the camera, and p_i is the coordinate of the target point in the camera coordinate system.
7. The monocular image and prior information based 3D object detection method of claim 2, wherein the feature fusion network comprises four layers of convolutional neural networks, and the four layers of convolutional neural networks are connected in series in sequence.
8. The monocular image and prior information based 3D object detecting method of claim 2, wherein the 9 key points are 8 vertices and 1 center point of the object to be detected.
9. The monocular image and prior information based 3D object detection method of claim 1, wherein the 3D object detection model is a convolutional neural network model built with PyTorch.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111359773.5A CN114004971A (en) | 2021-11-17 | 2021-11-17 | A 3D Object Detection Method Based on Monocular Image and Prior Information |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111359773.5A CN114004971A (en) | 2021-11-17 | 2021-11-17 | A 3D Object Detection Method Based on Monocular Image and Prior Information |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114004971A (en) | 2022-02-01 |
Family
ID=79929275
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111359773.5A Pending CN114004971A (en) | 2021-11-17 | 2021-11-17 | A 3D Object Detection Method Based on Monocular Image and Prior Information |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114004971A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109816725A (en) * | 2019-01-17 | 2019-05-28 | 哈工大机器人(合肥)国际创新研究院 | A kind of monocular camera object pose estimation method and device based on deep learning |
CN110689008A (en) * | 2019-09-17 | 2020-01-14 | 大连理工大学 | Monocular image-oriented three-dimensional object detection method based on three-dimensional reconstruction |
WO2020254448A1 (en) * | 2019-06-17 | 2020-12-24 | Ariel Ai Inc. | Scene reconstruction in three-dimensions from two-dimensional images |
CN112489119A (en) * | 2020-12-21 | 2021-03-12 | 北京航空航天大学 | Monocular vision positioning method for enhancing reliability |
CN112767486A (en) * | 2021-01-27 | 2021-05-07 | 清华大学 | Monocular 6D attitude estimation method and device based on deep convolutional neural network |
US20210233273A1 (en) * | 2020-01-24 | 2021-07-29 | Nvidia Corporation | Determining a 3-d hand pose from a 2-d image using machine learning |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111507222B (en) | A framework for 3D object detection based on multi-source data knowledge transfer | |
CN111695562B (en) | Autonomous robot grabbing method based on convolutional neural network | |
CN113012122B (en) | A class-level 6D pose and size estimation method and device | |
JP3560670B2 (en) | Adaptive recognition system | |
CN115861999B (en) | A robot grasping detection method based on multimodal visual information fusion | |
CN111260649B (en) | Close-range mechanical arm sensing and calibrating method | |
CN111368759B (en) | Monocular vision-based mobile robot semantic map construction system | |
CN113752255B (en) | Mechanical arm six-degree-of-freedom real-time grabbing method based on deep reinforcement learning | |
CN115578460B (en) | Robot grabbing method and system based on multi-mode feature extraction and dense prediction | |
CN116079727B (en) | Humanoid robot motion imitation method and device based on 3D human posture estimation | |
CN115330734A (en) | Automatic robot repair welding system based on three-dimensional target detection and point cloud defect completion | |
Yin et al. | Overview of robotic grasp detection from 2D to 3D | |
CN110569926A (en) | A Point Cloud Classification Method Based on Local Edge Feature Enhancement | |
Zheng et al. | An object recognition grasping approach using proximal policy optimization with YOLOv5 | |
CN116091447A (en) | Bolt attitude estimation method based on monocular RGB image | |
CN115810188A (en) | Method and system for identifying three-dimensional pose of fruit on tree based on single two-dimensional image | |
Sun et al. | Precise grabbing of overlapping objects system based on end-to-end deep neural network | |
CN115270399A (en) | An industrial robot attitude recognition method, device and storage medium | |
CN114882111A (en) | Method and device for estimating six-degree-of-freedom grabbing pose of mechanical arm | |
CN114004971A (en) | A 3D Object Detection Method Based on Monocular Image and Prior Information | |
CN117315019A (en) | Mechanical arm grabbing method based on deep learning | |
Li et al. | Objective-oriented efficient robotic manipulation: A novel algorithm for real-time grasping in cluttered scenes | |
Wang et al. | Robotic grasp pose detection method based on multiscale features | |
Wang et al. | Vision-based terrain perception of quadruped robots in complex environments | |
Hong et al. | Research of robotic arm control system based on deep learning and 3D point cloud target detection algorithm |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||