
CN116343165A - 3D target detection system, method, terminal equipment and storage medium


Info

Publication number
CN116343165A
CN116343165A
Authority
CN
China
Prior art keywords
target
distance
camera
detection
pedestrians
Prior art date
Legal status
Pending
Application number
CN202310132301.9A
Other languages
Chinese (zh)
Inventor
陈祥勇
柯英杰
陈卫强
苏亮
刘强生
邹雪莹
Current Assignee
Xiamen King Long United Automotive Industry Co Ltd
Original Assignee
Xiamen King Long United Automotive Industry Co Ltd
Priority date
Filing date
Publication date
Application filed by Xiamen King Long United Automotive Industry Co Ltd
Priority to CN202310132301.9A
Publication of CN116343165A

Classifications

    • G06V20/58 Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V20/64 Three-dimensional objects
    • G06V2201/07 Target detection
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a 3D target detection system, method, terminal device and storage medium, belonging to the technical field of visual detection. Two neural network models are trained, one to detect small targets at long range and one to detect large targets at close range; distant small targets are ranged with a monocular ranging algorithm and near large targets with a binocular ranging algorithm. Finally, the two models' detection frames, distances, labels and sizes are fused, so that state information such as the three-dimensional coordinates, size and category of an obstacle is obtained accurately, effectively improving 3D target detection precision.

Description

3D target detection system, method, terminal equipment and storage medium
Technical Field
The invention relates to the technical field of visual detection, in particular to a 3D target detection system, a 3D target detection method, terminal equipment and a storage medium.
Background
Visual target detection faces two main difficulties: small-target detection and close-range partial-target detection, such as the pedestrian "ghost probe" scenario, where only part of a target enters the field of view and, for example, a pedestrian's feet, hands, partial body or head must be detected and identified. Because the semantic information of near and far targets differs greatly, small targets at long range and partial targets at close range cannot both be detected accurately at the same time.
Visual ranging mainly includes monocular ranging and binocular ranging. Monocular ranging assumes the target lies on the ground plane and computes its position through camera-to-world coordinate conversion; it can measure distant targets, but its ranging accuracy is inferior to binocular ranging because it is sensitive to the camera's intrinsic and extrinsic parameters. Binocular ranging obtains depth information from disparity matching between the left and right cameras; limited by the camera baseline distance and sensitive to lighting, it cannot effectively measure distant targets.
To this end, we provide a 3D object detection system, method, terminal device and storage medium.
Disclosure of Invention
The invention provides a 3D target detection system, a 3D target detection method, terminal equipment and a storage medium, to overcome the difficulty that existing visual detection methods have in accurately acquiring obstacle state information.
The invention adopts the following technical scheme:
a 3D object detection system, comprising:
and the image reading module is used for reading the video images around the vehicle and acquiring the left and right parallax information of the target.
The remote target detection module is used for detecting vehicles, pedestrians, riders and other obstacles at a distance, and training the vehicles, pedestrians, riders and other obstacles with small targets by adopting a yolov5 neural network model.
The short-distance target detection module is used for detecting vehicles, pedestrians, riders and other obstacles, and training the vehicles, pedestrians, riders and other obstacles with large targets by adopting a mobilet-ssd neural network model.
And the monocular distance measuring module is used for calculating the target distances of vehicles, pedestrians, riders and other obstacles at a distance.
The binocular range module is used for calculating the target distances of nearby vehicles, pedestrians, riders and other obstacles.
And the fusion module is used for fusing the far and near target detection frames, the distance, the labels and the sizes and accurately outputting the state information of the positions, the sizes and the categories of the targets.
The image reading module comprises a middle RGB camera and left and right binocular cameras; the RGB camera acquires video images around the vehicle, and the left and right binocular cameras acquire the left-right disparity.
The input image resolution of the YOLOv5 neural network model is 640 x 640, and that of the MobileNet-SSD neural network model is 384 x 384.
The training labels of the MobileNet-SSD neural network model also include, but are not limited to, partial targets such as a pedestrian's feet, hands, partial body and head.
The monocular ranging module measures distance using the pinhole camera imaging principle and coordinate transformation, calculating the distance to distant vehicles, pedestrians, riders and other obstacles.
The binocular ranging module calculates depth from the target's left-right disparity combined with the camera baseline distance and focal length, obtaining the distance to nearby vehicles, pedestrians, riders and other obstacles.
The invention also provides a 3D target detection method employing the above 3D target detection system, comprising the following steps:
(1) Three cameras are arranged at the front windshield of the vehicle: an RGB camera in the middle and two binocular cameras on the left and right; after the system starts, the three cameras collect video images around the vehicle;
(2) The middle RGB camera reads video frame images and transmits them to the long-range and short-range target detection modules, while the left and right binocular cameras read the left and right views and calculate the target's left-right disparity;
(3) Long-range target detection: a YOLOv5 neural network model, taking the input image at its set resolution and optimized and accelerated for inference, detects the target category and bounding-box information of vehicles, pedestrians and riders;
(4) Ranging of distant targets: a monocular ranging algorithm computes the distance to each YOLOv5 detection, and the distant target's position, category and bounding-box information are output;
(5) Short-range target detection: a MobileNet-SSD neural network model, taking the input image at its set resolution and optimized and accelerated for inference, detects the target category and bounding-box information of vehicles, pedestrians and riders;
(6) Ranging of near targets: a binocular ranging algorithm computes the distance to each MobileNet-SSD detection, and the near target's position, category and bounding-box information are output;
(7) Resolution and bounding-box conversion: the MobileNet-SSD output image is converted to the YOLOv5 output image resolution, and the corresponding bounding boxes are scaled proportionally;
(8) Label conversion: the labels of the two models' detection results differ, so they are converted into unified labels;
(9) An image boundary line and an overlap region are set; above the boundary and overlap region the long-range detection results are used, and below them the short-range detection results are used;
(10) De-duplication of overlap-region targets: the overlap region contains both long-range and short-range targets; these are traversed, and the bounding-box IOU and the Euclidean distance are calculated for fusion and de-duplication; if the IOU is greater than its threshold or the Euclidean distance is less than its threshold, the detections are treated as the same target, the MobileNet-SSD result is kept and the YOLOv5 result is deleted; if the IOU is less than its threshold and the Euclidean distance is greater than its threshold, the detections are treated as different targets and both results are kept;
(11) Fusion of non-overlap-region targets: the long-range detection results are fused with the short-range detection results;
(12) Fusion of overlap-region and non-overlap-region targets: the detection results of the overlap region and the non-overlap region are fused;
(13) According to the fused detection result labels and position information, the three-dimensional size of each target model is assigned and output.
The monocular ranging algorithm in step (4) is calculated using the following formulas:
$$Z_c \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K P \begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix}$$

$$K = \begin{bmatrix} f_x & 0 & u_0 \\ 0 & f_y & v_0 \\ 0 & 0 & 1 \end{bmatrix}, \qquad P = \begin{bmatrix} R & T \end{bmatrix}$$
where K is the intrinsic parameter matrix, P is the extrinsic parameter matrix, R and T are respectively the rotation and translation matrices from the world coordinate system to the camera coordinate system, fx and fy are the focal lengths in pixels in the camera's horizontal and vertical directions, (u0, v0) are the camera's optical center coordinates, (u, v) are the pixel coordinates of the target point on the image, Zc is the distance from the target point to the camera, and Xw, Yw and Zw are the distances of the target point along the X, Y and Z directions of the world coordinate system;
knowing the pixel coordinates (u, v) of the target point on the image and obtaining the intrinsic matrix K and extrinsic matrix P through camera calibration, the distances Xw, Yw and Zw between the target and the vehicle are solved, with Zw = 0 under the ground-plane assumption.
The binocular ranging algorithm in step (6) is calculated with the formula d = f × b / (xl − xr), where d is the depth distance of the detected target, f is the camera focal length, b is the baseline distance between the left and right cameras, and xl − xr is the disparity of the corresponding pixels.
The IOU calculation formula is as follows:

$$IOU = \frac{|A \cap B|}{|A \cup B|}$$

where A is the bounding box of object 1 and B is the bounding box of object 2. The Euclidean distance calculation formula is as follows:

$$d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$$

where $(x_1, y_1)$ are the position coordinates of object 1 and $(x_2, y_2)$ are the position coordinates of object 2.
The invention also provides 3D target detection terminal equipment, which comprises a processor, a memory and a computer program stored in the memory and running on the processor, wherein the steps of the 3D target detection method are realized when the processor executes the computer program.
The present invention also provides a computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above-described 3D object detection method.
From the above description of the invention, it is clear that the invention has the following advantages over the prior art:
according to the invention, two neural network models are trained and are respectively used for detecting a far small target and a near large target, wherein the far small target is detected by a monocular ranging algorithm, the near large target is detected by a binocular ranging algorithm, and finally, the two model target detection frames, the distances, the labels and the sizes are fused, so that the state information of the three-dimensional coordinates, the sizes, the categories and the like of the obstacle can be accurately obtained, and the detection precision of the 3D target can be effectively improved.
Drawings
Fig. 1 is a block diagram of a first embodiment of the present invention.
Fig. 2 is a flowchart of a first embodiment of the present invention.
Detailed Description
A specific embodiment of the present invention will be described below with reference to fig. 1. Numerous details are set forth in the following description in order to provide a thorough understanding of the present invention, but it will be apparent to one skilled in the art that the present invention may be practiced without these details. Well-known components, methods and procedures are not described in detail.
Example 1
A 3D object detection system, referring to fig. 1, includes an image reading module 10, a long-range object detection module 20, a short-range object detection module 30, a monocular ranging module 40, a binocular ranging module 50, and a fusion module 60. Wherein:
the image reading module is used for reading the video images around the vehicle and acquiring the left and right parallax information of the target. The device comprises a middle RGB camera, a left binocular camera and a right binocular camera, wherein the RGB camera is used for acquiring video images around a vehicle, and the left binocular camera and the right binocular camera are used for acquiring left parallax and right parallax.
The long-range target detection module is used for detecting distant vehicles, pedestrians, riders and other obstacles. It adopts a YOLOv5 neural network model with an input image resolution of 640 x 640; the training labels are small targets such as vehicles, pedestrians, riders and other obstacles.
The short-range target detection module is used for detecting nearby vehicles, pedestrians, riders and other obstacles. It adopts a MobileNet-SSD neural network model with an input image resolution of 384 x 384; the training labels are large targets such as vehicles, pedestrians, riders and other obstacles, and also include, but are not limited to, partial targets such as a pedestrian's feet, hands, partial body and head.
The monocular ranging module measures distance using the pinhole camera imaging principle and coordinate transformation, calculating the distance to distant vehicles, pedestrians, riders and other obstacles.
The binocular ranging module calculates depth from the target's left-right disparity combined with the camera baseline distance and focal length, obtaining the distance to nearby vehicles, pedestrians, riders and other obstacles.
The fusion module is used for fusing the far and near target detection frames, distances, labels and sizes, and accurately outputting target position, size and category state information. It performs detection-frame fusion, label fusion, distance fusion and size fusion: detection-frame fusion merges the distant small-target frames with the near large-target frames; label fusion converts and unifies the two detection models' labels; distance fusion associates the distant small-target distances with the near large-target distances; size fusion merges the acquired target sizes.
Referring to Fig. 2, the present invention further provides a 3D target detection method employing the above detection system, which specifically includes the following steps:
(1) Three cameras are arranged at the front windshield of the vehicle: an RGB camera in the middle and two binocular cameras on the left and right. After the system starts, the three cameras collect video images around the vehicle.
(2) The middle RGB camera reads video frame images and transmits them to the long-range and short-range target detection modules, while the left and right binocular cameras read the left and right views and calculate the target's left-right disparity.
(3) Long-range target detection adopts a YOLOv5 neural network model with an input image resolution of 640 x 640 and, after optimization and inference acceleration, detects the target category and bounding-box information of vehicles, pedestrians and riders.
(4) Ranging of distant targets: a monocular ranging algorithm computes the distance to each YOLOv5 detection and outputs the distant target's position, category and bounding-box information, specifically using the following formulas:
$$Z_c \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K P \begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix}$$

$$K = \begin{bmatrix} f_x & 0 & u_0 \\ 0 & f_y & v_0 \\ 0 & 0 & 1 \end{bmatrix}, \qquad P = \begin{bmatrix} R & T \end{bmatrix}$$
where K is the intrinsic parameter matrix, P is the extrinsic parameter matrix, R and T are respectively the rotation and translation matrices from the world coordinate system to the camera coordinate system, fx and fy are the focal lengths in pixels in the camera's horizontal and vertical directions, (u0, v0) are the camera's optical center coordinates, (u, v) are the pixel coordinates of the target point on the image, Zc is the distance from the target point to the camera, and Xw, Yw and Zw are the distances of the target point along the X, Y and Z directions of the world coordinate system;
knowing the pixel coordinates (u, v) of the target point on the image and obtaining the intrinsic matrix K and extrinsic matrix P through camera calibration, the distances Xw, Yw and Zw between the target and the vehicle are solved, with Zw = 0 under the ground-plane assumption.
(5) Short-range target detection adopts a MobileNet-SSD neural network model with an input image resolution of 384 x 384 and, after optimization and inference acceleration, detects the target category and bounding-box information of vehicles, pedestrians and riders.
(6) Ranging of near targets: a binocular ranging algorithm computes the distance to each MobileNet-SSD detection and outputs the near target's position, category and bounding-box information, specifically using the following formula:
d = f × b / (xl − xr), where d is the depth distance of the detected target, f is the camera focal length, b is the baseline distance between the left and right cameras, and xl − xr is the disparity of the corresponding pixels.
(7) Resolution and bounding-box conversion: the MobileNet-SSD output image is converted to the YOLOv5 output image resolution, and the corresponding bounding boxes are scaled proportionally, as sketched below.
(8) Label conversion: the labels of the two models' detection results differ, so they are converted into unified labels.
(9) An image boundary line and an overlap region are set; above the boundary and overlap region the long-range detection results are used, and below them the short-range detection results are used.
(10) De-duplication of overlap-region targets: the overlap region contains both long-range and short-range targets; these are traversed, and the bounding-box IOU and the Euclidean distance are calculated for fusion and de-duplication. If the IOU is greater than its threshold or the Euclidean distance is less than its threshold, the detections are treated as the same target: the MobileNet-SSD result is kept and the YOLOv5 result is deleted. If the IOU is less than its threshold and the Euclidean distance is greater than its threshold, the detections are treated as different targets and both results are kept (a code sketch follows the formulas below).
The IOU calculation formula is as follows:

$$IOU = \frac{|A \cap B|}{|A \cup B|}$$

where A is the bounding box of object 1 and B is the bounding box of object 2. The Euclidean distance calculation formula is as follows:

$$d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$$

where $(x_1, y_1)$ are the position coordinates of object 1 and $(x_2, y_2)$ are the position coordinates of object 2.
(11) Fusion of non-overlap-region targets: the long-range detection results are fused with the short-range detection results.
(12) Fusion of overlap-region and non-overlap-region targets: the detection results of the overlap region and the non-overlap region are fused.
(13) According to the fused detection result labels and position information, the three-dimensional size of each target model is assigned and output; a minimal sketch follows.
Example 2
The invention also provides 3D target detection terminal equipment, which comprises a processor, a memory and a computer program stored in the memory and running on the processor, wherein the steps of the 3D target detection method are realized when the processor executes the computer program.
Further, as an executable scheme, the 3D object detection terminal device may be a computing device such as a vehicle-mounted computer or a cloud server, and may include, but is not limited to, a processor and a memory. Those skilled in the art will appreciate that the above composition is merely an example of the 3D object detection terminal device and does not limit it; the device may include more or fewer components than described, combine certain components, or use different components. For example, it may further include input/output devices, network access devices, a bus, etc., which the embodiments of the invention do not limit.
Further, as an implementation, the processor may be a central processing unit (Central Processing Unit, CPU), another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, discrete hardware components, etc. The general-purpose processor may be a microprocessor or any conventional processor. The processor is the control center of the 3D object detection terminal device and connects the various parts of the whole device using various interfaces and lines.
The memory may be used to store the computer program and/or modules, and the processor implements the various functions of the 3D object detection terminal device by running or executing the computer program and/or modules stored in the memory and invoking data stored in the memory. The memory may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and the application programs required by at least one function, and the data storage area may store data created according to the use of the device. In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a Flash Card, at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
The present invention also provides a computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above-described 3D object detection method.
The modules/units integrated in the 3D object detection terminal device may be stored in a computer-readable storage medium if implemented in the form of software functional units and sold or used as an independent product. Based on this understanding, the present invention may implement all or part of the flow of the above method embodiments by instructing the related hardware through a computer program; the computer program may be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of each of the method embodiments described above. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, read-only memory (ROM), random access memory (RAM), a software distribution medium, and so on.
The foregoing is merely illustrative of specific embodiments of the present invention, but the design concept of the present invention is not limited thereto, and any insubstantial modification of the present invention by using the design concept shall fall within the scope of the present invention.

Claims (12)

1. A 3D object detection system, comprising:
an image reading module for reading video images around the vehicle and acquiring the target's left-right disparity information;
a long-range target detection module for detecting distant vehicles, pedestrians, riders and other obstacles, using a YOLOv5 neural network model trained on small-target samples of vehicles, pedestrians, riders and other obstacles;
a short-range target detection module for detecting nearby vehicles, pedestrians, riders and other obstacles, using a MobileNet-SSD neural network model trained on large-target samples of vehicles, pedestrians, riders and other obstacles;
a monocular ranging module for calculating the distance to distant vehicles, pedestrians, riders and other obstacles;
a binocular ranging module for calculating the distance to nearby vehicles, pedestrians, riders and other obstacles;
and a fusion module for fusing the far and near target detection frames, distances, labels and sizes, and accurately outputting target position, size and category state information.
2. A 3D object detection system according to claim 1, wherein: the image reading module comprises a middle RGB camera and left and right binocular cameras; the RGB camera acquires video images around the vehicle, and the left and right binocular cameras acquire the left-right disparity.
3. A 3D object detection system according to claim 1, wherein: the input image resolution of the YOLOv5 neural network model is 640 x 640, and that of the MobileNet-SSD neural network model is 384 x 384.
4. A 3D object detection system according to claim 1, wherein: the training labels of the MobileNet-SSD neural network model also include, but are not limited to, partial targets such as a pedestrian's feet, hands, partial body and head.
5. A 3D object detection system according to claim 1, wherein: the monocular ranging module measures distance using the pinhole camera imaging principle and coordinate transformation, calculating the distance to distant vehicles, pedestrians, riders and other obstacles.
6. A 3D object detection system according to claim 1, wherein: the binocular ranging module calculates depth from the target's left-right disparity combined with the camera baseline distance and focal length, obtaining the distance to nearby vehicles, pedestrians, riders and other obstacles.
7. A 3D object detection method employing the 3D object detection system according to claim 2, comprising the steps of:
(1) Three cameras are arranged at the front windshield of the vehicle: an RGB camera in the middle and two binocular cameras on the left and right; after the system starts, the three cameras collect video images around the vehicle;
(2) The middle RGB camera reads video frame images and transmits them to the long-range and short-range target detection modules, while the left and right binocular cameras read the left and right views and calculate the target's left-right disparity;
(3) Long-range target detection: a YOLOv5 neural network model, taking the input image at its set resolution and optimized and accelerated for inference, detects the target category and bounding-box information of vehicles, pedestrians and riders;
(4) Ranging of distant targets: a monocular ranging algorithm computes the distance to each YOLOv5 detection, and the distant target's position, category and bounding-box information are output;
(5) Short-range target detection: a MobileNet-SSD neural network model, taking the input image at its set resolution and optimized and accelerated for inference, detects the target category and bounding-box information of vehicles, pedestrians and riders;
(6) Ranging of near targets: a binocular ranging algorithm computes the distance to each MobileNet-SSD detection, and the near target's position, category and bounding-box information are output;
(7) Resolution and bounding-box conversion: the MobileNet-SSD output image is converted to the YOLOv5 output image resolution, and the corresponding bounding boxes are scaled proportionally;
(8) Label conversion: the labels of the two models' detection results differ, so they are converted into unified labels;
(9) An image boundary line and an overlap region are set; above the boundary and overlap region the long-range detection results are used, and below them the short-range detection results are used;
(10) De-duplication of overlap-region targets: the overlap region contains both long-range and short-range targets; these are traversed, and the bounding-box IOU and the Euclidean distance are calculated for fusion and de-duplication; if the IOU is greater than its threshold or the Euclidean distance is less than its threshold, the detections are treated as the same target, the MobileNet-SSD result is kept and the YOLOv5 result is deleted; if the IOU is less than its threshold and the Euclidean distance is greater than its threshold, the detections are treated as different targets and both results are kept;
(11) Fusion of non-overlap-region targets: the long-range detection results are fused with the short-range detection results;
(12) Fusion of overlap-region and non-overlap-region targets: the detection results of the overlap region and the non-overlap region are fused;
(13) According to the fused detection result labels and position information, the three-dimensional size of each target model is assigned and output.
8. The 3D object detection method according to claim 7, wherein the monocular ranging algorithm in step (4) is calculated using the following formula:
$$Z_c \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K P \begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix}$$

$$K = \begin{bmatrix} f_x & 0 & u_0 \\ 0 & f_y & v_0 \\ 0 & 0 & 1 \end{bmatrix}, \qquad P = \begin{bmatrix} R & T \end{bmatrix}$$
where K is the intrinsic parameter matrix, P is the extrinsic parameter matrix, R and T are respectively the rotation and translation matrices from the world coordinate system to the camera coordinate system, fx and fy are the focal lengths in pixels in the camera's horizontal and vertical directions, (u0, v0) are the camera's optical center coordinates, (u, v) are the pixel coordinates of the target point on the image, Zc is the distance from the target point to the camera, and Xw, Yw and Zw are the distances of the target point along the X, Y and Z directions of the world coordinate system;
knowing the pixel coordinates (u, v) of the target point on the image and obtaining the intrinsic matrix K and extrinsic matrix P through camera calibration, the distances Xw, Yw and Zw between the target and the vehicle are solved, with Zw = 0 under the ground-plane assumption.
9. The 3D object detection method according to claim 7, wherein the binocular ranging algorithm in step (6) is calculated using the formula d = f × b / (xl − xr), where d is the depth distance of the detected target, f is the camera focal length, b is the baseline distance between the left and right cameras, and xl − xr is the disparity of the corresponding pixels.
10. The 3D object detection method of claim 7, wherein the IOU calculation formula is as follows:
$$IOU = \frac{|A \cap B|}{|A \cup B|}$$

where A is the bounding box of object 1 and B is the bounding box of object 2. The Euclidean distance calculation formula is as follows:

$$d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$$

where $(x_1, y_1)$ are the position coordinates of object 1 and $(x_2, y_2)$ are the position coordinates of object 2.
11. A 3D target detection terminal device, characterized in that it comprises a processor, a memory and a computer program stored in the memory and running on the processor, wherein the processor, when executing the computer program, implements the steps of the 3D target detection method according to any one of claims 7-10.
12. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the 3D target detection method according to any one of claims 7-10.
CN202310132301.9A 2023-02-17 2023-02-17 3D target detection system, method, terminal equipment and storage medium Pending CN116343165A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310132301.9A CN116343165A (en) 2023-02-17 2023-02-17 3D target detection system, method, terminal equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310132301.9A CN116343165A (en) 2023-02-17 2023-02-17 3D target detection system, method, terminal equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116343165A true CN116343165A (en) 2023-06-27

Family

ID=86879879

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310132301.9A Pending CN116343165A (en) 2023-02-17 2023-02-17 3D target detection system, method, terminal equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116343165A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116953680A (en) * 2023-09-15 2023-10-27 成都中轨轨道设备有限公司 Image-based real-time ranging method and system for target object
CN116953680B (en) * 2023-09-15 2023-11-24 成都中轨轨道设备有限公司 Image-based real-time ranging method and system for target object
CN119152478A (en) * 2024-11-18 2024-12-17 中国第一汽车股份有限公司 Target detection method, device, storage medium and equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination