Object Detection and Localization Using Stereo Cameras
This work was supported by "the Fundamental Research Funds for the Central Universities", "the Science and Technology Innovation Planning Project of Ministry of Education of China", the "NVIDIA NVAIL program", "the National Natural Science Foundation of China under Grant No. U1804161", and the Key Laboratory of Advanced Perception and Intelligent Control of High-end Equipment of Ministry of Education (Anhui Polytechnic University, Wuhu, China, 241000) under Grant Nos. GDSC202001 and GDSC202007. Experiments were conducted on an NVIDIA DGX-2.

1 Haoran Wu and Hongbo Gao are with the Department of Automation, University of Science and Technology of China, Hefei, China. haoran.wu@outlook.com, ghb48@126.com
2 Hang Su is with the Department of Electronics, Information and Bioengineering, Politecnico di Milano, 20133 Milan, Italy. hang.su@polimi.it
3 Yueyue Liu is with the School of Automation Science and Engineering, South China University of Technology, Guangzhou, China. lyy8313167@163.com

Abstract— Camera systems have become increasingly popular because cameras are cheap and easy to deploy. Compared with other depth cameras, the stereo camera is small and easily carried by subjects. Through a fixed baseline, the stereo camera is able to compute depth information. However, the traditional stereo matching algorithm cannot compute depth information at the edges of the image. Meanwhile, due to the large amount of data in the 3D point cloud, there is no specific numerical relationship between semantic information and depth information. To solve this problem, estimating depth and semantic information in an accurate way is required. A deep neural network model is used to predict semantic information and depth at the same time. Further, we propose a robust method to deal with brightness variation and improve performance under actual conditions.
I. INTRODUCTION

Cameras used to be the main sensor for mobile robots to avoid obstacles: depth information and obstacle information are obtained through monocular, binocular, and RGB-depth cameras. Stereo cameras can compute depth values from the disparity map using the intrinsic camera parameters. RGB-depth cameras can obtain a more accurate distance value by matching the structured-light data. The obstacle information is then used for motion planning or other control strategies. However, in human-computer interaction [1] [2] [3] [4], when 3D point cloud data are passed to people, humans cannot process these data. When the blind wear guide glasses, cameras can only deliver instructions to the blind through voice or a vibration motor. Obviously, redundant visual information is difficult to convey to the blind.

The visually impaired have been a particular group. In the world, the number of blind people is close to 300 million. Scientific research institutions have invested little in auxiliary equipment developed for this group. The main reason is that the weight and volume of sensors make them difficult for the visually impaired to carry. Vision is an important organ for human interaction with the outside world. Loss of visual function greatly limits the ability to interact with the world. It limits the movement of the blind in daily life; they move slowly through constant groping. Also, they cannot know whether there are obstacles in front of them. Therefore, how a machine can interact with visually impaired people in a special way, based on the extracted external information, is a big challenge.

Depth cameras obtain rich visual information, including depth information, image color, and semantic information after processing. However, it is not easy to quickly accept and understand this information without visual interaction, and humans cannot directly understand three-dimensional depth information. In this paper, we seek a method that can extract semantic features and depth information from images for fast interaction with visually impaired people.
The rise of deep learning has allowed machines to recognize real-world knowledge. AlexNet [5] uses a convolutional network to obtain image information. Compared to a fully connected model, sparse perception can filter useless information, enabling the machine to focus on and extract contour features; meanwhile, it reduces the computational cost of the network model. Batch normalization [6] makes it possible to deepen the number of network layers. VGG [7] and Xception [8] have become common models for large-scale image recognition. ResNet [9] uses residual blocks to keep the back-propagated gradient values within a reasonable range.

The accuracy of object recognition has gradually improved with the advancement of network models and has exceeded human performance. Object detection, as a branch of object recognition, has also been improved by many researchers. Regardless of the input size, spatial pyramid pooling [10] can produce a fixed-size output, using different scales as inputs to obtain pooled features of a fixed vector length. The Faster R-CNN [11] pipeline is more compact and significantly improves object detection speed. However, the two-step detection increases the computational overhead and fails to meet industry standards for real-time operation. YOLO adopts one-step detection to achieve real-time prediction speed while ensuring the accuracy of object recognition. YOLOv3 [12] further solves the problem of multi-scale observation, thereby improving the efficiency of object recognition, and can identify smaller objects in the picture.

However, object detection only detects the categories of objects and their pixel positions; it cannot determine the position of an object in three-dimensional space. This paper presents a new network that predicts the position of an object in three-dimensional space through stereo cameras.
II. DEPTH ESTIMATION AND SEMANTIC INFORMATION FROM CAMERAS

A. Monocular feature

It is true that people can obtain certain depth information through one eye, but some factors are ignored. One is that people know the object model (prior knowledge), which includes size, shape, and color. When people observe an object with one eye, a rough distance can be inferred from the model we have memorized. For example, humans can still identify the categories of objects by observing some of their features; part of an object may be blurred due to camera focus issues, and humans can still identify object categories from some local features. Second, when people observe objects with one eye, the human eye is actually shaking. This is equivalent to the movement of a monocular camera, which is similar to the principle of structure from motion [13]. The monocular camera can obtain depth information by computing the disparity between the current frame and adjacent frames. The basis for image matching is image texture: complex image textures achieve better matching results, while a lack of object texture and low texture leave part of the depth information missing. Further, the estimation of the camera's Euler angles and spatial position will also cause a large deviation.

Fig. 1. Relationship between pixels in stereo cameras and a co-visible point in three-dimensional space.
algorithm that is very similar to the cost aggregation in
local stereo matching algorithms. The global stereo matching
B. Stereo feature algorithms are adopted to achieve the same global energy
Stereo vision measurement is based on the disparity function and minimize cost resources. In this case, more or
map. Fig. 1 shows a simple stereo imaging principle. The all pixels of the image need to participate in the current pixel.
distance between the left and right cameras center is the A multi-path constraint aggregation for neighborhood op-
baseline. The origin of the camera coordinate system is at erations (neighborhood summation, weighted average, etc.)
the optical center of the camera lens. The coordinate system within a certain range. The cost aggregation process of the
is shown in Fig. 1. In fact, the imaging plane of the camera current pixel is affected by all pixels in multiple directions.
is behind the optical center of the lens. The left and right Neighborhood pixels in the different paths will affect the
imaging planes are drawn at f in front of the optical center total cost of the current pixel. It does not only guarantee the
of the lens. The u-axis and v-axis of this virtual image constraint of the global pixel but also reduces computational
plane coordinate system O1 (u, v), and the camera coordinate resource and avoid complex operators.
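Equation (1) is straightforward to apply once a disparity map is available. The following is a minimal sketch (our own illustration, not the paper's code; the function name and the small-disparity cutoff `min_disp` are assumptions) of converting a disparity map to metric depth:

```python
import numpy as np

def disparity_to_depth(disparity, f_px, baseline_m, min_disp=0.5):
    """Apply Eq. (1): D = b * f / (U_R - U_L) to a whole disparity map.

    disparity  : HxW array of (U_R - U_L) values in pixels
    f_px       : focal length in pixels
    baseline_m : stereo baseline in meters
    """
    depth = np.full(disparity.shape, np.inf, dtype=np.float32)
    valid = disparity > min_disp        # near-zero disparity -> unreliably large depth
    depth[valid] = f_px * baseline_m / disparity[valid]
    return depth
```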
In stereo matching, the left and right cameras have different shooting angles. Generally, the two images share part of the same scene and differ at the edges. Because of this difference in capture perspective, the traditional stereo matching algorithm misses edge depth values. As shown in Fig. 2, which uses semi-global matching based on the left stereo image, part of the depth is missing because the right view cannot capture those scenes.

C. Depth estimation

SGM (semi-global matching [14]) is a cost aggregation algorithm that is very similar to the cost aggregation in local stereo matching algorithms. Global stereo matching algorithms are adopted to minimize a global energy function, in which case many or all pixels of the image participate in the cost of the current pixel. SGM instead applies a multi-path constrained aggregation of neighborhood operations (neighborhood summation, weighted average, etc.) within a certain range. The cost aggregation of the current pixel is affected by pixels along multiple directions, and neighborhood pixels on the different paths contribute to the total cost of the current pixel. This not only approximates the global pixel constraint but also reduces computational resources and avoids complex operators.

The gradient information of the preprocessed image is obtained by a sampling-based gradient cost.
Fig. 3. Two sub-modules estimate the binocular image depth and predict the object frame.
The sum of absolute differences (SAD) cost of the original image is obtained by the sampling-based method:

$$ C(u, v, d) = \sum_{i=-n/2}^{n/2} \sum_{j=-n/2}^{n/2} \big| I_L(u+i,\, v+j) - I_R(u+d+i,\, v+j) \big| \quad (2) $$

where the image matrices are in $\mathbb{R}^{M \times N}$, $C(u, v, d)$ is the aggregation cost at pixel $(u, v)$ for disparity $d$, $n \times n$ is the relative area around pixel $(u, v)$, and $I_L$ and $I_R$ are the grayscale images of the left and right cameras.
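To make Eq. (2) concrete, here is a compact NumPy sketch that builds a SAD cost volume over a disparity range (illustrative only, not the paper's code; the window size, disparity range, and the use of `uniform_filter` for the window sum are our choices):

```python
import numpy as np
from scipy.ndimage import uniform_filter

def sad_cost_volume(left, right, max_disp=64, n=5):
    """Eq. (2): SAD cost over an n x n window for each pixel and disparity d,
    matching I_L(u, v) against I_R(u + d, v) as in the paper's convention.

    left, right : HxW grayscale images (float32)
    Returns cost[d, v, u]; unmatched border pixels keep an infinite cost.
    """
    H, W = left.shape
    cost = np.full((max_disp, H, W), np.inf, dtype=np.float32)
    for d in range(max_disp):
        ad = np.abs(left[:, : W - d] - right[:, d:])      # pixelwise |I_L - I_R|
        # uniform_filter gives the window mean; scaling by n*n yields the window sum
        cost[d, :, : W - d] = uniform_filter(ad, size=n) * (n * n)
    return cost

# Winner-takes-all disparity: d*(u, v) = argmin_d C(u, v, d)
# disparity = sad_cost_volume(left, right).argmin(axis=0)
```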
There is a tailing effect due to the dynamic programming algorithm: it easily generates mismatches at the edges of objects. Using dynamic programming to accumulate one-dimensional energy propagates wrong depth information along the subsequent paths. The semi-global algorithm uses information from multiple directions to eliminate the interference of this misinformation, which clearly reduces the tailing effect produced by the dynamic programming algorithm.

In this paper, the adopted algorithm establishes a global Markov energy equation through constraints along one-dimensional paths in multiple directions on the image. The total matching cost of each pixel accumulates the information of all paths: the energy accumulated in each direction adds its matching costs to give the total matching cost, as shown in the following formulas:
$$ L_r(D) = \sum_{p} \Big( C(p, D_p) + \sum_{q \in N_p} P_1\, I\big[\,|D_p - D_q| = 1\,\big] + \sum_{q \in N_p} P_2\, I\big[\,|D_p - D_q| > 1\,\big] \Big) \quad (3) $$

$$ L_{total}(p, d) = \sum_{r} L_r(p, d) \quad (4) $$

where $N_p$ is the set of pixels around the current pixel $p$, $D_p$ is the disparity value obtained from the grayscale image, and $I[\cdot]$ is a boolean indicator function. $L_r$ is the cost function accumulated along path $r$ in the left image, and $P_1$ and $P_2$ are the smoothing penalties that distinguish whether the disparity difference between the pixel and its neighboring point is small or large. Adding the matching costs of all paths $r$ (we choose $n \times n$ sliding windows as in (2)) gives the total matching cost. Post-processing then generates a smooth disparity by sub-pixel interpolation of the generated cost volume.
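Equations (3) and (4) define the energy; in practice, SGM minimizes it approximately with a per-path recursion [14]. The sketch below shows one horizontal aggregation path in its standard simplified form (not the paper's implementation; the penalty values are placeholders):

```python
import numpy as np

def aggregate_left_to_right(cost, P1=8.0, P2=32.0):
    """One SGM path (left -> right) approximately minimizing Eq. (3) per scanline.

    cost : (D, H, W) matching-cost volume from Eq. (2).
    Returns L_r of the same shape; summing several paths realizes Eq. (4).
    """
    D, H, W = cost.shape
    L = np.empty_like(cost)
    L[:, :, 0] = cost[:, :, 0]                    # paths start at the image border
    for u in range(1, W):
        prev = L[:, :, u - 1]                     # (D, H) costs at the previous pixel
        prev_min = prev.min(axis=0)               # min_k L_r(p - r, k)
        same = prev                                              # |Dp - Dq| = 0
        up = np.vstack([prev[1:], np.full((1, H), np.inf)]) + P1    # |Dp - Dq| = 1
        down = np.vstack([np.full((1, H), np.inf), prev[:-1]]) + P1
        jump = prev_min + P2                                     # |Dp - Dq| > 1
        best = np.minimum(np.minimum(same, up), np.minimum(down, jump))
        L[:, :, u] = cost[:, :, u] + best - prev_min  # subtraction keeps values bounded
    return L
```

Summing $L_r$ over several directions (typically four or eight) gives $L_{total}$ of Eq. (4); the disparity at each pixel is then chosen by winner-takes-all over $d$, and sub-pixel accuracy is recovered by interpolating the cost curve around the minimum.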
III. NETWORK ARCHITECTURE

To estimate object depth in one stage, we combine the depth estimation module with an object detection network. Separate networks require additional memory storage and consume more computing resources, while some visual features are common to both modules, as shown in Fig. 3.

A. Depth estimation

In traditional stereo matching, the algorithm generates a disparity map by matching pixel pairs between images along the scan path. To handle non-textured areas, these methods increase the size of the matching block and sum the total cost; the costs are extracted from a larger receptive area. We use the same strategy in the network. In particular, we use twin networks that share weights between the two input images, as shown in Fig. 3.

Down-sampling is applied to the left stereo image for feature extraction. To obtain a maximum information stream, the dense module [15] is adopted to memorize features. These dense blocks combine batch normalization, a convolution layer [16], and Leaky ReLU (rectified linear unit [6]) activation. The output layer uses a single convolution block, because batch normalization would influence the depth scale.

We subtract the feature vectors of the left view and right view, computed in the previous module, to create a rough disparity map. SGM uses a winner-takes-all strategy to select the depth with the smallest Euclidean distance.
Therefore, the network uses up-sampling with stride 2 on the features that were down-sampled 32 times. This method carries deep-level features into sub-resolution features.
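A minimal PyTorch sketch of the twin-network idea described above is shown below, assuming illustrative channel counts and growth rates (the class names and layer sizes are ours, not the paper's):

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """BN -> Leaky ReLU -> Conv, with the input concatenated to the output
    (dense connectivity in the spirit of [15]; sizes are illustrative)."""
    def __init__(self, in_ch, growth=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(in_ch),
            nn.LeakyReLU(0.1),
            nn.Conv2d(in_ch, growth, kernel_size=3, padding=1),
        )
    def forward(self, x):
        return torch.cat([x, self.body(x)], dim=1)

class StereoFeatureNet(nn.Module):
    """Twin (weight-sharing) feature extractor: the SAME module processes both
    views, the left/right features are subtracted to form a rough disparity
    representation, and a stride-2 up-sampling restores the resolution."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1),   # down-sampling stage
            DenseBlock(32),                              # 32 -> 64 channels
            DenseBlock(64),                              # 64 -> 96 channels
        )
        # single conv output, no BN (BN would disturb the depth scale)
        self.out = nn.ConvTranspose2d(96, 1, kernel_size=4, stride=2, padding=1)
    def forward(self, left, right):
        fl, fr = self.features(left), self.features(right)  # shared weights
        return self.out(fl - fr)                             # rough disparity map
```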
The object category loss $L_{cls}$ uses multi-class cross-entropy [21]. The object center point prediction is a discrete point; we use linear regression to predict the center point of the object. For the loss function of the object frame, we take the smooth L1 loss, which makes the network more robust when the object frame is regressed.
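These two terms can be sketched as follows (the equal weighting of the two terms is our assumption):

```python
import torch
import torch.nn.functional as F

def detection_loss(cls_logits, cls_target, box_pred, box_target):
    """Loss terms named in the text.

    cls_logits : (N, num_classes) raw scores   cls_target : (N,) class indices
    box_pred   : (N, 4) predicted frames       box_target : (N, 4) ground truth
    """
    l_cls = F.cross_entropy(cls_logits, cls_target)    # multi-class cross-entropy
    l_box = F.smooth_l1_loss(box_pred, box_target)     # robust frame regression
    return l_cls + l_box
```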
Fig. 5. Depth estimation error for the cost volume filter: left stereo image (left), predicted depth image (middle), D1 error (right).
E. Inline depth information

The generated image is limited by the Euler angles of the stereo cameras. Moreover, the object has a certain volume, so a single value defining the distance between the stereo camera and the object cannot be computed directly. The bounding boxes computed by the object detection module are the maximum envelope of the object, while the depth estimation network provides a depth map. In the depth map, the depth values of a single object are usually continuous, so we can consider the depth values within an object to be the largest interior-point set. Here, RANSAC (random sample consensus [22]) is used to find the maximum interior-point set inside the outer envelope, and the single depth value of the object is obtained by the centroid method: obtain the largest set of interior points within the prediction frame, and use the centroid distance as the object depth.
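A minimal sketch of this inlier-plus-centroid scheme, assuming a simple one-point RANSAC hypothesis and a relative depth tolerance (the tolerance and iteration count are our assumptions, not the paper's settings):

```python
import numpy as np

def object_depth(depth_map, box, iters=200, tol=0.15):
    """Single depth value for a detected object.

    depth_map : HxW metric depth map    box : (u1, v1, u2, v2) prediction frame
    RANSAC-style search for the largest consensus set of depths inside the
    box; the centroid (mean) of the inliers is returned as the object depth.
    """
    u1, v1, u2, v2 = box
    d = depth_map[v1:v2, u1:u2].ravel()
    d = d[np.isfinite(d) & (d > 0)]            # drop missing / invalid depths
    if d.size == 0:
        return float("nan")
    best_inliers = np.empty(0)
    rng = np.random.default_rng(0)
    for _ in range(iters):
        hypo = rng.choice(d)                    # 1-point model: a candidate depth
        inliers = d[np.abs(d - hypo) < tol * hypo]
        if inliers.size > best_inliers.size:
            best_inliers = inliers
    return float(best_inliers.mean())           # centroid of the interior-point set
```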
IV. EXPERIMENT

A. Brightness processing

Fig. 5 shows depth images generated through the network. It is easy to find that the accuracy of stereo matching varies with the features in different fields of view. However, the network is learned end-to-end, and no hyper-parameters need to be defined: aggregation is determined by the weight matrix and the convolution kernel. This lets the disparity network learn effective object structures and perceive untextured areas over a wide range. Compared to the original SGBM, approximations in large untextured regions are avoided. Compared with taking maximum and minimum values, the soft weighting method avoids aliasing in the depth information, and the object edges are smooth. For benchmarking, we tested with the Scene Flow dataset and KITTI. Evaluating the test set, the 3-pixel threshold error rate reached 3.4%; the model was then fine-tuned on the KITTI 2015 and KITTI 2012 [23] datasets, respectively, and evaluated on their test sets. It restores the object structure well. Fig. 5 shows the generated depth map and the D1 error. As can be seen from the depth map generated with multi-resolution interpolation, the depth values of objects are continuous, and the edges of objects are clearly visible. The D1-all error rate is 3.55%. A deeper network could achieve better results, but our goal is simply to represent semantic information and depth information; no additional network is needed to add texture details.

B. Object detection base

The network takes the left stereo image as input and down-samples it three times, with 32 channels. In subsequent modules, the down-sampled channels are concatenated with the underlying model as additional information.

Each network is trained with the same settings and tested at single precision. The operating environment is a GTX 1060M; results are shown in Table I. In the DenseNet model, since the information of each layer is forwarded, the bottom layer also has high-level information. Comparing DenseNet-121 and DenseNet-169, we found that although the number of network layers increased, the overall AP decreased; however, the increase in the number of layers raises the object recognition rate and IoU for medium and large areas. With ResNet-53, the effect is slightly worse, but the AP of small objects increases. With the MobileNet model, we try a lighter network: although it gets a low score, it can process stereo images in real time on an embedded system or a standard CPU.

TABLE I. Object detection module results on the COCO dataset 2017: APS is the AP of small objects, APM the AP of medium objects, and AP the AP over all scale sizes.
V. CONCLUSIONS

In this paper, we combine object detection with depth estimation to compute the distance between an object and a stereo camera in real time. The two-part training strategy makes it possible to train a one-stage estimation network without creating a new dataset. A larger 3D convolution module can be introduced in object detection to obtain better object prediction frames. Because of the camera angle, an object may cover a range of depths, and the differences between the distance calculated by the centroid method and the edge depth are large. Future work will try to limit the depth range or re-project it into two-dimensional space. Semi-global aggregation can be introduced into the disparity estimation to refine the texture and make the object depth smoother. For reflective areas and thin structures, more image detail processing is required.

REFERENCES
[1] H. Su, S. E. Ovur, X. Zhou, W. Qi, G. Ferrigno, and E. De Momi, "Depth vision guided hand gesture recognition using electromyographic signals," Advanced Robotics, pp. 1–13, 2020.
[2] Z. Li, Y. Yuan, L. Luo, W. Su, K. Zhao, C. Xu, J. Huang, and M. Pi, "Hybrid brain/muscle signals powered wearable walking exoskeleton enhancing motor ability in climbing stairs activity," IEEE Transactions on Medical Robotics and Bionics, vol. 1, no. 4, pp. 218–227, Nov. 2019.
[3] Z. Li, J. Li, S. Zhao, Y. Yuan, Y. Kang, and C. L. P. Chen, "Adaptive neural control of a kinematically redundant exoskeleton robot using brain–machine interfaces," IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 12, pp. 3558–3571, 2019.
[4] Y. Liu, W. Su, Z. Li, G. Shi, X. Chu, Y. Kang, and W. Shang, "Motor imagery based teleoperation of a dual-arm robot performing manipulation tasks," IEEE Transactions on Cognitive and Developmental Systems, vol. 11, no. 3, pp. 414–424, Sept. 2019.
[5] A. Krizhevsky, I. Sutskever, and G. Hinton, "Imagenet classification with deep convolutional neural networks," Advances in Neural Information Processing Systems, vol. 25, no. 2, 2012.
[6] V. Nair and G. E. Hinton, "Rectified linear units improve restricted boltzmann machines," The 27th International Conference on Machine Learning, pp. 807–814, 2010.
[7] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," IEEE Conference on Computer Vision and Pattern Recognition.
[8] F. Chollet, "Xception: Deep learning with depthwise separable convolutions," IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[9] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[10] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015.
[11] S. Ren, K. He, R. Girshick, and J. Sun, "Faster r-cnn: Towards real-time object detection with region proposal networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137–1149, 2017.
[12] J. Redmon and A. Farhadi, "Yolov3: An incremental improvement," IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[13] S. Ullman, "The interpretation of structure from motion," Proceedings of the Royal Society of London. Series B. Biological Sciences, vol. 203, no. 1153, pp. 405–426, 1979.
[14] H. Hirschmuller, "Stereo processing by semiglobal matching and mutual information," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 2, pp. 328–341, 2007.
[15] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700–4708.
[16] S. Lawrence, C. L. Giles, A. C. Tsoi, and A. D. Back, "Face recognition: A convolutional neural-network approach," IEEE Transactions on Neural Networks, vol. 8, no. 1, pp. 98–113, 1997.
[17] F. N. Fritsch and R. E. Carlson, "Monotone piecewise cubic interpolation," SIAM Journal on Numerical Analysis, vol. 17, no. 2, pp. 238–246, 1980.
[18] K. T. Gribbon and D. G. Bailey, "A novel approach to real-time bilinear interpolation," in Proceedings. DELTA 2004. Second IEEE International Workshop on Electronic Design, Test and Applications. IEEE, 2004, pp. 126–131.
[19] T.-Y. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection," IEEE Conference on Computer Vision and Pattern Recognition, July 2017.
[20] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft coco: Common objects in context," European Conference on Computer Vision, pp. 740–755, 2014.
[21] P.-T. De Boer, D. P. Kroese, S. Mannor, and R. Y. Rubinstein, "A tutorial on the cross-entropy method," Annals of Operations Research, vol. 134, no. 1, pp. 19–67, 2005.
[22] M. A. Fischler and R. C. Bolles, "Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography," Communications of the ACM, vol. 24, no. 6, pp. 381–395, 1981.
[23] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, "Vision meets robotics: The kitti dataset," The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231–1237, 2013.