0% found this document useful (0 votes)

36 views6 pages

Object Detection and Localization Using Stereo Cameras

Uploaded by

swaroopatreya5

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

36 views6 pages

Object Detection and Localization Using Stereo Cameras

Uploaded by

swaroopatreya5

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 6

2020 5th International Conference on Advanced

Robotics and Mechatronics (ICARM)

Object Detection and Localization Using Stereo Cameras

Haoran Wu1 , Hang Su2 , Yueyue Liu3 and Hongbo Gao1

Abstract— Camera systems have become increasingly pop- with the outside world. Loss of visual function will greatly
ular because cameras are cheap and easy to deploy. Compared limit the ability to interact with the world. It limits blinds’
with other depth cameras, the stereo camera is small, and it movement in daily life, which they move slowly through
is easily carried by subjects. Through a fixed baseline, the
stereo camera is able to compute depth information. However, constant groping. Also, they cannot know whether there are
the traditional stereo matching algorithm can not compute the obstacles in front of them. Therefore, how the machine can
depth information on the edge of the image. Meanwhile, due interact with the visually impaired people in a special way
to the large amount in the 3D point cloud, there is no specific from the extracted external information is a big challenge.
numerical relationship between semantic information and depth Depth cameras obtain rich visual information, including
information. In order to solve this problem, estimating depth
and semantic information in an accurate way is required. depth information, image color, and semantic information af-
A deep neural network model is used to predict semantic ter processing by the brain. However, it isn’t easy to quickly
information and depth at the same time. Further, we propose accept and understand this information without visual inter-
a robust method to deal with variation brightness and improve action. Humans cannot understand three-dimensional depth
performance under actual conditions. information. In this paper, we seek a method that can extract
I. I NTRODUTION semantic features from images and extract depth information
Cameras used to be the main sensor for mobile robots for fast interaction with visually impaired people.
to avoid obstacles. Obtaining depth information and extract The rise of deep learning has allowed machines to
obstacle information through monocular, binocular, and RGB recognize real-world knowledge. Alexnet [5] uses a convo-
depth cameras. Stereo cameras can compute depth value lutional network to obtain image information. Compared to
with intrinsic camera parameters in the disparity map. RGB a fully connected model, sparse perception can filter useless
depth cameras can obtain a more accurate distance value information, enable the machine to focus and extract contour
by matching the structured light data. Obstacle information features. Meanwhile, it reduces the computation cost of the
will be used for motion planning or other control strategies. network model. Batch Normalize [6] makes it possible to
However, in human-computer interaction [1] [2] [3] [4]. deepen the number of model network layers. VGG [7] and
When passing 3D point cloud data to people, humans cannot Xception [8] have become common models for large-scale
process these data. When the blind wear guide glasses, image recognition networks. Resnet [9] uses the structure of
cameras can only deliver orders for the blind through voice redundant blocks to limit the value of gradient to return to
or vibration motor. Obviously, the visual redundancy infor- a reasonable range.
mation is difficult to inform the blind. The accuracy of object recognition has gradually im-
The visually impaired people have been a particular proved with the advancement of network models. The accu-
group. In the world, the number of blind people is close racy of object recognition has exceeded human performance.
to 300 million. Scientific research institutions have invested Object detection, as a branch of object recognition, has also
little in auxiliary equipment developed for those particular been improved by most researchers. Regardless of the input
groups. The main reason was that the weight and volume of size, Spatial Pyramid Pooling [10] can produce a fixed-size
sensors make it difficult to carry by the visually impaired output, use different scales as inputs to obtain pooling fea-
people. Vision is an important organ for human interaction tures for a fixed vector length. The Fast-RCNN [11] process
is more compact and significantly improves object detec-
This work was supported by “the Fundamental Research Funds for the tion speed. However, the two-step detection increases the
Central Universities”, “the Science and Technology Innovation Planning
Project of Ministry of Education of China”, “NVIDIA NVAIL program”,
computational overhead and fails to meet industry standards
“the National Natural Science Foundation of China under Grant No. in real-time. YOLO [9] adopts a one-step detection rate to
U1804161”, and Key Laboratory of Advanced Perception and Intelligent achieve the real-time effect of the network prediction speed
Control of High-end Equipment of Ministry of Education (Anhui Polytech-
nic University, Wuhu, China, 241000) under Grant Nos. GDSC202001 and
while ensuring the accuracy of object recognition. YOLO-v3
GDSC202007. And experiments are conducted on NVIDIA DGX-2. [12] further solves the problem of multi-scale observation,
1 Haoran Wu and Hongbo Gao are with the Department of Automa-
thereby improving the efficiency of object recognition, and
tion, University of Science and technology of China, Hefei, China. can identify smaller objects in the picture.
haoran.wu@outlook.com and ghb48@126.com
2 Hang Su is with the Department of Electronics, Information However, object detection can detect categories of ob-
and Bioengineering, Politecnico di Milano, 20133, Milan, Italy. jects and their pixel position. It cannot determine the position
hang.su@polimi.it of the object in three-dimensional space. This paper presents
3 Yueyue Liu is in School of Automation Science and Engi-
neering, South China University of Technology, Guangzhou, China. a new network to predict the position of the object in three-
lyy8313167@163.com dimensional space through stereo cameras.

978-1-7281-6479-3/20/$31.00 ©2020 IEEE 628

Authorized licensed use limited to: UNIVERSITY OF ROCHESTER. Downloaded on September 20,2020 at 20:55:59 UTC from IEEE Xplore. Restrictions apply.
2020 5th International Conference on Advanced
Robotics and Mechatronics (ICARM)

II. D EPTH E STIMATION AND SEMANTIC INFORMATION and right image coordinate systems are at the intersections
FROM C AMERAS O1 and O2 of the camera’s optical axis and the plane. The
A. Monocular feature corresponding coordinates of a point P in space in the left
and right images are P1 (u1 , v1 ) and P2 (u2 , v2 ), respectively.
It is true that people can obtain a certain depth infor-
From the triangular geometric relationship:
mation through one eye, but there are some factors that are
ignored. One is that people know the object model (prior b×f
knowledge), which includes size, shape, and color. When D= (1)
UR − UL
people observe an object with one eye, a rough distance can
be inferred based on the model we memorized; For example, where D is pixel depth. Deﬁne UR and UL is a pixel value,
humans can still identify categories of objects by observing corresponding to the right and left image. f is normalized
some features of the object; Part of the object may be blurred focal length. b is the baseline, which is the difference
due to camera focus issues. Humans can still identify object between the two cameras optical axes
categories by some local features. Second, when people In stereo matching, the left and right cameras have
observe objects with one eye, the human eye is actually different shooting angles. Generally, two images have a part
shaking. It is equivalent to the movement of a monocular of the same scene and a part of different scenes at the
camera, which is similar to the principle of structure from edges. When the camera capture perspective, the traditional
motion [13]. The monocular camera can obtain the depth stereo matching algorithm exists missing edge depth value.
information by calculating the disparity between the current As shown in Fig. 2, it uses a semi-global matching base left
frame and adjacent frames. The basis for image matching is stereo image. We can ﬁnd the part of depth is missing due
image texture. Complex image textures will achieve better to right view cannot capture those scenes.
matching results. The lack of object texture and low texture
make part of the depth information missing. Further, the
estimation of the camera’s Euler angle and spatial position
will also cause a large deviation.

Fig. 2. Depth map generate from Semi-global matching. Missing part of

depth information

C. Depth estimation
Fig. 1. Relationship between pixel in stereo cameras and co-visible point
in three dimensional space SGM (semi-global matching [14]) is a cost aggregation
algorithm that is very similar to the cost aggregation in
local stereo matching algorithms. The global stereo matching
B. Stereo feature algorithms are adopted to achieve the same global energy
Stereo vision measurement is based on the disparity function and minimize cost resources. In this case, more or
map. Fig. 1 shows a simple stereo imaging principle. The all pixels of the image need to participate in the current pixel.
distance between the left and right cameras center is the A multi-path constraint aggregation for neighborhood op-
baseline. The origin of the camera coordinate system is at erations (neighborhood summation, weighted average, etc.)
the optical center of the camera lens. The coordinate system within a certain range. The cost aggregation process of the
is shown in Fig. 1. In fact, the imaging plane of the camera current pixel is affected by all pixels in multiple directions.
is behind the optical center of the lens. The left and right Neighborhood pixels in the different paths will affect the
imaging planes are drawn at f in front of the optical center total cost of the current pixel. It does not only guarantee the
of the lens. The u-axis and v-axis of this virtual image constraint of the global pixel but also reduces computational
plane coordinate system O1 (u, v), and the camera coordinate resource and avoid complex operators.
system are x-axes, and y-axis directions are the same, which The gradient information of the preprocessed image is
can simplify the calculation process. The origins of the left obtained by the gradient cost of the sampling-based method.

978-1-7281-6479-3/20/$31.00 ©2020 IEEE 629

Fig. 3. Two sub-modules estimate the binocular image depth and predict the object frame.

The sum of absolute differences (SAD) cost of the Function I[] is boolean function. Lr is the cost function
original image obtained based on the sampling method. accumulated in the relative area in the left image, P1 and P2
n n are the smoothing penalties which determine the disparity

2
2
between the pixel and the neighboring point is small or
C(u, v, d) = |IL (u + i, v + j)−
(2) large. Add all matching costs r (we choose n × n sliding
i=− n n
2 j=− 2
windows shown in (2)), to get the total matching cost. Post-
IR (u + d + i, v + j)| processing needs to generate a smooth disparity for the sub-
where Image matrics are RM ∗N , C(u, v, d) is aggregation pixel interpolation of the generated cost graph.
cost in pixel (u, v). n × n is relative area of pixel (u, v). IR III. N ETWORK A RCHITECTURE
and IL are grayscale image of left and right cameras.
There is a tailing effect due to the dynamic program- For estimating object depth in one-stage, we combine
ming algorithm. It is easy to generate error matches at the the deep estimation module with an object detection network.
edge of objects. Using state programming to accumulate one- Separated networks require additional memory storage and
dimensional energy will propagate the wrong depth infor- consume more computing resources. Some visual features in
mation to the subsequent paths. The semi-global algorithm both modules are common, as shown in Fig. 2.
uses information in multiple directions to try to eliminate the
A. Depth estimation
interference of misinformation, which can obviously reduce
the tailing effect produced by the dynamic programming In traditional stereo matching, the algorithm generates a
algorithm. disparity map by matching pixel pairs between images along
In this paper, the proposed algorithm is adopted to es- the scan path. To solve the non-textured areas, they increase
tablish a global Markov energy equation through constraints the size of the matching block and sum total cost. These
in one-dimensional paths in multiple directions on the image. costs are extracted from the larger receiving area. We use
The total matching cost of each pixel will accumulate all path the same strategy in the network. In particular, we use twin
information. Accumulate energy in each direction will add networks to share weights between two input images. As
the matching costs to get the total matching cost, as shown shown in Fig. 2.
in the following formula: Down-sampling is applied to the left stereo image
for feature extraction. To obtain a maximum information
stream, the dense module [15] is adopted to memory feature.
Lr (D) = {C(p, Dp ) + (P1 I[Dp − Dq = 1]+ These dense blocks combined by batch normalization and
p q∈N p
(3) convolution layer [16], Leaky Relu (rectiﬁed linear Unit Relu
(P1 I[Dp − Dq > 1])} [6]) activation. The output layer uses a single convolution
q∈N p block because the batch normalization will inﬂuence the
depth scale.
Ltotal (p, d) = Lr (p, d) (4) We subtract the feature vector of the left view and right
r
view, which compute in the previous module to create a
where Nq is relative pixel around current pixel p, Dp is rough disparity map. SGM use winner takes all strategy
disparity value which can get from the greyscale image. to select the depth with the smallest Euclidean distance

&&$" C. Twice training strategy

The stereo image should be used both in predicting
bounding boxes and estimate depth information. There is no
"! database with both object detection and provide the ground-
truth value measured by higher precise sense. We take twice
!%
the training strategy to allow the network to have an ability
"!& ")!
"* to estimates and generate a bounding box by calculating only
#!
once. The stereo disparity database used is the KITTI, and
#% #!
the object detection database is selected from Microsoft’s
"*
!%
"!& ")!
*&$&&)"$
COCO database [20].
#!
During the training process, because the depth estimates
#% #! the true image texture, in the multi-scale extraction module,
!%
"!& ")!
the input left image down-sampling to multiple resolutions
"* "%&"'
#! for both object detection module and depth estimation. It
prevents some missing edge features from getting additional
errors due to filter by upper convolution layer. At the same
!%"'
time, multiple down-sampling images will be used for object
!%
detection module with different scales. This allows objects
&
"!("'&"!
"$ ,&"!
+' with small scales in the prediction map and concatenates
lower features with deep expression.
D. Loss function
Fig. 4. Object detection with multi-resolution searchinng. We use a smooth L1 loss function to train our model.
Compared to L2 loss, smoothing L1 is robust at parallax
discontinuities and has low sensitivity to outliers or noise.
between two feature vectors. The network aggregates the The loss function of the training model is defined as:
costs across multiple resolutions. Still, our target is to obtain N
ˆ d) = 1
a single depth value. So it is not required higher resolution L(d, fL (|dˆ − d|) (5)
to refine image details, such as some edge of objects. Cubic N i=1 1
interpolation [17] may make the edges of the depth image

smoother, but it involves a higher computational cost. We where dˆ − d illustrate the absolute depth prediction, and
use bilinear interpolated [18] to refine image details and use N is the number of valid pixels which less than maximum
them in the up-sampling module. depth threshold. The smoothing loss function can be more
robust when there is more noise in the image.
B. Object detect strategy
x − 0.5 |x| 1
f L1 = 1 2 (6)
Fig. 4 illustrates the network graph. This module is 2x |x| < 1
modified from the Yolo network to detect the objects. The
dense module is also involved in this approach. Compared The loss function L1 determines the robustness. It prevents
with the residual module, it increases a new information flow involving erroneous data in the network from result map,
in order to reduce the negative effects of gradients cause by such as bright spots in the image due to the reflection of
pooling. The network adopts a fusion approach similar to the glass, which is misleading areas. In the object detection
Feature Pyramid Networks [19], and performs detection on module, the network needs to predict the frame and the
a feature map of multiple scales. center point of the object according to the object category,
object. The following formula defines the loss function for
This module uses three times to detect in multi-
the detection network.
resolution, which were detected at 32x down-sampling, 16x pi
down-sampling, and 8x downsampling. We do three times 1
L({pi }, {ti }) = Lcls (pi ∗ − pi )+
down-sampling for left stereo image and concatenate with Ncls i=1
the underlying feature network. Most features in high times ti
1
down-sampling are features of large-scale objects, and those Lreg (ti ∗ − ti )+
small-scale object features are often hidden in the low time Nreg i=1
down-sampling layer. In order to introduce low-level features 1
into the upper layers, up-sampling can be used in the network fL1 (bi ∗ − bi ) (7)
N
is that the deep network shows better feature expression. i=x,y,w,h

Therefore, the network uses up-sampling with stride 2 to The object category loss Lcls uses multi-class cross
down-sampling 32 times. This method can lead to deep-level entropy [21]. Object center point prediction is a discrete
features into sub-resolution features. point. We use linear regression to predict the center point of

Fig. 5. Depth estimation Error for the Cost Volume ﬁlter. left stereo image(left),predict depth image(middle),D-1 error(right)

the object. For the loss function of the object frame, we take B. Object detection base
the L1 smoothing loss. When the object frame is returned, The network begins to input left stereo image and down-
the network would be more robust. sampling the image three times, with a number of channels
E. Inline depth information of 32. In subsequent models, the down-sampling channels
are used as information to concatenate with the underlying
The generated image is limited by the Euler angle of
model.
stereo cameras. Still, the object has a certain volume so
Each network is trained with the same settings and
that it cannot calculate a single value to deﬁne the distance
tested with single precision. The operating environment is
between stereo camera and objects. Object detection module
GTX1060M, as shown in Table 1. In the DenseNet model,
computes bounding boxes are the maximum envelope of the
since the information of each layer is forwarded, the bot-
object. Meanwhile, the depth estimation network provides
tom layer also has high-level information. But compare
a depth map. In the depth map, the depth values of a
DenseNet-121 and DenseNet-169. We found that although
single object are often continuous. We can consider the
the number of network layers increased, the overall AP
depth values in an object are the largest interior-point set.
decreased. However, the increase in the number of network
Here, RANSAC (Random sample consensus [22]) is used to
layers increases the object recognition rate and IOU for
ﬁnd the maximum internal point set of the outer envelope,
medium and large areas. ResNet-53, the effect is slightly
and the single depth value of the object is obtained by the
worse, but the AP value of small objects increases. In the
centroid method. Obtain the largest set of interior points to
Mobile Net model, we try to use it in more lighten Networks.
the prediction frame, and use the centroid distance to obtain
Although it gets a low score, it can process stereo images in
object depth information.
real-time in an embed system or standard CPU.
IV. E XPERIMENT
TABLE I
A. Brightness processing O BJECT D ETECT MODULE RESULT IN COCO DATASET 2017, APS AP
Fig. 5 shows generate depth images through the net- OF A SMALL OBJECT,APM AP OF THE MIDDLE OBJECT,AP AP OF ALL

work. It is easy to ﬁnd that the accuracy of stereo matching SCALE SIZE

depends on whether there is discrimination in image blocks.

If part of the image has a non-texture area or has a large backbone AP APS APM
number of repeated textures (such as walls) and lacks valid Dense-121 23.6 18.6 25.7
Dense-169 22.5 17.4 25.8
texture information, the N calculation is likely to match Resnet-53 21.4 19.8 23.9
the wrong pixel block. Small blocks with valid gradients MobileNetv2 16.4 15.9 17.7
will have better discrimination and are less likely to cause MobileNetv3 17.7 16.9 18.0
mismatches. For pixels with inconspicuous gradients, it is
difﬁcult to estimate the depth effectively because there is
no discrimination in block matching. Conversely, the depth C. Depth module
information obtained where the pixel gradient is relatively Cost Volume Filter is based on the idea of SGM. The
accurate. cost of multiple directions is aggregated to obtain image

features in different fields of view. However, the network [7] K. Simonyan and A. Zisserman, “Very deep convolutional networks for
is learned end-to-end, and no hyper-parameters need to be large-scale image recognition,” IEEE Conference on Computer Vision
and Pattern Recognition.
defined. Aggregation is determined by the weight matrix and [8] F. Chollet, “Xception: Deep learning with depthwise separable convo-
the convolution kernel. This makes the parallax network learn lutions,” IEEE Conference on Computer Vision and Pattern Recogni-
effective object structures and perceive untextured areas over tion, 2016.
[9] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for
a wide range. Compared to the original SGBM, approxi- image recognition,” IEEE Conference on Computer Vision and Pattern
mations in large untextured regions are avoided. Compared Recognition, 2015.
with the maximum and minimum values, the soft weighting [10] ——, “Spatial pyramid pooling in deep convolutional networks for vi-
sual recognition,” IEEE Transactions on Pattern Analysis and Machine
method avoids aliasing in the depth information and the Intelligence, 2015.
object edges are smooth. For benchmarking, we tested with [11] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-
the scene flow dataset in KITTI. By evaluating the test time object detection with region proposal networks,” IEEE Transac-
tions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp.
set, the 3-pixel threshold error rate reached 3.4%, and the 1137–1149, 2017.
KITTI 2015 and KITTI 2012 [23] datasets were fine-tuned [12] J. Redmon and A. Farhadi, “Yolov3: An incremental improvement.”
respectively. The model is then evaluated on the test set. IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[13] S. Ullman, “The interpretation of structure from motion,” The Royal
It can restore the object structure well. Fig. 5 shows the Society of London. Series B. Biological Sciences, vol. 203, no. 1153,
generated depth map and the D1 error. It can be seen from the pp. 405–426, 1979.
generated depth map with multi-resolution interpolation. The [14] H. Hirschmuller, “Stereo processing by semiglobal matching and
mutual information,” IEEE Transactions on Pattern Analysis and
depth value of objects is continuous. And the edge of objects Machine Intelligence, vol. 30, no. 2, pp. 328–341, 2007.
can be clearly seen. The D1-all error rate is 3.55%. A deeper [15] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely
network can go for better results, but our goal is to simply connected convolutional networks,” 2017, pp. 4700–4708.
[16] S. Lawrence, C. L. Giles, A. C. Tsoi, and A. D. Back, “Face recog-
represent semantic information and depth information. No nition: A convolutional neural-network approach,” IEEE Transactions
more network is needed to add texture details. on Neural Networks, vol. 8, no. 1, pp. 98–113, 1997.
[17] F. N. Fritsch and R. E. Carlson, “Monotone piecewise cubic inter-
V. CONCLUSIONS polation,” SIAM Journal on Numerical Analysis, vol. 17, no. 2, pp.
238–246, 1980.
In this paper, we combine object detection with a depth [18] K. T. Gribbon and D. G. Bailey, “A novel approach to real-time
estimation to compute the distance between an object and a bilinear interpolation,” in Proceedings. DELTA 2004. Second IEEE
stereo camera in real-time. The two-training strategy makes International Workshop on Electronic Design, Test and Applications.
IEEE, 2004, pp. 126–131.
it possible to train a one-time estimation network without [19] T.-Y. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, and S. Belongie,
creating a new data set. A larger 3D convolution module “Feature pyramid networks for object detection,” IEEE Conference on
can be introduced in object detection to obtain better results Computer Vision and Pattern Recognition, July 2017.
[20] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan,
in the object prediction frame. Because of the angle of the P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in
camera, an object may have different depth ranges. The context,” European Conference on Computer Vision, pp. 740–755,
differences between the distance calculated by the centroid 2014.
[21] P.-T. De Boer, D. P. Kroese, S. Mannor, and R. Y. Rubinstein, “A
method and the edge depth are large. In future work will try tutorial on the cross-entropy method,” Annals of operations research,
to limit the depth range or re-projection in two-dimensional vol. 134, no. 1, pp. 19–67, 2005.
space. Semi-global aggregation can be introduced into the [22] M. A. Fischler and R. C. Bolles, “Random sample consensus: a
paradigm for model fitting with applications to image analysis and
disparity estimation to refine the texture. Make object depth automated cartography,” Communications of the ACM, vol. 24, no. 6,
smoother. For reflective areas and thin structures, more image pp. 381–395, 1981.
detail processing is required. [23] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics:
The kitti dataset,” The International Journal of Robotics Research,
R EFERENCES vol. 32, no. 11, pp. 1231–1237, 2013.
[1] H. Su, S. E. Ovur, X. Zhou, W. Qi, G. Ferrigno, and E. De Momi,
“Depth vision guided hand gesture recognition using electromyo-
graphic signals,” Advanced Robotics, pp. 1–13, 2020.
[2] Z. Li, Y. Yuan, L. Luo, W. Su, K. Zhao, C. Xu, J. Huang, and M. Pi,
“Hybrid brain/muscle signals powered wearable walking exoskeleton
enhancing motor ability in climbing stairs activity,” IEEE Transactions
on Medical Robotics and Bionics, vol. 1, no. 4, pp. 218–227, Nov.
2019.
[3] Z. Li, J. Li, S. Zhao, Y. Yuan, Y. Kang, and C. L. P. Chen, “Adaptive
neural control of a kinematically redundant exoskeleton robot using
brain–machine interfaces,” IEEE Transactions on Neural Networks and
Learning Systems, vol. 30, no. 12, pp. 3558–3571, 2019.
[4] Y. Liu, W. Su, Z. Li, G. Shi, X. Chu, Y. Kang, and W. Shang, “Motor
imagery based teleoperation of a dual-arm robot performing manip-
ulation tasks,” IEEE Transactions on Cognitive and Developmental
Systems, vol. 11, no. 3, pp. 414–424, Sept. 2019.
[5] A. Krizhevsky, I. Sutskever, and G. Hinton, “Imagenet classification
with deep convolutional neural networks,” Advances in Neural Infor-
mation Processing Systems, vol. 25, no. 2, 2012.
[6] V. Nair and G. E. Hinton, “Rectified linear units improve restricted
boltzmann machines,” The 27th International Conference on Machine
Learning, pp. 807–814, 2010.

Authorized licensed use limited to: UNIVERSITY OF ROCHESTER. Downloaded on September 20,2020 at 20:55:59 UTC from IEEE Xplore. Restrictions apply.

Object Recognition
No ratings yet
Object Recognition
49 pages
I2c-Net Using Instance-Level Neural Networks For M
No ratings yet
I2c-Net Using Instance-Level Neural Networks For M
8 pages
(Paper 5) Kira2012
No ratings yet
(Paper 5) Kira2012
8 pages
Integrating Visual and Range Data For Robotic Object Detection
No ratings yet
Integrating Visual and Range Data For Robotic Object Detection
12 pages
Unsupervised Domain Adaptation For Depth Prediction From Images
No ratings yet
Unsupervised Domain Adaptation For Depth Prediction From Images
14 pages
Monocular 3D Detection for Autonomous Cars
No ratings yet
Monocular 3D Detection for Autonomous Cars
2 pages
Applsci 12 01354 v2
No ratings yet
Applsci 12 01354 v2
14 pages
IJRPR10585
No ratings yet
IJRPR10585
4 pages
Optimized Visual Recognition Algorithm in Service Robots: Junwwu, Wei Cai, Shi M Yu, Zhuo L Xu Andxueyhe
No ratings yet
Optimized Visual Recognition Algorithm in Service Robots: Junwwu, Wei Cai, Shi M Yu, Zhuo L Xu Andxueyhe
11 pages
Autoshape: Real-Time Shape-Aware Monocular 3D Object Detection
No ratings yet
Autoshape: Real-Time Shape-Aware Monocular 3D Object Detection
11 pages
Advancing 3D Point Cloud Understanding Through Deep Transfer Learning: A Comprehensive Survey
No ratings yet
Advancing 3D Point Cloud Understanding Through Deep Transfer Learning: A Comprehensive Survey
38 pages
Machine Learning for Object Detection
No ratings yet
Machine Learning for Object Detection
8 pages
A Survey of 3D Object Detection: Wei Liang Pengfei Xu Ling Guo Heng Bai Yang Zhou Feng Chen
No ratings yet
A Survey of 3D Object Detection: Wei Liang Pengfei Xu Ling Guo Heng Bai Yang Zhou Feng Chen
25 pages
Object Detection and Translation For Bli
No ratings yet
Object Detection and Translation For Bli
6 pages
Efficient Detection and Tracking of Human Using 3D LiDAR Sensor
No ratings yet
Efficient Detection and Tracking of Human Using 3D LiDAR Sensor
12 pages
Object Detection Using ELAN
No ratings yet
Object Detection Using ELAN
6 pages
A Low-Cost Stereo System For 3D Object Recognition
No ratings yet
A Low-Cost Stereo System For 3D Object Recognition
7 pages
Dark Green Light Green White Corporate Geometric Company Internal Deck Business Presentation
No ratings yet
Dark Green Light Green White Corporate Geometric Company Internal Deck Business Presentation
17 pages
Real Time Object Detection Using Deep Learning
No ratings yet
Real Time Object Detection Using Deep Learning
6 pages
A Lightweight Robust Distance Estimation Method
No ratings yet
A Lightweight Robust Distance Estimation Method
20 pages
Remote Distance Measurement From A Single Image by Automatic Detection and Perspective Correction
No ratings yet
Remote Distance Measurement From A Single Image by Automatic Detection and Perspective Correction
24 pages
BirdNet A 3D Object Detection Framework
No ratings yet
BirdNet A 3D Object Detection Framework
8 pages
Deep Learning for Object Tracking
No ratings yet
Deep Learning for Object Tracking
3 pages
3D Tomatoes' Localisation With Monocular Cameras Using Histogram Filters
No ratings yet
3D Tomatoes' Localisation With Monocular Cameras Using Histogram Filters
15 pages
Neural Networks for Depth Sensors
No ratings yet
Neural Networks for Depth Sensors
11 pages
Applied Sciences: Applications of Computer Vision in Automation and Robotics
No ratings yet
Applied Sciences: Applications of Computer Vision in Automation and Robotics
3 pages
IJCDS MSDF ResNET
No ratings yet
IJCDS MSDF ResNET
11 pages
A Deep Learning Based Assistant For The Visually Impaired
No ratings yet
A Deep Learning Based Assistant For The Visually Impaired
11 pages
Document
No ratings yet
Document
51 pages
Yang, Liu Et Al 2022 - Graph R-CNN - Towards Accurate 3D Object Detection With Semantic-Decorated Local Graph
No ratings yet
Yang, Liu Et Al 2022 - Graph R-CNN - Towards Accurate 3D Object Detection With Semantic-Decorated Local Graph
18 pages
AI Models For 3D Object Detection in Autonomous Systems: Leveraging LiDAR and Depth Sensing
No ratings yet
AI Models For 3D Object Detection in Autonomous Systems: Leveraging LiDAR and Depth Sensing
8 pages
3D Object Detections and Recognitions: Assisting Visually Impaired People in Daily Activities
No ratings yet
3D Object Detections and Recognitions: Assisting Visually Impaired People in Daily Activities
28 pages
Park Depth Prompting For Sensor-Agnostic Depth Estimation CVPR 2024 Paper
No ratings yet
Park Depth Prompting For Sensor-Agnostic Depth Estimation CVPR 2024 Paper
11 pages
Real-Time CNN Visual Recognition
No ratings yet
Real-Time CNN Visual Recognition
13 pages
DeepPrimitive: Layered Image Decomposition
No ratings yet
DeepPrimitive: Layered Image Decomposition
13 pages
Dattnet: Monocular Depth Estimation Network Based On Attention Mechanisms
No ratings yet
Dattnet: Monocular Depth Estimation Network Based On Attention Mechanisms
10 pages
Nerf RPN
No ratings yet
Nerf RPN
13 pages
Object Detector For Blind Person
No ratings yet
Object Detector For Blind Person
20 pages
Electronics 12 02323 v2
No ratings yet
Electronics 12 02323 v2
14 pages
2207.02696v1 2
No ratings yet
2207.02696v1 2
15 pages
2018 Frustum PointNets
No ratings yet
2018 Frustum PointNets
10 pages
Yolov 7
100% (1)
Yolov 7
17 pages
5 Ijlemr 77839
No ratings yet
5 Ijlemr 77839
5 pages
CARTO Category and Joint Agnostic Reconstruction
No ratings yet
CARTO Category and Joint Agnostic Reconstruction
10 pages
OD Trans Christopher-Lang2022 Q2
No ratings yet
OD Trans Christopher-Lang2022 Q2
15 pages
Brooks 1981
No ratings yet
Brooks 1981
64 pages
Object Tracking in Crowd Environment Using Deep Learning
No ratings yet
Object Tracking in Crowd Environment Using Deep Learning
8 pages
Yolo Vs RCNN
No ratings yet
Yolo Vs RCNN
5 pages
Stereo 3D Object Detection Network
No ratings yet
Stereo 3D Object Detection Network
9 pages
Stereo-Based Pedestrian Detection For Collision-Avoidance Applications
No ratings yet
Stereo-Based Pedestrian Detection For Collision-Avoidance Applications
12 pages
IJCRT2308153
No ratings yet
IJCRT2308153
9 pages
TFG Gabriel-Ciprian Dinu 2019
No ratings yet
TFG Gabriel-Ciprian Dinu 2019
60 pages
Towards Generalizable Multi Camera 3d Object Detection Via 2rp452obxu
No ratings yet
Towards Generalizable Multi Camera 3d Object Detection Via 2rp452obxu
19 pages
Objectdetection
No ratings yet
Objectdetection
7 pages
Keypoint Recognition with Trees
No ratings yet
Keypoint Recognition with Trees
29 pages
Putt
No ratings yet
Putt
4 pages
Deep Object Pose Estimation For Semantic Robotic Grasping of Household Objects
No ratings yet
Deep Object Pose Estimation For Semantic Robotic Grasping of Household Objects
11 pages
Annex Application of The Klic Leader Training Course 2025 1
No ratings yet
Annex Application of The Klic Leader Training Course 2025 1
13 pages
201404011404445.philosophy and Aims of Education
No ratings yet
201404011404445.philosophy and Aims of Education
28 pages
Professional Resume
No ratings yet
Professional Resume
3 pages
Oscar Peterson
0% (1)
Oscar Peterson
9 pages
Peak Performance - Principles and Practices
100% (1)
Peak Performance - Principles and Practices
8 pages
Book Title: Integrated Devices and Circuits For Artificial Intelligence
No ratings yet
Book Title: Integrated Devices and Circuits For Artificial Intelligence
1 page
A Magic Square
100% (1)
A Magic Square
10 pages
Participation The New Tiranny
No ratings yet
Participation The New Tiranny
45 pages
Aniket Sidana: Web Developer Profile
No ratings yet
Aniket Sidana: Web Developer Profile
2 pages
Maharashtra Slection NSJDBDJKDND
No ratings yet
Maharashtra Slection NSJDBDJKDND
3 pages
Business Planning for Entrepreneurs
No ratings yet
Business Planning for Entrepreneurs
2 pages
Rezzaki S3G3 GHO65 I Dont Deserve ... To Whom It May Concern
No ratings yet
Rezzaki S3G3 GHO65 I Dont Deserve ... To Whom It May Concern
1 page
TOK Exhibition Final Submission
No ratings yet
TOK Exhibition Final Submission
5 pages
College List
No ratings yet
College List
3 pages
Solution Manual For Discrete Mathematics Mathematical Reasoning and Proof With Puzzles Patterns and Games 1st Edition
No ratings yet
Solution Manual For Discrete Mathematics Mathematical Reasoning and Proof With Puzzles Patterns and Games 1st Edition
5 pages
GENDER6 Dana
No ratings yet
GENDER6 Dana
16 pages
AL-ATHARIYA Prospectus (Latest)
No ratings yet
AL-ATHARIYA Prospectus (Latest)
13 pages
Pizza Chains: Targeting & Strategies
No ratings yet
Pizza Chains: Targeting & Strategies
33 pages
Ôn Tập Và Kiểm Tra Đánh Giá Tiếng Anh 7
No ratings yet
Ôn Tập Và Kiểm Tra Đánh Giá Tiếng Anh 7
60 pages
Letter To Ani
No ratings yet
Letter To Ani
3 pages
Public Health
No ratings yet
Public Health
15 pages
Stressors of Nursing Instructors
No ratings yet
Stressors of Nursing Instructors
5 pages
Emotional Intelligence & Personal Development
No ratings yet
Emotional Intelligence & Personal Development
22 pages
Assessment of Timbre Using Verbal Attributes
100% (2)
Assessment of Timbre Using Verbal Attributes
12 pages
Haryana CET Study Plan
No ratings yet
Haryana CET Study Plan
9 pages
8214
No ratings yet
8214
1 page
SP270 Syllabus (Japanese Animation: Still Pictures, Moving Minds)
100% (1)
SP270 Syllabus (Japanese Animation: Still Pictures, Moving Minds)
6 pages
Teachers Resource Catalogue Grade 12
No ratings yet
Teachers Resource Catalogue Grade 12
42 pages
Reviewer For Pre Finals Culminating Activity For The Humanities and Social Sciences
No ratings yet
Reviewer For Pre Finals Culminating Activity For The Humanities and Social Sciences
7 pages
Anshul Resume
No ratings yet
Anshul Resume
1 page

Object Detection and Localization Using Stereo Cameras

Uploaded by

Object Detection and Localization Using Stereo Cameras

Uploaded by

2020 5th International Conference on Advanced

Robotics and Mechatronics (ICARM)

Object Detection and Localization Using Stereo Cameras

978-1-7281-6479-3/20/$31.00 ©2020 IEEE 628

Fig. 2. Depth map generate from Semi-global matching. Missing part of

978-1-7281-6479-3/20/$31.00 ©2020 IEEE 629

978-1-7281-6479-3/20/$31.00 ©2020 IEEE 630

&&$" C. Twice training strategy

978-1-7281-6479-3/20/$31.00 ©2020 IEEE 631

depends on whether there is discrimination in image blocks.

978-1-7281-6479-3/20/$31.00 ©2020 IEEE 632

978-1-7281-6479-3/20/$31.00 ©2020 IEEE 633

You might also like

Object Detection and Localization Using Stereo Cameras

Uploaded by

Object Detection and Localization Using Stereo Cameras

Uploaded by

2020 5th International Conference on Advanced

Robotics and Mechatronics (ICARM)

Object Detection and Localization Using Stereo Cameras

978-1-7281-6479-3/20/$31.00 ©2020 IEEE 628

Fig. 2. Depth map generate from Semi-global matching. Missing part of

978-1-7281-6479-3/20/$31.00 ©2020 IEEE 629

   

978-1-7281-6479-3/20/$31.00 ©2020 IEEE 630

&&$"  C. Twice training strategy

978-1-7281-6479-3/20/$31.00 ©2020 IEEE 631

depends on whether there is discrimination in image blocks.

978-1-7281-6479-3/20/$31.00 ©2020 IEEE 632

978-1-7281-6479-3/20/$31.00 ©2020 IEEE 633

You might also like

&&$" C. Twice training strategy