RGB-D Object Recognition Using Multi-Modal Deep Neural Network and DS Evidence Theory
Figure 1. The flowchart of the proposed RGB-D object recognition method.
Figure 2. The architecture of the proposed multi-modal network.
Figure 3. The results of image scaling. (a) The RGB and depth images from the "cereal_box" class; (b) the RGB and depth images from the "flashlight" class; (c) the RGB and depth images from the "cap" class; (d–f) the resized images of (a–c); (g–i) the scaled images of (a–c).
Figure 4. Examples of the RGB images and the depth images. (a–c) Three samples from the class "orange"; (d–f) three samples from the class "tomato"; (g–i) three samples from the class "cereal_box"; (j–l) three samples from the class "toothpaste".
Figure 5. Objects of different categories from the Washington RGB-D object dataset.
Figure 6. Examples of misclassified samples of the proposed method.
Abstract
1. Introduction
- The CNN-based multi-modal deep neural network is built for learning RGB features and depth features. The training of the proposed multi-modal network has two stages: first, the RGB CNN and the depth CNN are trained separately; then, the multi-modal feature learning network is trained to fine-tune the network parameters;
- We propose a quadruplet-sample-based objective function for each modality, which learns discriminative features more effectively (a code sketch follows this list). Furthermore, we propose a comprehensive multi-modal objective function, which includes two discriminative terms and one correlation term; and
- For each modality, an effective weighted trust degree is designed according to the probability outputs of the two SVMs and the learned features. The total trust degree can then be computed using Dempster's rule of combination for object recognition.
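To make the quadruplet objective concrete, the following is a minimal PyTorch sketch of a hinge-style quadruplet loss over learned embeddings. The margins `m1` and `m2`, the Euclidean distance, and the mean reduction are illustrative assumptions following the common quadruplet formulation, not the paper's exact equations.

```python
import torch.nn.functional as F

def quadruplet_loss(anchor, positive, neg1, neg2, m1=1.0, m2=0.5):
    """Hinge-style quadruplet loss (assumed form, not the paper's exact equations).

    `anchor` and `positive` share a class; `neg1` and `neg2` come from two
    other, mutually different classes. All inputs are (batch, dim) tensors.
    """
    d_ap = F.pairwise_distance(anchor, positive)  # intra-class distance
    d_an = F.pairwise_distance(anchor, neg1)      # inter-class distance
    d_nn = F.pairwise_distance(neg1, neg2)        # distance between the two negatives
    # Push both inter-class distances above the intra-class distance by a margin.
    return (F.relu(m1 + d_ap - d_an) + F.relu(m2 + d_ap - d_nn)).mean()
```

Compared with a triplet loss, the second hinge term also constrains a negative pair that does not contain the anchor, which further shrinks intra-class variation relative to inter-class variation.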
2. Related Work
3. Proposed Method
3.1. RGB-D Image Preprocessing
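Figure 3 contrasts naive resizing with aspect-ratio-preserving scaling. As a hedged illustration only, the sketch below scales the longer side of an image to the network input size and pads the shorter side by replicating border pixels, a common choice for this dataset; the paper's exact scaling scheme may differ.

```python
import cv2

def scale_to_square(img, size=227):
    """Aspect-preserving scaling: resize the longer side to `size`,
    then pad the shorter side by replicating the border pixels."""
    h, w = img.shape[:2]
    s = size / max(h, w)
    resized = cv2.resize(img, (max(1, round(w * s)), max(1, round(h * s))))
    pad_h, pad_w = size - resized.shape[0], size - resized.shape[1]
    top, left = pad_h // 2, pad_w // 2
    return cv2.copyMakeBorder(resized, top, pad_h - top, left, pad_w - left,
                              cv2.BORDER_REPLICATE)
```

Unlike direct resizing to a square, this keeps the object's aspect ratio, which Figure 3 suggests preserves shape cues in both the RGB and the depth channels.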
3.2. Feature Learning Method of the Proposed Multi-Modal Network
3.2.1. The Architecture of the Proposed Multi-Modal Network
3.2.2. RGB Feature Learning and Depth Feature Learning
3.2.3. Multi-Modal Feature Learning
- Initialize the RGB CNN and the depth CNN with parameters from AlexNet, which has been pre-trained on the large-scale ImageNet dataset.
- Train the RGB CNN and the depth CNN, respectively, using the SGD method with back-propagation. For the RGB CNN:
- (1) Update according to Equation (2);
- (2) Update the parameters in the RGB CNN according to Equations (3)–(6);
- (3) Repeat (1)–(2) until convergence or the maximum number of iterations is reached.
The parameter and the parameters in the RGB CNN are updated in turn; the depth CNN is trained likewise.
- Train the fusion network using the SGD method with back-propagation:
- (1) Update according to Equation (14);
- (2) Update according to Equations (8) and (15);
- (3) Update according to Equations (9) and (16);
- (4) Update the parameters in the two CNNs according to Equations (17) and (18);
- (5) Repeat (1)–(4) until convergence or the maximum number of iterations is reached. A code sketch of this two-stage schedule follows the list.
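A minimal PyTorch sketch of the two-stage schedule, under stated assumptions: the two CNNs are AlexNet trunks with the final classification layer removed, `quadruplet_loss` is the earlier sketch, the quadruplet loaders `rgb_quadruplets` and `depth_quadruplets` are hypothetical, and the squared-difference correlation surrogate with weight `lambda_c` stands in for the paper's Equations (8)–(18).

```python
import torch
import torchvision

def embed_net():
    """AlexNet initialized from ImageNet weights, reused as a 4096-d feature extractor."""
    net = torchvision.models.alexnet(weights="IMAGENET1K_V1")
    net.classifier = net.classifier[:-1]  # drop the final fc layer
    return net

rgb_net, depth_net = embed_net(), embed_net()

# Stage 1: train each CNN on its own modality (shown for RGB; depth is analogous).
opt = torch.optim.SGD(rgb_net.parameters(), lr=1e-3, momentum=0.9)
for a, p, n1, n2 in rgb_quadruplets:  # hypothetical quadruplet loader
    loss = quadruplet_loss(rgb_net(a), rgb_net(p), rgb_net(n1), rgb_net(n2))
    opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: fine-tune both CNNs jointly with two discriminative terms and a
# correlation term. The correlation surrogate below is an assumption.
lambda_c = 0.1  # assumed weight for the correlation term
opt = torch.optim.SGD(list(rgb_net.parameters()) + list(depth_net.parameters()),
                      lr=1e-4, momentum=0.9)
for (ra, rp, rn1, rn2), (da, dp, dn1, dn2) in zip(rgb_quadruplets, depth_quadruplets):
    disc = (quadruplet_loss(rgb_net(ra), rgb_net(rp), rgb_net(rn1), rgb_net(rn2))
            + quadruplet_loss(depth_net(da), depth_net(dp), depth_net(dn1), depth_net(dn2)))
    corr = ((rgb_net(ra) - depth_net(da)) ** 2).mean()  # stand-in correlation term
    loss = disc + lambda_c * corr
    opt.zero_grad(); loss.backward(); opt.step()
```

The design point is the ordering: per-modality discriminative training first gives each CNN a stable feature space, and the joint stage then aligns the two spaces without destroying the discriminative structure.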
3.3. RGB-D Object Recognition Based on DS Evidence Theory
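For two modalities whose mass functions put belief only on singleton class hypotheses, Dempster's rule of combination reduces to an element-wise product followed by conflict normalization. A minimal sketch, assuming the weighted trust degrees of each modality have already been turned into mass vectors `m_rgb` and `m_depth` (how those weights are computed from the SVM outputs and learned features is the paper's own design):

```python
import numpy as np

def dempster_combine(m1, m2):
    """Dempster's rule for two mass functions over singleton hypotheses only.

    With singleton focal elements, the only non-conflicting pairs are those
    where both sources assign mass to the same class, so the agreement
    m1 . m2 equals 1 - K, where K is the total conflict.
    """
    agreement = m1 @ m2
    if agreement == 0.0:
        raise ValueError("total conflict: Dempster's rule is undefined")
    return (m1 * m2) / agreement  # normalized combined masses

# Hypothetical mass vectors for a 3-class problem:
m_rgb = np.array([0.6, 0.3, 0.1])
m_depth = np.array([0.5, 0.2, 0.3])
m_total = dempster_combine(m_rgb, m_depth)
predicted_class = int(np.argmax(m_total))  # class with the largest combined mass
```

The object is then assigned to the class with the largest total trust degree.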
4. Experimental Results
4.1. Dataset and Implementation Details
4.2. Comparison with Different Baselines
- RGB CNN: The RGB CNN was used to learn RGB features, with a softmax layer added to the end of the network for classification.
- Depth CNN: The depth CNN was used to learn depth features, with a softmax layer added to the end of the network for classification.
- RGB CNN+SVM: Only the RGB CNN was trained, using the RGB images, and an SVM was used as the classifier (a sketch of this SVM-on-features setup follows this list).
- Depth CNN+SVM: Only the depth CNN was trained, using the depth images, and an SVM was used as the classifier.
- RGB-D CNNs+Multi-modal learning: The RGB CNN and the depth CNN were first trained using the RGB images and the depth images, respectively. Then, multi-modal learning was performed using Equation (12). Finally, the RGB feature and the depth feature were concatenated directly, and the concatenated feature was sent to the SVM classifier for object recognition.
- RGB-D CNNs+DS fusion: The RGB CNN and the depth CNN were first trained separately. Then, the RGB feature and the depth feature were sent to their respective SVMs. Finally, our DS fusion strategy was used to fuse the two recognition results.
- RGB-D CNNs+Multi-modal learning+DS fusion: Our proposed method.
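For the SVM-based baselines and the proposed method, a per-modality classifier can be built on the learned features with Platt-scaled probability outputs. A sketch, assuming scikit-learn and hypothetical feature/label arrays (the kernel choice is an assumption):

```python
from sklearn.svm import SVC

# Hypothetical arrays: train_feats / test_feats hold the learned CNN
# features (one row per image); train_labels holds the class ids.
clf = SVC(kernel="linear", probability=True)  # probability=True enables Platt scaling
clf.fit(train_feats, train_labels)
probs = clf.predict_proba(test_feats)         # per-class probabilities, ordered by clf.classes_
preds = clf.classes_[probs.argmax(axis=1)]    # hard predictions for the CNN+SVM baselines
```

In the proposed method, these per-class probabilities are not used as final predictions; they feed the weighted trust degrees that the DS fusion of Section 3.3 combines.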
4.3. Comparison with State-of-the-Art Methods
5. Conclusions
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
References
- Wong, S.C.; Stamatescu, V.; Gatt, A.; Kearney, D.; Lee, I.; McDonnell, M.D. Track Everything: Limiting Prior Knowledge in Online Multi-Object Recognition. IEEE Trans. Image Process. 2017, 26, 4669–4683. [Google Scholar] [CrossRef] [PubMed]
- Aldoma, A.; Tombari, F.; Stefano, L.D.; Vincze, M. A Global Hypothesis Verification Framework for 3D Object Recognition in Clutter. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 38, 1383–1396. [Google Scholar] [CrossRef] [PubMed]
- Oliveira, F.F.; Souza, A.A.F.; Fernandes, M.A.C.; Gomes, R.B.; Goncalves, L.M.G. Efficient 3D Objects Recognition Using Multifoveated Point Clouds. Sensors 2018, 18, 2302. [Google Scholar] [CrossRef] [PubMed]
- Chuang, M.C.; Hwang, J.N.; Williams, K. A Feature Learning and Object Recognition Framework for Underwater Fish Images. IEEE Trans. Image Process. 2016, 25, 1862–1872. [Google Scholar] [CrossRef] [PubMed]
- Gandarias, J.M.; Gómez-de-Gabriel, J.M.; García-Cerezo, A.J. Enhancing Perception with Tactile Object Recognition in Adaptive Grippers for Human–Robot Interaction. Sensors 2018, 18, 692. [Google Scholar] [CrossRef] [PubMed]
- Sanchez-Riera, J.; Hua, K.L.; Hsiao, Y.S.; Lim, T.; Hidayati, S.C.; Cheng, W.H. A comparative study of data fusion for RGB-D based visual recognition. Pattern Recognit. Lett. 2016, 73, 1–16. [Google Scholar] [CrossRef]
- Ren, L.; Lu, J.; Feng, J.; Zhou, J. Multi-modal uniform deep learning for RGB-D person re-identification. Pattern Recognit. 2017, 72, 446–457. [Google Scholar] [CrossRef]
- Xu, X.; Li, Y.; Wu, G.; Luo, J. Multi-modal deep feature learning for RGB-D object detection. Pattern Recognit. 2017, 72, 300–313. [Google Scholar] [CrossRef]
- Bai, J.; Wu, Y.; Zhang, J.; Chen, F. Subset based deep learning for RGB-D object recognition. Neurocomputing 2015, 165, 280–292. [Google Scholar] [CrossRef]
- Li, X.; Fang, M.; Zhang, J.J.; Wu, J. Learning coupled classifiers with RGB images for RGB-D object recognition. Pattern Recognit. 2017, 61, 433–446. [Google Scholar] [CrossRef]
- Lowe, D.G. Distinctive Image Features from Scale-Invariant Keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef] [Green Version]
- Bay, H.; Tuytelaars, T.; Gool, L.V. SURF: Speeded up Robust Features. In Proceedings of the European Conference on Computer Vision, Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2006; Volume 3951, pp. 404–417. [Google Scholar]
- Johnson, A.E.; Hebert, M. Surface matching for object recognition in complex three-dimensional scenes. Image Vis. Comput. 1998, 16, 635–651. [Google Scholar] [CrossRef]
- Johnson, A.E.; Hebert, M. Using Spin Images for Efficient Object Recognition in Cluttered 3D Scenes. IEEE Trans. Pattern Anal. Mach. Intell. 1999, 21, 433–449. [Google Scholar] [CrossRef]
- Schwarz, M.; Schulz, H.; Behnke, S. RGB-D object recognition and pose estimation based on pre-trained convolutional neural network features. In Proceedings of the IEEE International Conference on Robotics and Automation, Seattle, WA, USA, 26–30 May 2015; pp. 1329–1335. [Google Scholar]
- Cheng, Y.; Zhao, X.; Huang, K.; Tan, T. Semi-supervised learning and feature evaluation for RGB-D object recognition. Comput. Vis. Image Underst. 2015, 139, 149–160. [Google Scholar] [CrossRef]
- Tang, H.; Su, Y.; Wang, J. Evidence theory and differential evolution based uncertainty quantification for buckling load of semi-rigid jointed frames. Acad. Sci. 2015, 40, 1611–1627. [Google Scholar] [CrossRef]
- Wang, J.; Liu, F. Temporal evidence combination method for multi-sensor target recognition based on DS theory and IFS. J. Syst. Eng. Electron. 2017, 28, 1114–1125. [Google Scholar]
- Kuang, Y.; Li, L. Speech emotion recognition of decision fusion based on DS evidence theory. In Proceedings of the IEEE 4th International Conference on Software Engineering and Service Science, Beijing, China, 23–25 May 2013; pp. 795–798. [Google Scholar]
- Dong, G.; Kuang, G. Target Recognition via Information Aggregation through Dempster–Shafer’s Evidence Theory. IEEE Geosci. Remote Sens. Lett. 2015, 12, 1247–1251. [Google Scholar] [CrossRef]
- Lai, K.; Bo, L.; Ren, X.; Fox, D. A large-scale hierarchical multi-view RGB-D object dataset. In Proceedings of the IEEE International Conference on Robotics and Automation, Shanghai, China, 9–13 May 2011; pp. 1817–1824. [Google Scholar]
- Bo, L.; Ren, X.; Fox, D. Depth kernel descriptors for object recognition. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, San Francisco, CA, USA, 25–30 September 2011; pp. 821–826. [Google Scholar]
- Yu, M.; Liu, L.; Shao, L. Structure-Preserving Binary Representations for RGB-D Action Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 38, 1651–1664. [Google Scholar] [CrossRef] [Green Version]
- Logoglu, K.B.; Kalkan, S.; Temizel, A. CoSPAIR: Colored Histograms of Spatial Concentric Surflet-Pairs for 3D object recognition. Robot. Auton. Syst. 2016, 75, 558–570. [Google Scholar] [CrossRef]
- Bo, L.; Ren, X.; Fox, D. Unsupervised Feature Learning for RGB-D Based Object Recognition. In Proceedings of the International Symposium on Experimental Robotics, Québec City, QC, Canada, 18–21 June 2012; pp. 387–402. [Google Scholar]
- Blum, M.; Springenberg, J.T.; Wülfing, J.; Riedmiller, M. A learned feature descriptor for object recognition in RGB-D data. In Proceedings of the IEEE International Conference on Robotics and Automation, St. Paul, MN, USA, 14–18 May 2012; pp. 1298–1303. [Google Scholar]
- Asif, U.; Bennamoun, M.; Sohel, F. Discriminative feature learning for efficient RGB-D object recognition. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, Hamburg, Germany, 28 September–2 October 2015; pp. 272–279. [Google Scholar]
- Huang, Y.; Zhu, F.; Shao, L.; Frangi, A.F. Color Object Recognition via Cross-Domain Learning on RGB-D Images. In Proceedings of the IEEE International Conference on Robotics and Automation, Stockholm, Sweden, 16–21 May 2016; pp. 1672–1677. [Google Scholar]
- Li, F.; Liu, H.; Xu, X.; Sun, F. Multi-Modal Local Receptive Field Extreme Learning Machine for object recognition. In Proceedings of the International Joint Conference on Neural Networks, Vancouver, BC, Canada, 24–29 July 2016; pp. 1696–1701. [Google Scholar]
- Socher, R.; Huval, B.; Bhat, B.; Manning, C.D.; Ng, A.Y. Convolutional-Recursive Deep Learning for 3D Object Classification. In Proceedings of the International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; pp. 656–664. [Google Scholar]
- Wang, A.; Lu, J.; Cai, J.; Cham, T.J.; Wang, G. Large-Margin Multi-Modal Deep Learning for RGB-D Object Recognition. IEEE Trans. Multimed. 2015, 17, 1887–1898. [Google Scholar] [CrossRef]
- Rahman, M.M.; Tan, Y.; Xue, J.; Lu, K. RGB-D object recognition with multimodal deep convolutional neural networks. In Proceedings of the IEEE International Conference on Multimedia and Expo, Hong Kong, China, 10–14 July 2017; pp. 991–996. [Google Scholar]
- Tang, L.; Yang, Z.X.; Jia, K. Canonical Correlation Analysis Regularization: An Effective Deep Multi-View Learning Baseline for RGB-D Object Recognition. IEEE Trans. Cogn. Dev. Syst. 2018. [Google Scholar] [CrossRef]
- Zia, S.; Yüksel, B.; Yüret, D.; Yemez, Y. RGB-D Object Recognition Using Deep Convolutional Neural Networks. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Venice, Italy, 22–29 October 2017; pp. 887–894. [Google Scholar]
- Song, S.; Xiao, J. Tracking Revisited Using RGBD Camera: Unified Benchmark and Baselines. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013; pp. 233–240. [Google Scholar]
- Meshgi, K.; Maeda, S.; Oba, S.; Skibbe, H.; Li, Y.; Ishii, S. Occlusion aware particle filter tracker to handle complex and persistent occlusions. Comput. Vis. Image Underst. 2016, 150, 81–94. [Google Scholar] [CrossRef]
- Camplani, M.; Hannuna, S.; Mirmehdi, M.; Damen, D.; Paiement, A.; Tao, L.; Burghardt, T. Real-time RGB-D Tracking with Depth Scaling Kernelised Correlation Filters and Occlusion Handling. In Proceedings of the British Machine Vision Conference, Swansea, UK, 7–10 September 2015; pp. 145.1–145.11. [Google Scholar]
- Gupta, S.; Girshick, R.; Arbeláez, P.; Malik, J. Learning Rich Features from RGB-D Images for Object Detection and Segmentation. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 345–360. [Google Scholar]
- Gupta, S.; Arbeláez, P.; Malik, J. Perceptual Organization and Recognition of Indoor Scenes from RGB-D Images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 564–571. [Google Scholar]
- Song, S.; Lichtenberg, S.P.; Xiao, J. SUN RGB-D: A RGB-D scene understanding benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 567–576. [Google Scholar]
- Gupta, S.; Hoffman, J.; Malik, J. Cross Modal Distillation for Supervision Transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2827–2836. [Google Scholar]
- Song, S.; Xiao, J. Deep Sliding Shapes for Amodal 3D Object Detection in RGB-D Images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 808–816. [Google Scholar]
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; pp. 1097–1105. [Google Scholar]
- Chen, W.; Chen, X.; Zhang, J.; Huang, K. Beyond Triplet loss: A deep quadruplet network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 403–412. [Google Scholar]
- Zhang, D.; Zhao, L.; Xu, D.; Lu, D. Learning Local Feature Descriptor with Quadruplet Ranking Loss. In Proceedings of the CCF Chinese Conference on Computer Vision, Tianjin, China, 11–14 October 2017; pp. 206–217. [Google Scholar]
- Platt, J.C. Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods. In Advances in Large Margin Classifiers; MIT Press: Cambridge, MA, USA, 2000; pp. 61–74. [Google Scholar]
- Bo, L.; Lai, K.; Ren, X.; Fox, D. Object recognition with hierarchical kernel descriptors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 20–25 June 2011; pp. 1729–1736. [Google Scholar]
- Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
- Wang, A.; Cai, J.; Lu, J.; Cham, T.J. MMSS: Multi-modal Sharable and Specific Feature Learning for RGB-D Object Recognition. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1125–1133. [Google Scholar]
- Eitel, A.; Springenberg, J.T.; Spinello, L.; Riedmiller, M.; Burgard, W. Multimodal Deep Learning for Robust RGB-D Object Recognition. In Proceedings of the IEEE/RSJ International Conference on Intelligence Robots and Systems, Hamburg, Germany, 28 September–2 October 2015; pp. 681–687. [Google Scholar]
- Cheng, Y.; Cai, R.; Zhao, X.; Huang, K. Convolutional Fisher Kernels for RGB-D Object Recognition. In Proceedings of the International Conference on 3D Vision, Lyon, France, 19–22 October 2015; pp. 135–143. [Google Scholar]
Trial | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Mean | Var |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Accuracy (%) | 92.9 | 92.7 | 90.1 | 91.9 | 92.2 | 90.4 | 93.1 | 90.2 | 91.7 | 92.8 | 91.8 | 1.4 |
Method | Accuracy (%) |
---|---|
RGB CNN | 85.7 ± 2.3 |
Depth CNN | 81.3 ± 2.2 |
RGB CNN+SVM | 87.5 ± 2.1 |
Depth CNN+SVM | 84.8 ± 2.0 |
RGB-D CNNs+Multi-modal learning | 90.2 ± 1.8 |
RGB-D CNNs+DS fusion | 88.9 ± 1.9 |
RGB-D CNNs+Multi-modal learning+DS fusion | 91.8 ± 1.4 |
Method | RGB Accuracy (%) | Depth Accuracy (%) | RGB-D Accuracy (%) |
---|---|---|---|
Linear SVM [21] | 74.3 ± 3.3 | 53.1 ± 1.7 | 81.9 ± 2.8 |
kSVM [21] | 74.5 ± 3.1 | 64.7 ± 2.2 | 83.8 ± 3.5 |
HKDES [48] | 76.1 ± 2.2 | 75.7 ± 2.6 | 84.1 ± 2.2 |
Kernel Descriptor [22] | 77.7 ± 1.9 | 78.8 ± 2.7 | 86.2 ± 2.1 |
CNN-RNN [30] | 80.8 ± 4.2 | 78.9 ± 3.8 | 86.8 ± 3.3 |
RGB-D HMP [25] | 82.4 ± 3.1 | 81.2 ± 2.3 | 87.5 ± 2.9 |
MMSS [49] | 74.6 ± 2.9 | 75.6 ± 2.7 | 88.5 ± 2.2 |
Fus-CNN (HHA) [50] | 84.1 ± 2.7 | 83.0 ± 2.7 | 91.0 ± 1.9 |
Fus-CNN (Jet) [50] | 84.1 ± 2.7 | 83.8 ± 2.7 | 91.3 ± 1.4 |
CFK [51] | 86.8 ± 2.2 | 85.8 ± 2.3 | 91.2 ± 1.5 |
MDCNN [32] | 87.9 ± 2.0 | 85.2 ± 2.1 | 92.2 ± 1.3 |
VGGnet + 3D CNN + VGG3D [34] | 88.9 ± 2.1 | 78.4 ± 2.4 | 91.8 ± 0.9 |
Our proposed method | 87.5 ± 2.1 | 84.8 ± 2.0 | 91.8 ± 1.4 |