Convolutional nets and watershed cuts for real-time semantic labeling of RGBD videos

Published: 01 January 2014

Abstract

This work addresses multi-class segmentation of indoor scenes with RGB-D inputs. While this area of research has gained much attention recently, most works still rely on handcrafted features. In contrast, we apply a multiscale convolutional network to learn features directly from the images and the depth information. Using frame-by-frame labeling, we obtain nearly state-of-the-art performance on the NYU-v2 depth data set, with an accuracy of 64.5%. We then show that the labeling can be further improved by exploiting the temporal consistency of the video sequence of the scene. To that end, we present a method that produces temporally consistent superpixels from a streaming video. Among the different methods producing superpixel segmentations of an image, the graph-based approach of Felzenszwalb and Huttenlocher is widely used. One of its interesting properties is that the regions are computed greedily in quasi-linear time by using a minimum spanning tree. Within a framework that exploits minimum spanning trees throughout, we propose an efficient video segmentation approach that computes temporally consistent pixels in a causal manner, meeting the needs of causal and real-time applications. We illustrate the labeling of indoor scenes in video sequences that could be processed in real time using appropriate hardware such as an FPGA.
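The greedy, minimum-spanning-tree-style region merging that the abstract attributes to Felzenszwalb and Huttenlocher can be illustrated with a minimal single-image sketch. This is not the authors' causal video method: the names `DisjointSet`, `grid_edges`, and `segment_graph`, the 4-connected grayscale grid, and the scale parameter `k` are assumptions made only for illustration.

```python
from typing import List, Tuple

Edge = Tuple[float, int, int]  # (weight, node_a, node_b)


class DisjointSet:
    """Union-find that also tracks component size and internal difference."""

    def __init__(self, n: int) -> None:
        self.parent = list(range(n))
        self.size = [1] * n
        # Largest edge weight merged into each component so far.
        self.internal = [0.0] * n

    def find(self, x: int) -> int:
        # Path halving keeps the trees shallow.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, root_a: int, root_b: int, weight: float) -> None:
        # Attach the smaller tree under the larger one.
        if self.size[root_a] < self.size[root_b]:
            root_a, root_b = root_b, root_a
        self.parent[root_b] = root_a
        self.size[root_a] += self.size[root_b]
        # Edges arrive in increasing order, so this edge is the largest
        # one inside the merged component.
        self.internal[root_a] = weight


def grid_edges(image: List[List[float]]) -> List[Edge]:
    """Build 4-connected edges weighted by absolute intensity difference."""
    h, w = len(image), len(image[0])
    edges: List[Edge] = []
    for y in range(h):
        for x in range(w):
            if x + 1 < w:
                edges.append((abs(image[y][x] - image[y][x + 1]), y * w + x, y * w + x + 1))
            if y + 1 < h:
                edges.append((abs(image[y][x] - image[y + 1][x]), y * w + x, (y + 1) * w + x))
    return edges


def segment_graph(num_nodes: int, edges: List[Edge], k: float) -> List[int]:
    """Greedy merging in Kruskal order; returns one component label per node."""
    ds = DisjointSet(num_nodes)
    for weight, a, b in sorted(edges):  # sorting dominates: O(m log m)
        ra, rb = ds.find(a), ds.find(b)
        if ra == rb:
            continue
        # Merge only if the edge is no heavier than both components'
        # internal difference plus a size-dependent tolerance k / |C|.
        threshold = min(ds.internal[ra] + k / ds.size[ra],
                        ds.internal[rb] + k / ds.size[rb])
        if weight <= threshold:
            ds.union(ra, rb, weight)
    return [ds.find(i) for i in range(num_nodes)]


# Usage: a 2x4 image with a dark left half and a bright right half.
image = [[0, 0, 10, 10],
         [0, 0, 10, 10]]
labels = segment_graph(8, grid_edges(image), k=1.0)
```

With a small `k` the two halves stay separate regions, while a larger `k` tolerates the intensity jump and merges them; this sketch omits the smoothing and minimum-region-size post-processing that practical implementations usually add.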

References

[1]
Cédric Allène, Jean-Yves Audibert, Michel Couprie, and Renaud Keriven. Some links between extremum spanning forests, watersheds and min-cuts. Image and Vision Computing, 28(10):1460-1471, 2010.
[2]
César Cadena and Jana Košecka. Semantic parsing for priming object detection in RGB-D scenes. In 3rd Workshop on Semantic Perception, Mapping and Exploration, 2013.
[3]
Dan Claudiu Ciresan, Ueli Meier, Jonathan Masci, Luca Maria Gambardella, and Jürgen Schmidhuber. Flexible, high performance convolutional neural networks for image classification. In Proc. of the 22nd International Joint Conference on Artificial Intelligence, pages 1237-1242, 2011.
[4]
Dan Claudiu Ciresan, Alessandro Giusti, Luca Maria Gambardella, and Jürgen Schmidhuber. Deep neural networks segment neuronal membranes in electron microscopy images. In Conference on Neural Information Processing Systems, pages 2852-2860, 2012.
[5]
Dan Claudiu Ciresan, Alessandro Giusti, Luca Maria Gambardella, and Jürgen Schmidhuber. Mitosis detection in breast cancer histology images using deep neural networks. In MICCAI, 2013.
[6]
R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A Matlab-like environment for machine learning. In NIPS Big Learning Workshop, Sierra Nevada, Spain, 2011.
[7]
Dorin Comaniciu and Peter Meer. Mean shift: A robust approach toward feature space analysis. IEEE Trans. Pattern Analysis and Machine Intelligence, 24:603-619, 2002.
[8]
Camille Couprie. Multi-label energy minimization for object class segmentation. In 20th European Signal Processing Conference, Bucharest, Romania, August 2012.
[9]
Camille Couprie. Source code for causal graph-based video segmentation, 2013. www.esiee.fr/~coupriec/code.html.
[10]
Camille Couprie, Leo J. Grady, Laurent Najman, and Hugues Talbot. Power watershed: A unifying graph-based optimization framework. IEEE Trans. Pattern Analysis and Machine Intelligence, 33(7):1384-1399, 2011.
[11]
Camille Couprie, Clément Farabet, Yann LeCun, and Laurent Najman. Causal graph-based video segmentation. In Proc. of IEEE International Conference on Image Processing, 2013a.
[12]
Camille Couprie, Clément Farabet, Laurent Najman, and Yann LeCun. Indoor semantic segmentation using depth information. In International Conference on Learning Representations, 2013b.
[13]
Jean Cousty, Gilles Bertrand, Laurent Najman, and Michel Couprie. Watershed cuts: Minimum spanning forests and the drop of water. IEEE Trans. Pattern Analysis and Machine Intelligence, 31(8):1362-1374, 2009.
[14]
Leandro Cruz, Djalma Lucio, and Luiz Velho. Kinect and RGBD images: Challenges and applications. SIBGRAPI Tutorial, 2012.
[15]
Anat Levin, Dani Lischinski, and Yair Weiss. Colorization using optimization. ACM Transactions on Graphics, 23:689-694, 2004.
[16]
Clément Farabet. Towards Real-Time Image Understanding with Convolutional Networks. PhD thesis, Université Paris Est, 2014.
[17]
Clément Farabet, Camille Couprie, Laurent Najman, and Yann LeCun. Scene Parsing with Multiscale Feature Learning, Purity Trees, and Optimal Covers. In Proc. of International Conference on Machine Learning, Edinburgh, Scotland, June 2012.
[18]
Clément Farabet, Camille Couprie, Laurent Najman, and Yann LeCun. Learning hierarchical features for scene labeling. IEEE Trans. on Pattern Analysis and Machine Intelligence, 35(8):1915-1929, Aug 2013.
[19]
Pedro F. Felzenszwalb and Daniel P. Huttenlocher. Efficient graph-based image segmentation. International Journal of Computer Vision, 59:167-181, 2004.
[20]
Kunihiko Fukushima. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4):193-202, 1980.
[21]
Daniel Glasner, Shiv Naga Prasad Vitaladevuni, and Ronen Basri. Contour-based joint clustering of multiple segmentations. In Proc. of IEEE Computer Vision and Pattern Recognition, pages 2385-2392, Washington, DC, USA, 2011.
[22]
Cristina Gomila and Fernand Meyer. Graph-based object tracking. In Proc. of IEEE International Conference on Image Processing, 2003.
[23]
Matthias Grundmann, Vivek Kwatra, Mei Han, and Irfan Essa. Efficient hierarchical graph-based video segmentation. In Proc. of IEEE Computer Vision and Pattern Recognition, 2010.
[24]
Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580, 2012.
[25]
Derek Hoiem, Alexei A. Efros, and Martial Hebert. Geometric context from a single image. In Proc. of IEEE International Conference on Computer Vision, volume 1, pages 654-661, 2005.
[26]
Navdeep Jaitly, Patrick Nguyen, Andrew Senior, and Vincent Vanhoucke. Application of pretrained deep neural networks to large vocabulary speech recognition. In Proc. of Interspeech, 2012.
[27]
Allison Janoch, Sergey Karayev, Yangqing Jia, Jonathan T. Barron, Mario Fritz, Kate Saenko, and Trevor Darrell. A category-level 3-D object dataset: Putting the kinect to work. In IEEE International Conference on Computer Vision Workshops, pages 1168-1174, 2011.
[28]
Armand Joulin, Francis Bach, and Jean Ponce. Multi-class cosegmentation. In Proc. of IEEE Computer Vision and Pattern Recognition, pages 542-549, 2012.
[29]
Hema S. Koppula, Abhishek Anand, Thorsten Joachims, and Ashutosh Saxena. Cornell-RGBD-dataset, 2009. http://pr.cs.cornell.edu/sceneunderstanding/data/data.php.
[30]
Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proc. of the IEEE, 86(11):2278-2324, Nov 1998.
[31]
Yann LeCun, Fu-Jie Huang, and Leon Bottou. Learning Methods for generic object recognition with invariance to pose and lighting. In Proc. of IEEE Computer Vision and Pattern Recognition, 2004.
[32]
J. Lee, JungHwan Oh, and Sae Hwang. Clustering of video objects by graph matching. In Proc. of IEEE International Conference on Multimedia and Expo, pages 394-397, 2005.
[33]
Ian Lenz, Honglak Lee, and Ashutosh Saxena. Deep learning for detecting robotic grasps. CoRR, abs/1301.3592. In Robotics: Science and Systems, 2013.
[34]
Fernand Meyer. Minimum spanning forests for morphological segmentation. In Proc. of International Symposium on Mathematical Morphology, pages 77-84, 1994.
[35]
Ondrej Miksik, Daniel Munoz, J. Andrew Bagnell, and Martial Hebert. Efficient temporal consistency for streaming video scene analysis. In Proc. of the IEEE International Conference on Robotics and Automation, Karlsruhe, Germany, 2013.
[36]
O.J. Morris, M. de Jersey Lee, and A.G. Constantinides. Graph theory for image analysis: an approach based on the shortest spanning tree. Communications, Radar and Signal Processing, IEE Proceedings F, 133(2):146-152, April 1986.
[37]
Andreas C. Müller and Sven Behnke. Learning depth-sensitive conditional random fields for semantic segmentation of RGB-D images. In IEEE International Conference on Robotics and Automation, Hong Kong, May 2014.
[38]
Sylvain Paris. Edge-preserving smoothing and mean-shift segmentation of video streams. In Proc. of IEEE European Conference on Computer Vision, pages 460-473, Marseille, France, 2008.
[39]
Xiaofeng Ren, Liefeng Bo, and D. Fox. RGB-(D) scene labeling: Features and algorithms. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition, pages 2759-2766, June 2012.
[40]
Maximilian Riesenhuber and Tomaso Poggio. Hierarchical models of object recognition in cortex. Nature Neuroscience, 2:1019-1025, 1999.
[41]
Hannes Schulz and Sven Behnke. Learning object-class segmentation with convolutional neural networks. In 11th European Symposium on Artificial Neural Networks, 2012.
[42]
Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. In IEEE Trans. on Pattern Analysis and Machine Intelligence, volume 22, pages 888-905, 1997.
[43]
Nathan Silberman and Rob Fergus. Indoor scene segmentation using a structured light sensor. In 3DRR Workshop, IEEE International Conference on Computer Vision Workshops, 2011.
[44]
Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from RGBD images. In Proc. of IEEE European Conference on Computer Vision, 2012.
[45]
Ali Kemal Sinop and Leo Grady. A seeded image segmentation framework unifying graph cuts and random walker which yields a new algorithm. In Proc. of International Conference of Computer Vision, 2007.
[46]
Richard Socher, Brody Huval, Bharath Bhat, Christopher D. Manning, and Andrew Y. Ng. Convolutional-Recursive Deep Learning for 3D Object Classification. In Advances in Neural Information Processing Systems 25, 2012.
[47]
Jörg Stückler, Benedikt Waldvogel, Hannes Schulz, and Sven Behnke. Dense real-time mapping of object-class semantics from RGB-D video. Journal of Real-Time Image Processing, pages 1-11, 2013.
[48]
Olga Veksler, Yuri Boykov, and Paria Mehrani. Superpixels and supervoxels in an energy optimization framework. In Proc. of IEEE European Conference on Computer Vision, Heraklion, Crete, Greece, September 5-11 (5), pages 211-224, 2010.
[49]
Chenliang Xu, Caiming Xiong, and Jason J. Corso. Streaming hierarchical video segmentation. In Proc. of IEEE European Conference on Computer Vision, Florence, Italy, October 7-13, pages 626-639, 2012.

      Published In

      The Journal of Machine Learning Research  Volume 15, Issue 1
      January 2014
      4085 pages
      ISSN:1532-4435
      EISSN:1533-7928

      Publisher

      JMLR.org

      Publication History

      Revised: 01 May 2014
      Published: 01 January 2014
      Published in JMLR Volume 15, Issue 1

      Author Tags

      1. convolutional networks
      2. deep learning
      3. depth information
      4. optimization
      5. superpixels

      Cited By

      • (2021) Real-time human detection for robots using CNN with a feature-based layered pre-filter. 2016 25th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), pages 1120-1125. DOI: 10.1109/ROMAN.2016.7745248. Online publication date: 11-Mar-2021
      • (2021) Augmenting deep convolutional neural networks with depth-based layered detection for human detection. 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1073-1078. DOI: 10.1109/IROS.2016.7759182. Online publication date: 11-Mar-2021
      • (2019) Region Merging Driven by Deep Learning for RGB-D Segmentation and Labeling. Proceedings of the 13th International Conference on Distributed Smart Cameras, pages 1-6. DOI: 10.1145/3349801.3349810. Online publication date: 9-Sep-2019
      • (2017) Segmentation and semantic labelling of RGBD data with convolutional neural networks and surface fitting. IET Computer Vision, 11:8, pages 633-642. DOI: 10.1049/iet-cvi.2016.0502. Online publication date: 20-Sep-2017
      • (2017) Multi-modal deep feature learning for RGB-D object detection. Pattern Recognition, 72:C, pages 300-313. DOI: 10.1016/j.patcog.2017.07.026. Online publication date: 1-Dec-2017
      • (2016) Semantic video labeling by developmental visual agents. Computer Vision and Image Understanding, 146:C, pages 9-26. DOI: 10.1016/j.cviu.2016.02.011. Online publication date: 1-May-2016
      • Lesion border detection using deep learning. 2016 IEEE Congress on Evolutionary Computation (CEC), pages 1416-1421. DOI: 10.1109/CEC.2016.7743955
