Multi-scale Deep Learning for Gesture Detection and Localization

Natalia Neverova^16,17,
Christian Wolf^16,17,
Graham W. Taylor¹⁸ &
…
Florian Nebout¹⁹

Part of the book series: Lecture Notes in Computer Science ((LNIP,volume 8925))

Included in the following conference series:

European Conference on Computer Vision

7046 Accesses
50 Citations

Abstract

We present a method for gesture detection and localization based on multi-scale and multi-modal deep learning. Each visual modality captures spatial information at a particular spatial scale (such as motion of the upper body or a hand), and the whole system operates at two temporal scales. Key to our technique is a training strategy which exploits i) careful initialization of individual modalities; and ii) gradual fusion of modalities from strongest to weakest cross-modality structure. We present experiments on the ChaLearn 2014 Looking at People Challenge gesture recognition track, in which we placed first out of 17 teams.

Download to read the full chapter text

Chapter PDF

On Multi-modal Fusion for Freehand Gesture Recognition

Lightweight Deep Learning Models for Robust Hand Gesture Recognition

Gesture MNIST: A New Free-Hand Gesture Dataset

Keywords

References

Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: CVPR (2014)
Google Scholar
Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., LeCun, Y.: OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks. In: ICLR (2014)
Google Scholar
Krizhevsky, A., Sutskever, I., Hinton, G.: ImageNet Classification with Deep Convolutional Neural Networks. In: NIPS (2012)
Google Scholar
Farabet, C., Couprie, C., Najman, L., LeCun, Y.: Learning Hierarchical Features for Scene Labeling. PAMI 35(8), 1915–1929 (2013)
Article Google Scholar
Couprie, C., Clment, F., Najman, L., LeCun, Y.: Indoor Semantic Segmentation using depth information. In: ICLR (2014)
Google Scholar
LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), 2278–2324 (1998)
Article Google Scholar
Kahou, S.E., Pal, C., Bouthillier, X., Froumenty, P., Gülçehre, C., Memisevic, R., Vincent, P., Courville, A., Bengio, Y.: Combining modality specific deep neural networks for emotion recognition in video. In: ICMI (2013)
Google Scholar
aigman, Y., Yang, M., Ranzato, M.A., Wolf, L.: DeepFace: Closing the Gap to Human-Level Performance in Face Verification. In: CVPR (2014)
Google Scholar
Baccouche, M., Mamalet, F., Wolf, C., Garcia, C., Baskurt, A.: Spatio-Temporal Convolutional Sparse Auto-Encoder for Sequence Classification. In: BMVC (2012)
Google Scholar
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Li, F.F.: Large-scale Video Classification with Convolutional Neural Networks. In: CVPR (2014)
Google Scholar
Simonyan, K., Zisserman, A.: Two-Stream Convolutional Networks for Action Recognition in Videos. In: arXiv preprint arXiv:1406.2199v1 (2014)
Escalera, S., Baró, X., Gonzàlez, J., Bautista, M.A., Madadi, M., Reyes, M., Ponce, V., Escalante, H.J., Shotton, J., Guyon, I.: ChaLearn Looking at People Challenge 2014: Dataset and Results. In: ECCV ChaLearn Workshop on Looking at People (2014)
Google Scholar
Wang, H., Kläser, A., Schmid, C., Liu, C.L.: Dense trajectories and motion boundary descriptors for action recognition. IJCV 103(1), 60–79 (2013)
Article MathSciNet Google Scholar
Wang, H., Ullah, M.M., Klaser, A., Laptev, I., Schmid, C.: Evaluation of local spatio-temporal features for action recognition. BMVC 124(1-124), 11 (2009)
Google Scholar
Dollár, P., Rabaud, V., Cottrell, G., Belongie, S.: Behavior Recognition via Sparse Spatio-Temporal Features. In: 2nd Joint IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance (2005)
Google Scholar
Laptev, I., Marszałek, M., Schmid, C., Rozenfeld, B.: Learning realistic human actions from movies. In: CVPR (2008)
Google Scholar
Kläser, A., Marszałek, M., Schmid, C.: A spatio-temporal descriptor based on 3D-gradients. In: BMVC (2008)
Google Scholar
Willems, G., Tuytelaars, T., Van Gool, L.: An Efficient Dense and Scale-Invariant Spatio-Temporal Interest Point Detector. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part II. LNCS, vol. 5303, pp. 650–663. Springer, Heidelberg (2008)
Chapter Google Scholar
Keskin, C., Kiraç, F., Kara, Y., Akarun, L.: Real time hand pose estimation using depth sensors. In: ICCV Workshop on Consumer Depth Cameras. IEEE (2011)
Google Scholar
Półrola, M., Wojciechowski, A.: Real-Time Hand Pose Estimation Using Classifiers. In: Bolc, L., Tadeusiewicz, R., Chmielewski, L.J., Wojciechowski, K. (eds.) ICCVG 2012. LNCS, vol. 7594, pp. 573–580. Springer, Heidelberg (2012)
Chapter Google Scholar
Tang, D., Yu, T.H., Kim, T.K.: Real-time Articulated Hand Pose Estimation using Semi-supervised Transductive Regression Forests. In: ICCV (2013)
Google Scholar
Tompson, J., Stein, M., LeCun, Y., Perlin, K.: Real-Time Continuous Pose Recovery of Human Hands Using Convolutional Networks. ACM Transaction on Graphics (2014)
Google Scholar
Oikonomidis, I., Kyriazis, N., Argyros, A.: Efficient model-based 3D tracking of hand articulations using Kinect. BMVC 101(1–101), 11 (2011)
Google Scholar
Qian, C., Sun, X., Wei, Y., Tang, X., Sun, J.: Realtime and Robust Hand Tracking from Depth. In: CVPR (2014)
Google Scholar
Wang, F., Li, Y.: Beyond Physical Connections: Tree Models in Human Pose Estimation. In: CVPR (2013)
Google Scholar
Tang, D., Chang, H.J., Tejani, A., Kim, T.K.: Latent Regression Forest: Structured Estimation of 3D Articulated Hand Posture. In: CVPR (2014)
Google Scholar
Wang, J., Liu, Z., Wu, Y., Yuan, J.: Mining actionlet ensemble for action recognition with depth cameras. In: CVPR (2012)
Google Scholar
Sung, J., Ponce, C., Selman, B., Saxena, A.: Unstructured Human Activity Detection from RGBD Images. In: ICRA (2012)
Google Scholar
Chen, X., Koskela, M.: Online RGB-D gesture recognition with extreme learning machines. In: ICMI (2013)
Google Scholar
Nandakumar, K., Wah, W.K., Alice, C.S.M., Terence, N.W.Z., Gang, W.J., Yun, Y.W.: A Multi-modal Gesture Recognition System Using Audio, Video, and Skeletal Joint Data Categories and Subject Descriptors. In: 2013 Multi-modal Challenge Workshop in Conjunction with ICMI (2013)
Google Scholar
Le, Q.V., Zou, W.Y., Yeung, S.Y., Ng, A.Y.: Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In: CVPR, pp. 3361–3368 (2011)
Google Scholar
Ranzato, M., Huang, F.J., Boureau, Y.L., LeCun, Y.: Unsupervised Learning of Invariant Feature Hierarchies with Applications to Object Recognition. In: CVPR (2007)
Google Scholar
Chen, B., Ting, J.A., Marlin, B., de Freitas, N.: Deep learning of invariant Spatio-Temporal Features from Video. In: NIPS Workshop on Deep Learning and Unsupervised Feature Learning (2010)
Google Scholar
Ji, S., Xu, W., Yang, M., Yu, K.: 3D Convolutional Neural Networks for Human Action Recognition. PAMI 35(1), 221–231 (2013)
Article Google Scholar
Ngiam, J., Khosla, A., Kin, M., Nam, J., Lee, H., Ng, A.Y.: Multimodal deep learning. In: ICML (2011)
Google Scholar
Srivastava, N., Salakhutdinov, R.: Multimodal learning with Deep Boltzmann Machines. In: NIPS (2013)
Google Scholar
Neverova, N., Wolf, C., Paci, G., Sommavilla, G., Taylor, G.W., Nebout, F.: A multi-scale approach to gesture detection and recognition. In: ICCV Workshop on Understanding Human Activities: Context and Interactions (HACI) (2013)
Google Scholar
Zanfir, M., Leordeanu, M., Sminchisescu, C.: The Moving Pose: An Efficient 3D Kinematics Descriptor for Low-Latency Action Recognition and Detection. In: ICCV (2013)
Google Scholar
Bengio, Y., Louradour, J., Collobert, R., Weston, J.: Curriculum learning. In: ICMlL (2009)
Google Scholar
Wu, D.: Deep Dynamic Neural Networks for Gesture Segmentation and Recognition. In: ECCV ChaLearn Workshop on Looking at People (2014)
Google Scholar
Monnier, C., German, S., Ost, A.: A Multi-scale Boosted Detector for Efficient and Robust Gesture Recognition. In: ECCV ChaLearn Workshop on Looking at People (2014)
Google Scholar
Camgoz, N.C., Kindiroglu, A.A., Akarun, L.: Gesture Recognition using Template Based Random Forest Classifiers. In: ECCV ChaLearn Workshop on Looking at People (2014)
Google Scholar
Chang, J.Y.: Nonparametric Gesture Labeling from Multi-modal Data. In: ECCV ChaLearn Workshop on Looking at People (2014)
Google Scholar
Evangelidis, G., Singh, G., Horaud, R.: Continuous gesture recognition from articulated poses. In: ECCV ChaLearn Workshop on Looking at People (2014)
Google Scholar
Peng, X., Wang, L., Cai, Z.: Action and Gesture Temporal Spotting with Super Vector Representation. In: ECCV ChaLearn Workshop on Looking at People (2014)
Google Scholar
Pigou, L., Dieleman, S., Kindermans, P.J.: Sign Language Recognition Using Convolutional Neural Networks. In: ECCV ChaLearn Workshop on Looking at People (2014)
Google Scholar
Chen, G., Clarke, D., Giuliani, M., Weikersdorfer, D., Knoll, A.: Multi-modality Gesture Detection and Recognition With Un-supervision, Randomization and Discrimination. In: ECCV ChaLearn Workshop on Looking at People (2014)
Google Scholar
Dalal, N., Triggs, B.: Histograms of Oriented Gradients for Human Detection. In: CVPR (2005)
Google Scholar
Lazebnik, S., Schmid, C., Ponce, J.: Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. In: CVPR (2006)
Google Scholar
Geurts, P., Ernst, D., Wehenkel, L.: Extremely randomized trees. Machine Learning 63(1), 3–42 (2006)
Article MATH Google Scholar
Breiman, L., Friedman, J., Stone, C.J., Olshen, R.A.: Classification and regression trees (1984)
Google Scholar

Download references

Author information

Authors and Affiliations

Université de Lyon, CNRS, Lyon, France
Natalia Neverova & Christian Wolf
INSA-Lyon, LIRIS, UMR5205, 69621, Villeurbanne cedex, France
Natalia Neverova & Christian Wolf
University of Guelph, Guelph, Canada
Graham W. Taylor
Awabot, Lyon, France
Florian Nebout

Authors

Natalia Neverova
View author publications
You can also search for this author in PubMed Google Scholar
Christian Wolf
View author publications
You can also search for this author in PubMed Google Scholar
Graham W. Taylor
View author publications
You can also search for this author in PubMed Google Scholar
Florian Nebout
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Natalia Neverova .

Editor information

Editors and Affiliations

University College London, London, United Kingdom
Lourdes Agapito
University of Lugano, Lugano, Switzerland
Michael M. Bronstein
Technische Universität Dresden, Dresden, Germany
Carsten Rother

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Neverova, N., Wolf, C., Taylor, G.W., Nebout, F. (2015). Multi-scale Deep Learning for Gesture Detection and Localization. In: Agapito, L., Bronstein, M., Rother, C. (eds) Computer Vision - ECCV 2014 Workshops. ECCV 2014. Lecture Notes in Computer Science(), vol 8925. Springer, Cham. https://doi.org/10.1007/978-3-319-16178-5_33

Download citation

DOI: https://doi.org/10.1007/978-3-319-16178-5_33
Published: 19 March 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-16177-8
Online ISBN: 978-3-319-16178-5
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics