Abstract
We present a method for gesture detection and localization based on multi-scale and multi-modal deep learning. Each visual modality captures spatial information at a particular spatial scale (such as motion of the upper body or a hand), and the whole system operates at two temporal scales. Key to our technique is a training strategy that exploits (i) careful initialization of individual modalities and (ii) gradual fusion of modalities, from strongest to weakest cross-modality structure. We present experiments on the ChaLearn 2014 Looking at People Challenge gesture recognition track, in which we placed first out of 17 teams.
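The two-part training strategy can be illustrated with a minimal sketch. It is one possible reading of the abstract, not the implementation used in the paper. The sketch assumes that each modality is first trained on its own, that the resulting weight matrices are placed on the diagonal of a shared fusion layer, and that cross-modality weights are unlocked stage by stage through a binary mask. The helper names (`block_diagonal_init`, `fusion_mask`), the three example modalities and the `STAGES` schedule are all hypothetical.

```python
import numpy as np


def block_diagonal_init(per_modality_weights):
    """Place independently trained weight matrices on the diagonal of one fusion matrix.

    All cross-modality entries start at zero (an assumption of this sketch).
    """
    rows = sum(w.shape[0] for w in per_modality_weights)
    cols = sum(w.shape[1] for w in per_modality_weights)
    fused = np.zeros((rows, cols))
    r = c = 0
    for w in per_modality_weights:
        fused[r:r + w.shape[0], c:c + w.shape[1]] = w
        r += w.shape[0]
        c += w.shape[1]
    return fused


def fusion_mask(in_sizes, out_sizes, groups):
    """Binary mask allowing connections only between modalities in the same group."""
    in_off = np.concatenate(([0], np.cumsum(in_sizes)))
    out_off = np.concatenate(([0], np.cumsum(out_sizes)))
    mask = np.zeros((sum(in_sizes), sum(out_sizes)))
    for group in groups:
        for i in group:
            for j in group:
                mask[in_off[i]:in_off[i + 1], out_off[j]:out_off[j + 1]] = 1.0
    return mask


# Hypothetical schedule for three modalities: 0 = left hand, 1 = right hand,
# 2 = upper body. The most strongly related pair is fused first.
STAGES = [
    [[0], [1], [2]],  # stage 0: modalities stay independent
    [[0, 1], [2]],    # stage 1: fuse the two hand streams
    [[0, 1, 2]],      # stage 2: fuse everything
]
```

During training under these assumptions, the gradient of the fusion layer would be multiplied elementwise by the mask of the current stage, so that cross-modality weights remain at zero until their group is unlocked.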
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Neverova, N., Wolf, C., Taylor, G.W., Nebout, F. (2015). Multi-scale Deep Learning for Gesture Detection and Localization. In: Agapito, L., Bronstein, M., Rother, C. (eds) Computer Vision - ECCV 2014 Workshops. ECCV 2014. Lecture Notes in Computer Science, vol 8925. Springer, Cham. https://doi.org/10.1007/978-3-319-16178-5_33
DOI: https://doi.org/10.1007/978-3-319-16178-5_33
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-16177-8
Online ISBN: 978-3-319-16178-5
eBook Packages: Computer Science, Computer Science (R0)