Abstract
The goal of this work is to detect and track the articulated pose of a human in signing videos of more than one hour in length. In particular we wish to accurately localise hands and arms, despite fast motion and a cluttered and changing background.
We cast the problem as inference in a generative model of the image, and propose a complete model which accounts for self-occlusion of the arms. Under this model, limb detection is expensive due to the very large number of possible configurations each part can assume. We make the following contributions to reduce this cost: (i) efficient sampling from a pictorial structure proposal distribution to obtain reasonable configurations; (ii) identifying a large number of frames where configurations can be correctly inferred, and exploiting temporal tracking elsewhere.
Results are reported for signing footage with challenging image conditions and for different signers. We show that the method is able to identify the true arm and hand locations with high reliability. The results exceed the state-of-the-art for the length and stability of continuous limb tracking.
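As an illustration of contribution (i), the following is a minimal sketch of drawing configurations from a tree-structured proposal and then re-scoring them under a more expensive cost. It is not the authors' implementation. It assumes a simple three-part chain (for example shoulder, elbow, wrist) with a small discrete set of candidate states per part, and it assumes that the sampled configurations are afterwards ranked by a separate cost function standing in for the complete model. The names `sample_chain`, `best_under_full_cost` and `full_cost` are hypothetical.

```python
import numpy as np


def sample_chain(unary, pairwise, n_samples, rng):
    """Draw exact samples from a chain-structured proposal distribution.

    unary[k][i]       : non-negative score of part k being in state i.
    pairwise[k][i, j] : non-negative compatibility of part k in state i
                        with part k + 1 in state j.
    Returns an integer array of shape (n_samples, number_of_parts).
    """
    n_parts = len(unary)

    # Backward messages: beta[k][i] sums the potentials of all parts
    # after k, given that part k is in state i.
    beta = [None] * n_parts
    beta[-1] = np.ones_like(unary[-1], dtype=float)
    for k in range(n_parts - 2, -1, -1):
        beta[k] = pairwise[k] @ (unary[k + 1] * beta[k + 1])

    samples = np.empty((n_samples, n_parts), dtype=int)
    for s in range(n_samples):
        # Sample the first part from its marginal.
        p = unary[0] * beta[0]
        state = rng.choice(len(p), p=p / p.sum())
        samples[s, 0] = state
        # Sample each following part conditioned on its predecessor.
        for k in range(1, n_parts):
            p = pairwise[k - 1][state] * unary[k] * beta[k]
            state = rng.choice(len(p), p=p / p.sum())
            samples[s, k] = state
    return samples


def best_under_full_cost(samples, full_cost):
    """Return the sampled configuration with the lowest full-model cost.

    full_cost is a placeholder (an assumption of this sketch) for an
    expensive cost that, unlike the chain proposal, could account for
    effects such as self-occlusion.
    """
    costs = np.array([full_cost(config) for config in samples])
    return samples[int(np.argmin(costs))]
```

The design choice illustrated here is that the chain model is cheap to sample from exactly, so the expensive cost only needs to be evaluated on a modest number of plausible configurations rather than on every possible one.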
Cite this article
Buehler, P., Everingham, M., Huttenlocher, D.P. et al. Upper Body Detection and Tracking in Extended Signing Sequences. Int J Comput Vis 95, 180–197 (2011). https://doi.org/10.1007/s11263-011-0480-9