Upper Body Detection and Tracking in Extended Signing Sequences

Patrick Buehler¹,
Mark Everingham²,
Daniel P. Huttenlocher³ &
…
Andrew Zisserman¹

Abstract

The goal of this work is to detect and track the articulated pose of a human in signing videos of more than one hour in length. In particular we wish to accurately localise hands and arms, despite fast motion and a cluttered and changing background.

We cast the problem as inference in a generative model of the image, and propose a complete model which accounts for self-occlusion of the arms. Under this model, limb detection is expensive due to the very large number of possible configurations each part can assume. We make the following contributions to reduce this cost: (i) efficient sampling from a pictorial structure proposal distribution to obtain reasonable configurations; (ii) identifying a large number of frames where configurations can be correctly inferred, and exploiting temporal tracking elsewhere.
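As a rough illustration of contribution (i), the sketch below draws joint configurations from a tree-structured model by passing messages from the leaf to the root and then sampling each part given its parent. It is not the authors' implementation. The three-part chain (shoulder, elbow, wrist), the 1-D discretised locations, and the names `sample_chain`, `unary`, and `pairwise` are all assumptions made for this example.

```python
import random


def sample_chain(unary, pairwise, rng):
    """Draw one configuration from p(x) proportional to
    prod_i unary[i][x_i] * prod_i pairwise(x_i, x_{i+1}).

    unary    -- hypothetical per-part appearance scores, unary[i][location]
    pairwise -- hypothetical compatibility between a parent and child location
    rng      -- a random.Random instance
    """
    n = len(unary)      # number of parts in the chain
    k = len(unary[0])   # number of discrete locations per part

    # Backward pass: msg[i][a] sums the scores of all configurations of
    # parts i+1..n-1 that are compatible with part i sitting at location a.
    msg = [[1.0] * k for _ in range(n)]
    for i in range(n - 2, -1, -1):
        for a in range(k):
            msg[i][a] = sum(
                pairwise(a, b) * unary[i + 1][b] * msg[i + 1][b]
                for b in range(k)
            )

    # Forward pass: sample the root from its marginal, then each child
    # from its conditional distribution given the sampled parent.
    weights = [unary[0][a] * msg[0][a] for a in range(k)]
    config = [rng.choices(range(k), weights=weights)[0]]
    for i in range(1, n):
        parent = config[-1]
        weights = [pairwise(parent, b) * unary[i][b] * msg[i][b] for b in range(k)]
        config.append(rng.choices(range(k), weights=weights)[0])
    return config


def near(a, b):
    """Toy kinematic constraint: connected parts lie at most one cell apart."""
    return 1.0 if abs(a - b) <= 1 else 0.0


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy appearance scores for shoulder, elbow, wrist over five locations.
    scores = [
        [0.1, 0.6, 0.2, 0.05, 0.05],
        [0.1, 0.2, 0.4, 0.2, 0.1],
        [0.05, 0.05, 0.2, 0.3, 0.4],
    ]
    for _ in range(3):
        print(sample_chain(scores, near, rng))
```

Each sample costs time linear in the number of parts once the messages are computed, so many plausible configurations can be drawn cheaply and then scored under a more expensive model.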

Results are reported for signing footage with challenging image conditions and for different signers. We show that the method is able to identify the true arm and hand locations with high reliability. The results exceed the state of the art for the length and stability of continuous limb tracking.

Author information

Authors and Affiliations

Department of Engineering Science, University of Oxford, Oxford, UK
Patrick Buehler & Andrew Zisserman
School of Computing, University of Leeds, Leeds, UK
Mark Everingham
Computer Science Department, Cornell University, Cornell, USA
Daniel P. Huttenlocher

Corresponding author

Correspondence to Mark Everingham.

About this article

Cite this article

Buehler, P., Everingham, M., Huttenlocher, D.P. et al. Upper Body Detection and Tracking in Extended Signing Sequences. Int J Comput Vis 95, 180–197 (2011). https://doi.org/10.1007/s11263-011-0480-9

Received: 08 November 2009
Accepted: 24 June 2011
Published: 12 July 2011
Issue Date: November 2011
DOI: https://doi.org/10.1007/s11263-011-0480-9
