Fully Convolutional Networks for Continuous Sign Language Recognition

Ka Leong Cheng, Zhaoyang Yang, Qifeng Chen & Yu-Wing Tai

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12369)

Included in the following conference series:

European Conference on Computer Vision


Abstract

Continuous sign language recognition (SLR) is a challenging task that requires learning both the spatial and temporal dimensions of signing frame sequences. Most recent work accomplishes this with hybrid CNN-RNN networks. However, training these networks is generally non-trivial, and most of them fail to learn unseen sequence patterns, which leads to unsatisfactory performance in online recognition. In this paper, we propose a fully convolutional network (FCN) for online SLR that concurrently learns spatial and temporal features from weakly annotated video sequences for which only sentence-level annotations are given. A gloss feature enhancement (GFE) module is introduced in the proposed network to enforce better sequence alignment learning. The proposed network is end-to-end trainable without any pre-training. We conduct experiments on two large-scale SLR datasets. Experiments show that our method for continuous SLR is effective and performs well in online recognition.
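The abstract states that the network learns from sentence-level gloss annotations only, with no frame-level alignment. It does not name the training objective; a standard choice for this setting is connectionist temporal classification (CTC; Graves et al., 2006). The following is a minimal pure-Python illustration of CTC under that assumption, not the authors' implementation: `ctc_nll` computes the sentence-level loss with the CTC forward algorithm, and `greedy_decode` turns per-frame gloss probabilities into a gloss sequence, the kind of output an online recogniser emits. The function names, the blank index 0, and the toy probabilities are hypothetical; a real system would apply a framework routine such as PyTorch's `torch.nn.CTCLoss` to network outputs.

```python
import math
from typing import List, Sequence

BLANK = 0  # assumption: index 0 of each frame's distribution is the CTC blank


def greedy_decode(frame_probs: Sequence[Sequence[float]], blank: int = BLANK) -> List[int]:
    """Best-path decoding: argmax gloss per frame, merge repeats, drop blanks."""
    best = [max(range(len(p)), key=lambda k: p[k]) for p in frame_probs]
    glosses: List[int] = []
    prev = None
    for k in best:
        # A gloss is emitted only when it differs from the previous frame's label.
        if k != prev and k != blank:
            glosses.append(k)
        prev = k
    return glosses


def _logsumexp(*xs: float) -> float:
    """Numerically stable log(sum(exp(x))) that tolerates -inf inputs."""
    m = max(xs)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(x - m) for x in xs))


def ctc_nll(frame_probs: Sequence[Sequence[float]], target: Sequence[int],
            blank: int = BLANK) -> float:
    """Negative log-likelihood of a gloss sentence under CTC (forward algorithm)."""
    # Extended label sequence: blank, g1, blank, g2, ..., gN, blank.
    ext = [blank]
    for g in target:
        ext += [g, blank]
    num_states = len(ext)

    def log_p(t: int, s: int) -> float:
        p = frame_probs[t][ext[s]]
        return math.log(p) if p > 0 else -math.inf

    # alpha[s]: log-probability of all valid partial paths ending in state s.
    alpha = [-math.inf] * num_states
    alpha[0] = log_p(0, 0)
    if num_states > 1:
        alpha[1] = log_p(0, 1)
    for t in range(1, len(frame_probs)):
        new_alpha = [-math.inf] * num_states
        for s in range(num_states):
            terms = [alpha[s]]  # stay in the same state
            if s >= 1:
                terms.append(alpha[s - 1])  # advance by one state
            # Skip over a blank, unless that would merge two identical glosses.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(alpha[s - 2])
            new_alpha[s] = _logsumexp(*terms) + log_p(t, s)
        alpha = new_alpha
    # Valid paths end on the last gloss or on the trailing blank.
    if num_states > 1:
        total = _logsumexp(alpha[-1], alpha[-2])
    else:
        total = alpha[-1]
    return -total


# Toy per-frame distributions over {blank, gloss 1, gloss 2}; values are made up.
probs = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1], [0.1, 0.8, 0.1],
         [0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
print(greedy_decode(probs))
print(ctc_nll(probs, [1, 2]))
```

With these toy values the per-frame argmaxes are blank, 1, 1, blank, 2, which collapse to the gloss sequence [1, 2]. The loss sums over every frame path that collapses to the target sentence, which is why no frame-level alignment is needed; a blank between two identical glosses is what allows a repeated gloss to be emitted.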





Author information

Authors and Affiliations

The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong
Ka Leong Cheng, Qifeng Chen & Yu-Wing Tai
Tencent, Shenzhen, China
Zhaoyang Yang
Kwai Inc., Beijing, China
Yu-Wing Tai

Corresponding author

Correspondence to Yu-Wing Tai.

Editor information

Editors and Affiliations

University of Oxford, Oxford, UK
Andrea Vedaldi
Graz University of Technology, Graz, Austria
Horst Bischof
University of Freiburg, Freiburg im Breisgau, Germany
Thomas Brox
University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
Jan-Michael Frahm

Electronic supplementary material

Supplementary material 1 (zip 17871 KB)


About this paper

Cite this paper

Cheng, K.L., Yang, Z., Chen, Q., Tai, YW. (2020). Fully Convolutional Networks for Continuous Sign Language Recognition. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, JM. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science, vol 12369. Springer, Cham. https://doi.org/10.1007/978-3-030-58586-0_41


DOI: https://doi.org/10.1007/978-3-030-58586-0_41
Published: 30 November 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58585-3
Online ISBN: 978-3-030-58586-0
eBook Packages: Computer Science, Computer Science (R0)
