
KSRB-Net: a continuous sign language recognition deep learning strategy based on motion perception mechanism

  • Original article
  • Published in: The Visual Computer

Abstract

Continuous sign language recognition (CSLR) is the intricate task of transcribing continuous sign language video streams into sentences. Deep learning-based CSLR systems typically comprise a visual encoder that extracts features from the input and a sequence learning model that maps the resulting feature sequence to sentence-level labels. The complex nature of sign language, with its extensive vocabulary and many similar gestures and motions, makes CSLR particularly challenging. Moreover, because signing glosses are unavailable for frame-level alignment, CSLR is only weakly supervised: each word in a sentence must be labeled in detail, which limits the amount of training data available. In this paper, we propose a CSLR framework named KSRB-Net to address these critical problems. The proposed method incorporates a practical module that efficiently captures frame-wise motion information and spatio-temporal context, and that can be embedded into existing feature extraction modules. In addition, a keyframe extraction algorithm designed around the characteristics of sign language datasets significantly accelerates model training and reduces the risk of overfitting. Finally, connectionist temporal classification is employed as the objective function to capture the alignment proposal. The proposed method is validated on three datasets: the Chinese TJUT-SLRT, the Chinese USTC-CSL, and the German RWTH-Phoenix-Weather-2014. Experimental results demonstrate that KSRB-Net achieves 98.40% accuracy and outperforms state-of-the-art methods in both efficiency and accuracy.
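The abstract's final training step uses connectionist temporal classification (CTC) to align per-frame predictions with an unsegmented sentence-level gloss sequence. As an illustrative sketch only (not the paper's implementation; the function name and the tiny uniform-probability example are hypothetical), the standard CTC forward algorithm below computes the negative log-likelihood of a gloss sequence given per-frame log-probabilities:

```python
import math

def ctc_neg_log_likelihood(log_probs, target, blank=0):
    """CTC forward algorithm.
    log_probs: T x C table of per-frame log-probabilities.
    target: gloss label sequence (no blanks).
    Returns the negative log-likelihood summed over all valid alignments."""
    # Extended label sequence with blanks interleaved: ^ g1 ^ g2 ^ ...
    ext = [blank]
    for g in target:
        ext += [g, blank]
    S = len(ext)
    NEG_INF = float("-inf")

    def logsumexp(*xs):
        m = max(xs)
        if m == NEG_INF:
            return NEG_INF
        return m + math.log(sum(math.exp(x - m) for x in xs))

    # alpha[s] = log-prob of all alignments of the first t frames ending at ext[s]
    alpha = [NEG_INF] * S
    alpha[0] = log_probs[0][ext[0]]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]
    for t in range(1, len(log_probs)):
        new = [NEG_INF] * S
        for s in range(S):
            paths = alpha[s]                      # stay on the same symbol
            if s > 0:
                paths = logsumexp(paths, alpha[s - 1])   # advance one symbol
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                paths = logsumexp(paths, alpha[s - 2])   # skip a blank
            new[s] = paths + log_probs[t][ext[s]]
        alpha = new
    # Valid alignments end on the last label or the trailing blank.
    return -logsumexp(alpha[S - 1], alpha[S - 2] if S > 1 else NEG_INF)

# Tiny check: 2 frames, 3 classes, uniform probabilities, target [1].
# Three valid alignments (blank-1, 1-blank, 1-1), each with prob 1/9,
# so P = 1/3 and the NLL is ln 3.
C, T = 3, 2
uniform = [[math.log(1.0 / C)] * C for _ in range(T)]
nll = ctc_neg_log_likelihood(uniform, [1])
print(nll)  # ≈ 1.0986 (ln 3)
```

In practice, a deep-learning CSLR pipeline would obtain `log_probs` from the visual encoder and sequence model and use a library implementation of this loss for training.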


Data availability statement

The datasets used in this article are available at http://home.ustc.edu.cn/~pjh/openresources/cslr-dataset-2015/index.html, https://www-i6.informatik.rwth-aachen.de/~koller/RWTH-PHOENIX/ and http://ylr.tjut.edu.cn/kxyj.htm.


Author information


Corresponding author

Correspondence to Jianhua Zhang.

Ethics declarations

Conflicts of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Xiao, F., Zhu, Y., Liu, R. et al. KSRB-Net: a continuous sign language recognition deep learning strategy based on motion perception mechanism. Vis Comput 40, 7845–7858 (2024). https://doi.org/10.1007/s00371-023-03211-3
