Novel Spatio-Temporal Continuous Sign Language Recognition Using an Attentive Multi-Feature Network
Figure 1. The overall architecture of the proposed method consists of three components: a spatial module, a temporal module, and a sequence learning module. The spatial module first takes the image sequence and extracts frame-wise features; the temporal module then extracts temporal features, which are sent to the sequence learning module to predict words and assemble them into a sentence.
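A minimal PyTorch-style sketch of this three-module data flow is given below. Every layer choice, dimension, and the vocabulary size are illustrative assumptions for readability, not the authors' implementation.

```python
# Sketch of the Figure 1 pipeline: spatial -> temporal -> sequence learning.
# All module internals and sizes are assumptions, not the paper's exact model.
import torch
import torch.nn as nn

class CSLRPipeline(nn.Module):
    def __init__(self, feat_dim=512, hidden_dim=256, vocab_size=1296):
        super().__init__()
        # Spatial module: frame-wise feature extractor (a deeper 2D CNN in practice).
        self.spatial = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim))
        # Temporal module: 1D convolution over the frame axis.
        self.temporal = nn.Conv1d(feat_dim, hidden_dim, kernel_size=5, padding=2)
        # Sequence learning module: Bi-LSTM plus a per-step gloss classifier.
        self.seq = nn.LSTM(hidden_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * hidden_dim, vocab_size)

    def forward(self, frames):                  # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        f = self.spatial(frames.flatten(0, 1))  # frame-wise features: (B*T, feat_dim)
        f = f.view(b, t, -1).transpose(1, 2)    # (B, feat_dim, T) for Conv1d
        f = self.temporal(f).transpose(1, 2)    # temporal features: (B, T, hidden_dim)
        out, _ = self.seq(f)                    # sequence context: (B, T, 2*hidden_dim)
        return self.classifier(out)             # per-step gloss logits
```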
Figure 2. The spatial module architecture uses multi-stream input: an RGB stream for the full-frame feature and a keypoint stream for the keypoint features.
Figure 3. Full-frame feature using the RGB image; the left image is the original image, and the right image is cropped to fit the proposed model.
Figure 4. Keypoint features of the RWTH-PHOENIX dataset [33,39]; the left image shows extraction from the RGB image, and the right image shows the selected keypoints used by the proposed model.
Figure 5. The temporal module architecture consists of stacked 1D-CNN and pooling layers embedded with an attention module. It processes both feature streams in parallel, concatenates them at the end of the stacked layers, and produces a single temporal feature with a sequence length four times smaller.
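The four-fold length reduction suggests two stride-2 pooling stages per stream. A hedged sketch of such a two-stream temporal module follows; channel widths are assumptions, and the embedded attention modules are omitted for brevity.

```python
# Two parallel stacks of 1D-CNN + pooling blocks (as in Figure 5), fused by
# concatenation. Two stride-2 pools shrink the sequence length to T/4.
import torch
import torch.nn as nn

def conv_pool_block(in_ch, out_ch):
    # One stacked stage: 1D convolution, nonlinearity, then length-halving pool.
    return nn.Sequential(
        nn.Conv1d(in_ch, out_ch, kernel_size=5, padding=2),
        nn.ReLU(),
        nn.MaxPool1d(kernel_size=2, stride=2))

class TemporalModule(nn.Module):
    def __init__(self, rgb_dim=512, kp_dim=128, hidden=256):
        super().__init__()
        self.rgb_branch = nn.Sequential(
            conv_pool_block(rgb_dim, hidden), conv_pool_block(hidden, hidden))
        self.kp_branch = nn.Sequential(
            conv_pool_block(kp_dim, hidden), conv_pool_block(hidden, hidden))

    def forward(self, rgb_feat, kp_feat):  # both: (B, channels, T)
        r = self.rgb_branch(rgb_feat)      # (B, hidden, T/4)
        k = self.kp_branch(kp_feat)        # (B, hidden, T/4)
        return torch.cat([r, k], dim=1)    # single fused temporal feature
```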
Figure 6. Illustration of defective frames in the RWTH-PHOENIX dataset [33,39]. Some of the keypoints in the hand area are in the wrong position due to blurry images.
Figure 7. Attention modules are embedded in the spatial and temporal modules in different configurations.
Figure 8. Qualitative evaluation of the recognition results using different configurations on the RWTH-PHOENIX dataset [33,39]. Incorrectly predicted glosses are marked in red.
Abstract
1. Introduction
- We introduce a novel temporal attention mechanism in the sequence module to capture the important time points that contribute to the final output;
- We introduce a multi-feature input consisting of the full-frame feature, taken from the RGB values of the frame, as the main feature, and keypoint features, which include the body pose with hand-shape detail, as an additional feature to enhance recognition performance;
- We use the WER metric (sketched below) to show experimentally that our proposed STAMF model outperforms state-of-the-art models on both CSLR benchmark datasets.
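For reference, WER is the minimum number of substitutions, deletions, and insertions needed to turn the predicted gloss sequence into the reference, divided by the reference length. A minimal sketch, with hypothetical example glosses:

```python
# Standard WER via dynamic-programming edit distance over gloss sequences.
def wer(reference: str, hypothesis: str) -> float:
    r, h = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference glosses
    # and the first j hypothesis glosses.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(r)][len(h)] / max(len(r), 1)

# One substituted gloss in a four-gloss reference gives WER = 0.25 (25%).
print(wer("MORGEN REGEN WIND STARK", "MORGEN SONNE WIND STARK"))  # 0.25
```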
2. Related Works
3. Proposed Method
3.1. Framework Overview
3.2. Spatial Module
3.2.1. Full-Frame Feature
3.2.2. Keypoint Features
3.3. Temporal Module
3.4. Attention Module
3.5. Sequence Learning
4. Experimental Results and Discussion
4.1. Datasets
4.1.1. CSL Dataset
4.1.2. RWTH-PHOENIX Dataset
4.2. Data Preprocessing
4.3. Evaluation Metric
4.4. Experiment on Input Streams
4.5. Experiment on the Attention Module
4.6. Ablation Experiment on the STAMF Network
4.7. Experiment on the STAMF Network
4.7.1. Implementation Details
4.7.2. Quantitative Result
4.7.3. Qualitative Result
5. Conclusions
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
References
- Dreuw, P.; Rybach, D.; Deselaers, T.; Zahedi, M.; Ney, H. Speech Recognition Techniques for a Sign Language Recognition System. In Proceedings of the INTERSPEECH 2007, 8th Annual Conference of the International Speech Communication Association, Antwerp, Belgium, 27–31 August 2007; pp. 2513–2516. [Google Scholar]
- Ong, S.C.W.; Ranganath, S. Automatic sign language analysis: A Survey and the Future Beyond Lexical Meaning. IEEE Trans. Pattern Anal. Mach. Intell. 2005, 27, 873–891. [Google Scholar] [CrossRef] [PubMed]
- Vogler, C.; Metaxas, D. A Framework for Recognizing the Simultaneous Aspects of American Sign Language. Comput. Vis. Image Underst. 2001, 81, 358–384. [Google Scholar] [CrossRef]
- Bowden, R.; Windridge, D.; Kadir, T.; Zisserman, A.; Brady, M. A Linguistic Feature Vector for The Visual Interpretation of Sign Language. In Proceedings of the European Conference on Computer Vision (ECCV), Prague, Czech Republic, 11–14 May 2004; pp. 390–401. [Google Scholar]
- Kasukurthi, N.; Rokad, B.; Bidani, S.; Dennisan, D.A. American Sign Language Alphabet Recognition using Deep Learning. arXiv 2019, arXiv:1905.05487. [Google Scholar]
- Koller, O.; Zargaran, S.; Ney, H.; Bowden, R. Deep Sign: Enabling Robust Statistical Continuous Sign Language Recognition via Hybrid CNN-HMMs. Int. J. Comput. Vis. 2018, 126, 1311–1325. [Google Scholar] [CrossRef]
- Pu, J.; Zhou, W.; Li, H. Dilated Convolutional Network with Iterative Optimization for Continuous Sign Language Recognition. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 13–19 July 2018; pp. 885–891. [Google Scholar]
- Pu, J.; Zhou, W.; Li, H. Iterative Alignment Network for Continuous Sign Language Recognition. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4160–4169. [Google Scholar]
- Kumar, N. Motion Trajectory Based Human Face and Hands Tracking for Sign Language Recognition. In Proceedings of the 2017 4th IEEE Uttar Pradesh Section International Conference on Electrical, Computer and Electronics, Mathura, India, 26–28 October 2017; pp. 211–216. [Google Scholar]
- Bhuyan, M.K.; Ghosh, D.; Bora, P.K. A Framework for Hand Gesture Recognition with Applications to Sign Language. In Proceedings of the 2006 Annual India Conference, INDICON, New Delhi, India, 15–17 September 2006; pp. 1–6. [Google Scholar]
- Das, S.P.; Talukdar, A.K.; Sarma, K.K. Sign Language Recognition Using Facial Expression. In Procedia Computer Science, Kerala, India, 10–13 August 2015; pp. 210–216. [Google Scholar]
- Rastgoo, R.; Kiani, K.; Escalera, S.; Sabokrou, M. Sign Language Production: A Review. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Nashville, TN, USA, 19–25 June 2021; pp. 3446–3456. [Google Scholar]
- Dong, S.; Wang, P.; Abbas, K. A Survey on Deep Learning and Its Applications. Comput. Sci. Rev. 2021, 40, 100379. [Google Scholar] [CrossRef]
- Athitsos, V.; Neidle, C.; Sclaroff, S.; Nash, J.; Stefan, A.; Yuan, Q.; Thangali, A. The American Sign Language Lexicon Video Dataset. In Proceedings of the 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops, Anchorage, AK, USA, 23–28 June 2008; pp. 1–8. [Google Scholar]
- Bungeroth, J.; Stein, D.; Dreuw, P.; Ney, H.; Morrissey, S.; Way, A.; Zijl, L.V. The ATIS Sign Language Corpus. In Proceedings of the 6th International Conference on Language Resources and Evaluation, LREC 2008, Marrakech, Morocco, 28–30 May 2008; pp. 2943–2946. [Google Scholar]
- Papastratis, I.; Chatzikonstantinou, C.; Konstantinidis, D.; Dimitropoulos, K.; Daras, P. Artificial Intelligence Technologies for Sign Language. Sensors 2021, 21, 5843. [Google Scholar] [CrossRef] [PubMed]
- Zhou, H.; Zhou, W.; Zhou, Y.; Li, H. Spatial-temporal Multi-cue Network for Continuous Sign Language Recognition. In Proceedings of the AAAI 2020—The Thirty-Fourth AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 13009–13016. [Google Scholar]
- Gündüz, C.; Polat, H. Turkish sign language recognition based on multistream data fusion. Turkish J. Electr. Eng. Comput. Sci. 2021, 29, 1171–1186. [Google Scholar] [CrossRef]
- Bohacek, M.; Hruz, M. Sign Pose-based Transformer for Word-level Sign Language Recognition. In Proceedings of the 2022 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops, WACVW, Waikoloa, HI, USA, 4–8 January 2022; pp. 182–191. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 1–11. [Google Scholar]
- Zhou, M.; Ng, M.; Cai, Z.; Cheung, K.C. Self-attention Based Fully-Inception Networks for Continuous Sign Language Recognition. Front. Artif. Intell. Appl. 2020, 325, 2832–2839. [Google Scholar]
- Camgöz, N.C.; Koller, O.; Hadfield, S.; Bowden, R. Sign Language Transformers: Joint end-to-end Sign Language Recognition and Translation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 10020–10030. [Google Scholar]
- Min, Y.; Hao, A.; Chai, X.; Chen, X. Visual Alignment Constraint for Continuous Sign Language Recognition. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 11542–11551. [Google Scholar]
- Guo, D.; Zhou, W.; Wang, M.; Li, H. Sign Language Recognition Based On Adaptive HMMs with Data Augmentation. In Proceedings of the IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; pp. 2876–2880. [Google Scholar]
- Huang, J.; Zhou, W.; Li, H.; Li, W. Sign Language Recognition Using 3D Convolutional Neural Networks. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), Turin, Italy, 29 June–3 July 2015; pp. 1–6. [Google Scholar]
- Guo, D.; Zhou, W.; Li, H.; Wang, M. Online early-late fusion based on adaptive HMM for sign language recognition. ACM Trans. Multimed. Comput. Commun. Appl. 2017, 14, 1–18. [Google Scholar] [CrossRef]
- Al-Hammadi, M.; Muhammad, G. Hand Gesture Recognition for Sign Language Using 3DCNN. IEEE Access 2020, 8, 79491–79509. [Google Scholar] [CrossRef]
- Vaezi Joze, H.R.; Koller, O. MS-ASL: A Large-Scale Data Set and Benchmark for Understanding American Sign Language. arXiv 2019, arXiv:1812.01053. [Google Scholar]
- Li, D.; Opazo, C.R.; Yu, X.; Li, H. Word-level Deep Sign Language Recognition from Video: A New Large-scale Dataset and Methods Comparison. In Proceedings of the 2020 IEEE Winter Conference on Applications of Computer Vision, WACV, Snowmass Village, CO, USA, 1–5 March 2020; pp. 1448–1458. [Google Scholar]
- Pu, J.; Zhou, W.; Li, H. Sign Language Recognition with Multi-modal Features. In Proceedings of the Pacific Rim Conference on Multimedia, Xi’an, China, 15–16 September 2016; pp. 252–261. [Google Scholar]
- Jiang, S.; Sun, B.; Wang, L.; Bai, Y.; Li, K.; Fu, Y. Skeleton Aware Multi-modal Sign Language Recognition. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Nashville, TN, USA, 19–25 June 2021; pp. 3408–3418. [Google Scholar]
- Sidig, A.A.I.; Luqman, H.; Mahmoud, S.; Mohandes, M. KArSL: Arabic Sign Language Database. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 2021, 20, 1–19. [Google Scholar] [CrossRef]
- Koller, O.; Forster, J.; Ney, H. Continuous sign language recognition: Towards large vocabulary statistical recognition systems handling multiple signers. Comput. Vis. Image Underst. 2015, 141, 108–125. [Google Scholar] [CrossRef]
- Koller, O.; Camgoz, N.C.; Ney, H. Weakly Supervised Learning with Multi-Stream CNN-LSTM-HMMs to Discover Sequential Parallelism in Sign Language Videos. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 2306–2320. [Google Scholar] [CrossRef] [PubMed]
- Camgoz, N.C.; Hadfield, S.; Koller, O.; Bowden, R. SubUNets: End-to-End Hand Shape and Continuous Sign Language Recognition. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 3075–3084. [Google Scholar]
- Dong, C.; Loy, C.C.; He, K.; Tang, X. Image Super-Resolution Using Deep Convolutional Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 38, 295–307. [Google Scholar] [CrossRef] [PubMed]
- Bressem, K.K.; Adams, L.C.; Erxleben, C.; Hamm, B.; Niehues, S.M.; Vahldiek, J.L. Comparing different deep learning architectures for classification of chest radiographs. Sci. Rep. 2020, 10, 13590. [Google Scholar] [CrossRef]
- Sun, K.; Xiao, B.; Liu, D.; Wang, J. Deep High-resolution Representation Learning for Human Pose Estimation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5686–5696. [Google Scholar]
- Koller, O.; Zargaran, S.; Ney, H. Re-Sign: Re-Aligned End-to-End Sequence Modeling with Deep Recurrent CNN-HMMs. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 3416–3424. [Google Scholar]
- Zhou, H.; Zhou, W.; Li, H. Dynamic pseudo label decoding for continuous sign language recognition. In Proceedings of the 2019 IEEE International Conference on Multimedia and Expo (ICME), Shanghai, China, 18–21 July 2019; pp. 1282–1287. [Google Scholar]
- Xiao, Q.; Chang, X.; Zhang, X.; Liu, X. Video-Based Sign Language Recognition without Temporal Segmentation. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; pp. 2257–2264. [Google Scholar]
- Graves, A.; Fernández, S.; Gomez, F.; Schmidhuber, J. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. In Proceedings of the 23rd International Conference on Machine Learning (ICML ’06), Pittsburgh, PA, USA, 25–29 June 2006; pp. 369–376. [Google Scholar]
- Graves, A.; Liwicki, M.; Fernández, S.; Bertolami, R.; Bunke, H.; Schmidhuber, J. A Novel Connectionist System for Unconstrained Handwriting Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2009, 31, 855–868. [Google Scholar] [CrossRef]
- Guo, D.; Zhou, W.; Li, H.; Wang, M. Hierarchical LSTM for Sign Language Translation. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; pp. 6845–6852. [Google Scholar]
- Rahman, M.M.; Watanobe, Y.; Nakamura, K. A Bidirectional LSTM Language Model for Code Evaluation and Repair. Symmetry 2021, 13, 247. [Google Scholar] [CrossRef]
- Hu, W.; Cai, M.; Chen, K.; Ding, H.; Sun, L.; Liang, S.; Mo, X.; Huo, Q. Sequence Discriminative Training for Offline Handwriting Recognition by an Interpolated CTC and Lattice-Free MMI Objective Function. In Proceedings of the International Conference on Document Analysis and Recognition, ICDAR, Kyoto, Japan, 9–15 November 2017; Volume 1, pp. 61–66. [Google Scholar]
- Yoshimura, T.; Hayashi, T.; Takeda, K.; Watanabe, S. End-to-End Automatic Speech Recognition Integrated with CTC-Based Voice Activity Detection. In Proceedings of the ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing, Barcelona, Spain, 4–8 May 2020; pp. 6999–7003. [Google Scholar]
- Guo, D.; Wang, S.; Tian, Q.; Wang, M. Dense Temporal Convolution Network for Sign Language Translation. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19), Macao, China, 10–16 August 2019; pp. 744–750. [Google Scholar]
- Wang, S.; Guo, D.; Zhou, W.; Zha, Z.; Wang, M. Connectionist Temporal Fusion for Sign Language Translation. In Proceedings of the 26th ACM International Conference on Multimedia, Seoul, Korea, 22–26 October 2018; pp. 1483–1491. [Google Scholar]
- Yang, Z.; Shi, Z. SF-Net: Structured Feature Network for Continuous Sign Language Recognition. arXiv 2019, arXiv:1908.01341. [Google Scholar]
- Cheng, K.L.; Yang, Z.; Chen, Q.; Tai, Y. Fully Convolutional Networks For Continuous Sign Language Recognition. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 697–714. [Google Scholar]
- Koller, O.; Ney, H.; Bowden, R. Deep Hand: How to Train a CNN on 1 Million Hand Images When Your Data is Continuous and Weakly Labelled. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 3793–3802. [Google Scholar]
- Slimane, F.B.; Bouguessa, M. Context Matters: Self-Attention for Sign Language Recognition. arXiv 2021, arXiv:2101.04632. [Google Scholar]
- Niu, Z.; Mak, B. Stochastic Fine-grained Labeling of Multi-state Sign Glosses for Continuous Sign Language Recognition. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 1–16. [Google Scholar]
- Cui, R.; Liu, H.; Zhang, C. A Deep Neural Framework for Continuous Sign Language Recognition by Iterative Training. IEEE Trans. Multimed. 2019, 21, 1880–1891. [Google Scholar] [CrossRef]
- Pu, J.; Zhou, W.; Hu, H.; Li, H. Boosting Continuous Sign Language Recognition via Cross Modality Augmentation. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 1497–1505. [Google Scholar]
Input | Dev WER (%) | Test WER (%)
---|---|---
RGB | 25.4 | 26.6
Keypoint | 37.0 | 36.7
RGB + Keypoint | 24.0 | 24.3
Configuration | Dev WER (%) | Test WER (%)
---|---|---
Spatial + Early Temporal + Late Temporal Attention | 45.2 | 45.2
Early Temporal + Late Temporal Attention | 43.9 | 43.6
Spatial Attention | 24.7 | 24.3
Early Temporal Attention | 23.8 | 23.9
Late Temporal Attention | 20.5 | 21.5
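For intuition, one simple form a temporal attention layer can take is sketched below: a learned scalar score per time step, softmax-normalized over time and used to reweight the features. It illustrates the mechanism only and is not the authors' exact module; placing it before or after the stacked temporal convolutions corresponds to the early and late configurations compared above.

```python
# Minimal temporal attention: weight each time step by a learned, normalized score.
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # one scalar score per time step

    def forward(self, x):                        # x: (B, T, dim)
        w = torch.softmax(self.score(x), dim=1)  # (B, T, 1), sums to 1 over T
        return x * w                             # emphasize informative time points
```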
Configuration | Dev WER (%) | Test WER (%)
---|---|---
Late Temporal Attention without pooling | 30.4 | 31.2
Late Temporal Attention with pooling | 20.5 | 21.5
Configuration | Dev WER (%) | Test WER (%)
---|---|---
Sequence Learning using LSTM | 30.4 | 31.2
Sequence Learning using Bi-LSTM | 20.5 | 21.5
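Below is a hedged sketch of the better-performing variant: a Bi-LSTM over the downsampled temporal features, followed by a linear gloss classifier trained with the CTC loss (Graves et al. [42]) so that unsegmented sentence-level gloss labels can supervise per-step outputs. All dimensions are illustrative assumptions.

```python
# Bi-LSTM sequence learning head with a CTC objective (sketch, assumed sizes).
import torch
import torch.nn as nn

hidden, vocab = 256, 1296                 # vocab includes the CTC blank at index 0
bilstm = nn.LSTM(512, hidden, num_layers=2, bidirectional=True, batch_first=True)
head = nn.Linear(2 * hidden, vocab)
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

feats = torch.randn(4, 50, 512)           # (batch, downsampled steps T/4, feature dim)
out, _ = bilstm(feats)                    # (4, 50, 2*hidden)
log_probs = head(out).log_softmax(-1).transpose(0, 1)  # (T, B, vocab), as CTC expects
targets = torch.randint(1, vocab, (4, 12))             # gloss labels, no blanks
input_lens = torch.full((4,), 50, dtype=torch.long)
target_lens = torch.full((4,), 12, dtype=torch.long)
loss = ctc(log_probs, targets, input_lens, target_lens)
```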
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Aditya, W.; Shih, T.K.; Thaipisutikul, T.; Fitriajie, A.S.; Gochoo, M.; Utaminingrum, F.; Lin, C.-Y. Novel Spatio-Temporal Continuous Sign Language Recognition Using an Attentive Multi-Feature Network. Sensors 2022, 22, 6452. https://doi.org/10.3390/s22176452