Abstract
We present baseline results for a new task of automatic segmentation of Sign Language video into sentence-like units. We use a corpus of natural Sign Language video with accurately aligned subtitles to train a spatio-temporal graph convolutional network with a BiLSTM on 2D skeleton data to automatically detect the temporal boundaries of subtitles. In doing so, we segment Sign Language video into subtitle-units that can be translated into phrases in a written language. We achieve a ROC-AUC statistic of 0.87 at the frame level and 92% label accuracy within a time margin of 0.6s of the true labels.
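The two reported figures can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function names, the toy data and the reading of "within a time margin" (a predicted frame label counts as correct if the same true label occurs within a window of neighbouring frames) are all assumptions made for illustration. The frame rate needed to turn 0.6s into a number of frames is also not stated here, so `margin_frames` is passed in directly.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Frame-level ROC-AUC via the rank-sum (Mann-Whitney U) formulation."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Group tied scores and give each of them the average 1-based rank.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("ROC-AUC needs both classes to be present")
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def accuracy_within_margin(
    labels: Sequence[int], preds: Sequence[int], margin_frames: int
) -> float:
    """Share of frames whose predicted label matches a true label
    found within +/- margin_frames (an assumed reading of the metric)."""
    hits = 0
    for t, p in enumerate(preds):
        lo = max(0, t - margin_frames)
        hi = min(len(labels), t + margin_frames + 1)
        if p in labels[lo:hi]:
            hits += 1
    return hits / len(preds)


# Toy data (hypothetical): 1 marks a frame inside a subtitle-unit, 0 a gap.
labels = [0, 0, 1, 1, 0, 0, 1, 1]
scores = [0.1, 0.6, 0.7, 0.9, 0.4, 0.2, 0.3, 0.8]
preds = [int(s >= 0.5) for s in scores]  # threshold chosen arbitrarily

auc = roc_auc(labels, scores)
strict_acc = accuracy_within_margin(labels, preds, 0)
relaxed_acc = accuracy_within_margin(labels, preds, 1)
```

With a margin of zero frames the second function reduces to ordinary frame-level accuracy. Widening the margin forgives predictions that place a boundary a few frames early or late, which is why a tolerance-based score can be higher than the strict one.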
Notes
- 1. There are rare video segments where a hearing person is interviewed and this interview is translated into SL.
- 2.
Acknowledgements
This work has been partially funded by the ROSETTA project, financed by the French Public Investment Bank (Bpifrance). Additionally, we thank Média-Pi ! for providing the data and for the useful discussions on this idea.
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Bull, H., Gouiffès, M., Braffort, A. (2020). Automatic Segmentation of Sign Language into Subtitle-Units. In: Bartoli, A., Fusiello, A. (eds) Computer Vision – ECCV 2020 Workshops. ECCV 2020. Lecture Notes in Computer Science, vol 12536. Springer, Cham. https://doi.org/10.1007/978-3-030-66096-3_14
DOI: https://doi.org/10.1007/978-3-030-66096-3_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-66095-6
Online ISBN: 978-3-030-66096-3
eBook Packages: Computer Science, Computer Science (R0)