Local–Global Transformer Neural Network for temporal action segmentation

  • Regular Paper
  • Published in Multimedia Systems

Abstract

Temporal action segmentation is a branch of video understanding that aims to predict what is happening in each action segment (a run of consecutive frames sharing the same label) of an untrimmed video. Recent works have harnessed the Transformer, which is capable of modeling temporal relations in long sequences. However, Transformer-based networks face several limitations when processing video sequences: (1) abrupt changes between neighboring action segments, (2) the trade-off between losing fine-grained information in deeper layers and learning inefficiently with small receptive fields, and (3) the lack of a refinement process to improve performance. This paper proposes a novel network, the Local–Global Transformer Neural Network (LGTNN), to address these difficulties. LGTNN comprises three main modules. The first two are the Local and Global Transformer modules, which efficiently capture multiscale features and resolve the trade-off between perceiving higher- and lower-level representations at different convolutional layer depths. The third module, the Boundary Detection Network (BDN), performs a postprocessing step that refines ambiguous action boundaries and generates the final prediction. The proposed model can also be embedded in existing temporal action segmentation models such as MS-TCN, ASFormer, and ETSN. Experiments on three challenging datasets (50Salads, Georgia Tech Egocentric Activities (GTEA), and Breakfast), with LGTNN used both on its own and embedded in existing segmentation models, verify that it outperforms state-of-the-art methods by a large margin.
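
To make the local/global design concrete, the sketch below (PyTorch) shows one way a frame-wise feature sequence can be passed through a window-masked local attention branch and a full-sequence global attention branch, then fused by a residual sum. This is a minimal illustration of the idea, not the authors' implementation; the class name, window size, feature dimension, and fusion strategy are all assumptions.

```python
# Minimal local/global attention sketch for frame-wise video features.
# Illustrative only: window size, dimensions, and fusion are assumptions,
# not the LGTNN architecture as published.
import torch
import torch.nn as nn


class LocalGlobalBlock(nn.Module):
    def __init__(self, dim=64, heads=4, window=32):
        super().__init__()
        self.window = window
        # Local branch: attention restricted to a temporal window,
        # preserving fine-grained detail near segment boundaries.
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Global branch: attention over the full sequence, capturing
        # long-range relations between distant action segments.
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):  # x: (batch, T, dim) frame-wise features
        T = x.shape[1]
        # Band mask: frame i may attend only to frames within +/- window.
        idx = torch.arange(T, device=x.device)
        blocked = (idx[None, :] - idx[:, None]).abs() > self.window  # (T, T)
        local, _ = self.local_attn(x, x, x, attn_mask=blocked)
        glob, _ = self.global_attn(x, x, x)
        # Fuse the local and global views with a residual sum.
        return self.norm(x + local + glob)


# Usage: a 1000-frame clip with 64-d projected features.
feats = torch.randn(1, 1000, 64)
print(LocalGlobalBlock()(feats).shape)  # torch.Size([1, 1000, 64])
```

In the full model, the Boundary Detection Network then post-processes the frame-wise predictions to sharpen ambiguous segment boundaries; the sketch above covers only the attention stage.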



Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant Number: 51935005), the Basic Scientific Research Project (Grant Number: JCKY20200603C010), the Natural Science Foundation of Heilongjiang Province of China (Grant Number: LH2021F023), and the Science & Technology Planned Project of Heilongjiang Province of China (Grant Number: GA21C031).

Author information


Contributions

XT: Conceptualization, Methodology, Software, Data curation, Writing—original draft. YJ: Supervision, Writing—review & editing. XT: Supervision.

Corresponding author

Correspondence to Ye Jin.

Ethics declarations

Conflict of interest

We declare that we have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this manuscript.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Tian, X., Jin, Y. & Tang, X. Local–Global Transformer Neural Network for temporal action segmentation. Multimedia Systems 29, 615–626 (2023). https://doi.org/10.1007/s00530-022-00998-4

