Local–Global Transformer Neural Network for temporal action segmentation

  • Regular Paper
  • Published in Multimedia Systems

Abstract

Temporal action segmentation is a branch of video understanding that aims to predict what is happening in each action segment (a run of consecutive frames sharing the same label) of an untrimmed video. Recent works have harnessed the Transformer, which is capable of modeling temporal relations in long sequences. However, Transformer-based networks face several limitations when processing video sequences: (1) abrupt changes between neighboring action segments, (2) the trade-off between losing fine-grained information in deeper layers and learning inefficiently with small receptive fields, and (3) the lack of a refinement process to improve performance. This paper proposes a novel network, the Local–Global Transformer Neural Network (LGTNN), to address these difficulties. LGTNN comprises three main modules. The first two are the Local and Global Transformer modules, which efficiently capture multiscale features and resolve the trade-off between perceiving higher- and lower-level representations at different convolutional layer depths. The third module, the Boundary Detection Network (BDN), performs a postprocessing step that refines ambiguous action boundaries and generates the final prediction. The proposed model can also be embedded in existing temporal action segmentation models such as MS-TCN, ASFormer, and ETSN. Experiments on three challenging datasets (50Salads, Georgia Tech Egocentric Activities (GTEA), and Breakfast), with LGTNN used both on its own and embedded in existing segmentation models, verify that it outperforms state-of-the-art methods by a large margin.
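
To make the local/global design concrete, the sketch below (PyTorch) shows one way a frame-wise feature sequence can be passed through a window-masked local attention branch and a full-sequence global attention branch, then fused by a residual sum. This is a minimal illustration of the idea, not the authors' implementation; the class name, window size, feature dimension, and fusion strategy are all assumptions.

```python
# Minimal local/global attention sketch for frame-wise video features.
# Illustrative only: window size, dimensions, and fusion are assumptions,
# not the LGTNN architecture as published.
import torch
import torch.nn as nn


class LocalGlobalBlock(nn.Module):
    def __init__(self, dim=64, heads=4, window=32):
        super().__init__()
        self.window = window
        # Local branch: attention restricted to a temporal window,
        # preserving fine-grained detail near segment boundaries.
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Global branch: attention over the full sequence, capturing
        # long-range relations between distant action segments.
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):  # x: (batch, T, dim) frame-wise features
        T = x.shape[1]
        # Band mask: frame i may attend only to frames within +/- window.
        idx = torch.arange(T, device=x.device)
        blocked = (idx[None, :] - idx[:, None]).abs() > self.window  # (T, T)
        local, _ = self.local_attn(x, x, x, attn_mask=blocked)
        glob, _ = self.global_attn(x, x, x)
        # Fuse the local and global views with a residual sum.
        return self.norm(x + local + glob)


# Usage: a 1000-frame clip with 64-d projected features.
feats = torch.randn(1, 1000, 64)
print(LocalGlobalBlock()(feats).shape)  # torch.Size([1, 1000, 64])
```

In the full model, the Boundary Detection Network then post-processes the frame-wise predictions to sharpen ambiguous segment boundaries; the sketch above covers only the attention stage.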



Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant Number: 51935005), the Basic Scientific Research Project (Grant Number: JCKY20200603C010), the Natural Science Foundation of Heilongjiang Province of China (Grant Number: LH2021F023), and the Science & Technology Planned Project of Heilongjiang Province of China (Grant Number: GA21C031).

Author information


Contributions

XT: Conceptualization, Methodology, Software, Data curation, Writing—original draft. YJ: Supervision, Writing—review & editing. XT: Supervision.

Corresponding author

Correspondence to Ye Jin.

Ethics declarations

Conflict of interest

We declare that we have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this manuscript.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Tian, X., Jin, Y. & Tang, X. Local–Global Transformer Neural Network for temporal action segmentation. Multimedia Systems 29, 615–626 (2023). https://doi.org/10.1007/s00530-022-00998-4

