
Contactless interaction recognition and interactor detection in multi-person scenes

  • Research Article
  • Published in: Frontiers of Computer Science

Abstract

Human interaction recognition is an essential task in video surveillance. Current work on human interaction recognition mainly focuses on scenarios that contain only close-contact interactive subjects, with no other people present. In this paper, we handle more practical but more challenging scenarios in which the interactive subjects are contactless and other subjects not involved in the interactions of interest are also present in the scene. To address this problem, we propose an Interactive Relation Embedding Network (IRE-Net) that simultaneously identifies the subjects involved in an interaction and recognizes their interaction category. For this new problem, we also build a new dataset with annotations and metrics for performance evaluation. Experimental results on this dataset show that the proposed method significantly outperforms existing methods developed for human interaction recognition and group activity recognition.
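
The page contains no implementation details beyond the abstract. As a rough illustration of the joint task it describes, deciding which people in a multi-person scene are the interactors and classifying their interaction, the minimal sketch below uses a two-head model over per-person features. This is not the authors' IRE-Net; the module names, feature dimensions, and confidence-weighted pooling are illustrative assumptions only.

```python
# Minimal sketch of the joint task formulation described in the abstract:
# identify which people in a multi-person scene are interactors, and
# classify their interaction category. This is NOT the authors' IRE-Net;
# all module names, feature dimensions, and the pooling strategy are
# illustrative assumptions.
import torch
import torch.nn as nn


class JointInteractionModel(nn.Module):
    def __init__(self, feat_dim: int = 512, num_classes: int = 10):
        super().__init__()
        # Scores each person as interactor vs. bystander (binary logit).
        self.interactor_head = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, 1)
        )
        # Classifies the interaction category from pooled interactor features.
        self.interaction_head = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, num_classes)
        )

    def forward(self, person_feats: torch.Tensor):
        # person_feats: (N, feat_dim), one feature vector per detected person.
        interactor_logits = self.interactor_head(person_feats).squeeze(-1)  # (N,)
        # Soft attention over people, weighted by interactor confidence.
        weights = torch.sigmoid(interactor_logits).unsqueeze(-1)            # (N, 1)
        pooled = (weights * person_feats).sum(0) / weights.sum().clamp(min=1e-6)
        interaction_logits = self.interaction_head(pooled)                  # (num_classes,)
        return interactor_logits, interaction_logits


if __name__ == "__main__":
    model = JointInteractionModel()
    feats = torch.randn(6, 512)    # e.g., six detected people in the scene
    who, what = model(feats)
    print(who.shape, what.shape)   # torch.Size([6]) torch.Size([10])
```

In practice the two heads would be trained jointly (e.g., a binary loss on the interactor scores plus a categorical loss on the interaction logits), and evaluation would need to credit a prediction only when both the interactor set and the interaction category are correct; the paper introduces its own annotations and metrics for this purpose.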

Acknowledgements

This work was supported by the National Natural Science Foundation of China (NSFC) (Grant Nos. 62072334, U1803264).

Author information

Corresponding author

Correspondence to Ruize Han.

Ethics declarations

Competing interests: The authors declare that they have no competing interests or financial conflicts to disclose.

Additional information

Jiacheng Li received the BS degree in computer science and technology from Beijing University of Chemical Technology, China in 2019, and the ME degree in computer science and technology from Tianjin University, China in 2022. His major research interest is visual intelligence, specifically multi-object interaction and social relation discovery.

Ruize Han received the BS degree in mathematics and applied mathematics from Hebei University of Technology, China in 2016, and the ME and PhD degrees in computer science and technology from Tianjin University, China in 2019 and 2023, respectively. His major research interest is visual intelligence, specifically multi-camera video collaborative analysis and multi-human activity understanding. He is also interested in solving preventive conservation problems of cultural heritage via artificial intelligence.

Wei Feng received the PhD degree in computer science from City University of Hong Kong, China in 2008. From 2008 to 2010, he was a research fellow at the Chinese University of Hong Kong, China and City University of Hong Kong, China. He is now a Professor at the School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, China. His major research interests are active robotic vision and visual intelligence. Recently, he has focused on solving preventive conservation problems of cultural heritage via computer vision and machine learning. He is an Associate Editor of Neurocomputing and the Journal of Ambient Intelligence and Humanized Computing.

Haomin Yan received the BE degree from the School of Electrical and Information Engineering and the ME degree in computer technology from Tianjin University, China in 2020 and 2023, respectively. His research interests focus on multi-human action analysis, especially weakly supervised individual action detection and social group activity detection.

Song Wang received the PhD degree in electrical and computer engineering from the University of Illinois at Urbana-Champaign (UIUC), USA in 2002. He was a Research Assistant with the Image Formation and Processing Group, Beckman Institute, UIUC, USA from 1998 to 2002. In 2002, he joined the Department of Computer Science and Engineering, University of South Carolina, USA, where he is currently a Professor. His current research interests include computer vision, image processing, and machine learning. Dr. Wang is currently serving as the Publicity/Web Portal Chair of the Technical Committee on Pattern Analysis and Machine Intelligence of the IEEE Computer Society, and as an Associate Editor of IEEE Transactions on Pattern Analysis and Machine Intelligence, IEEE Transactions on Multimedia, and Pattern Recognition Letters. He is a Senior Member of the IEEE and a member of the IEEE Computer Society.

About this article

Cite this article

Li, J., Han, R., Feng, W. et al. Contactless interaction recognition and interactor detection in multi-person scenes. Front. Comput. Sci. 18, 185325 (2024). https://doi.org/10.1007/s11704-023-2418-0
