Fusion Object Detection and Action Recognition to Predict Violent Action
Figure 1. Architecture of the proposed solution.
Figure 2. Two main development pipelines are presented: (1) lost items and aggressive objects; (2) violent action, both focused on the in-vehicle environment. For each pipeline, three identical steps are carried out: (1) a toolchain for data generation is implemented (Section 3.2 and Section 3.3); (2) models are selected (Section 4.1.1 and Section 4.1.2); (3) the models are evaluated (Section 5.1.1, Section 5.1.2 and Section 5.1.3).
Figure 3. Annotation in the MOLA data format.
Figure 4. From top left to bottom right: RGB, depth, point cloud, thermal, NVS, and events grayscale.
Figure 5. Laboratory car testbed from different perspectives.
Figure 6. Dataset-to-JSON example.
Figure 7. JSONs-to-MOLA example.
Figure 8. MOLA JSON to YOLOv5.
Figure 9. Pipeline for inference on the NVIDIA AGX Xavier.
Figure 10. Visualization of the 76 classes of the dataset. (a) Number of annotations per class. (b) Location and size of each bounding box. (c) Statistical distribution of the bounding box positions. (d) Statistical distribution of the bounding box sizes.
Figure 11. Visualization of the 3 most labeled classes of the dataset. (a) Number of annotations per class. (b) Location and size of each bounding box. (c) Statistical distribution of the bounding box positions. (d) Statistical distribution of the bounding box sizes.
Figure 12. Confusion matrix of the knife, weapon, and bat classes.
Figure 13. Visualization of the lost-item classes of the dataset. (a) Number of annotations per class. (b) Location and size of each bounding box. (c) Statistical distribution of the bounding box positions. (d) Statistical distribution of the bounding box sizes.
Figure 14. Confusion matrix for lost items.
Figure 15. Examples of results obtained from real-time inference. (a) Lost items detector. (b) Aggressive objects detector. (c) Violent action detection.
Abstract
1. Introduction
2. Related Work
3. Dataset Preparation
3.1. MOLAnnotate Framework
3.1.1. Unified Format
3.1.2. Annotation Pipeline
3.2. Violent Action
3.3. Aggressive Objects
4. Algorithmic Analysis
4.1. Methods
4.1.1. Violent Action
4.1.2. Aggressive Objects
4.1.3. Embedded System
5. Results
5.1. Experimental Evaluation
5.1.1. Violent Action
5.1.2. Aggressive Objects
5.1.3. Lost Item Objects
5.1.4. Embedded System
6. Discussion
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
| Process | Algorithm | Description |
|---|---|---|
| Merge | mergedatasets.ipynb | Merge different datasets |
| | fixclasses.ipynb | Find and fix duplicate classes |
| | cleanclasses.ipynb | Remove classes with missing annotations and images |
| | cleanimages.ipynb | Remove images missing from the dataset folder |
| Fusion | mixclasses.ipynb | Mix/fuse classes into other classes and reorder class IDs |
| Split | splitbyannotations.ipynb | Split by annotations into test, val, and train |
| | splitbyimages.ipynb | Split by images into test, val, and train |
| | reorderids.ipynb | Reorder class IDs |
| Check | checkmissings.ipynb | Check for missing images, videos, and annotations |
| Export | json2yolo.ipynb | Export the dataset to YOLO format |
| | json2mmaction2.ipynb | Export the dataset to MMAction2 format |
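The Export step is where the unified MOLA JSON is written out for the downstream detectors (Figure 8). As a rough illustration of what json2yolo.ipynb does, the sketch below converts a COCO-style JSON annotation file into per-image YOLO label files; the field names, bounding-box convention ([x, y, w, h] in absolute pixels), and file paths are assumptions, not the exact MOLA schema.

```python
# Minimal sketch of a COCO-style JSON -> YOLO label export, similar in spirit to
# json2yolo.ipynb. Assumes a COCO-like layout ("images", "annotations",
# "categories", bbox = [x, y, w, h] in absolute pixels); names and paths are
# illustrative, not the exact MOLA schema.
import json
from collections import defaultdict
from pathlib import Path

def json_to_yolo(json_path: str, out_dir: str) -> None:
    data = json.loads(Path(json_path).read_text())
    images = {img["id"]: img for img in data["images"]}
    # YOLO expects contiguous, 0-based class ids.
    cat_to_idx = {c["id"]: i for i, c in enumerate(sorted(data["categories"], key=lambda c: c["id"]))}

    labels = defaultdict(list)
    for ann in data["annotations"]:
        img = images[ann["image_id"]]
        x, y, w, h = ann["bbox"]
        # Convert top-left corner + size to normalized center + size.
        xc = (x + w / 2) / img["width"]
        yc = (y + h / 2) / img["height"]
        labels[img["file_name"]].append(
            f'{cat_to_idx[ann["category_id"]]} {xc:.6f} {yc:.6f} '
            f'{w / img["width"]:.6f} {h / img["height"]:.6f}'
        )

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for file_name, lines in labels.items():
        # One .txt label file per image, as YOLOv5 expects.
        (out / Path(file_name).with_suffix(".txt").name).write_text("\n".join(lines))

json_to_yolo("mola_train.json", "labels/train")  # hypothetical file names
```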
| Experiment | Method | Top-1 Accuracy | Mean Class Accuracy |
|---|---|---|---|
| E1 | I3D | 0.7673 | 0.7737 |
| E2 | R(2+1)D | 0.7737 | 0.7737 |
| E3 | SlowFast | 0.5524 | 0.5546 |
| E4 | TSN | 0.9540 | 0.9544 |
| E5 | TSM | 0.9437 | 0.9432 |
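The two metrics reported above follow their standard definitions: top-1 accuracy is the fraction of clips whose highest-scoring class matches the ground truth, and mean class accuracy averages the per-class recall so that under-represented classes weigh equally. A minimal NumPy sketch (not the authors' evaluation code):

```python
# Illustrative computation of the two metrics in the table above: top-1 accuracy
# and mean class accuracy (per-class recall averaged over classes).
import numpy as np

def top1_accuracy(scores: np.ndarray, labels: np.ndarray) -> float:
    return float((scores.argmax(axis=1) == labels).mean())

def mean_class_accuracy(scores: np.ndarray, labels: np.ndarray) -> float:
    preds = scores.argmax(axis=1)
    per_class = [(preds[labels == c] == c).mean()   # recall of class c
                 for c in np.unique(labels)]
    return float(np.mean(per_class))

# Toy example: 4 clips, 2 classes (0 = non-violent, 1 = violent).
scores = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]])
labels = np.array([0, 1, 1, 1])
print(top1_accuracy(scores, labels), mean_class_accuracy(scores, labels))
```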
| Feature | Experiment | Top-1 Accuracy | Mean Class Accuracy | Epoch |
|---|---|---|---|---|
| thermal | E4 | 0.9519 | 0.9452 | 20 |
| thermal | E5 | 0.9500 | 0.9426 | 85 |
| thermal | E5 (no pre-train) | 0.8731 | 0.8402 | 75 |
| thermal | E4 (no pre-train) | 0.8846 | 0.8572 | 75 |
| rgb | E4 | 0.9422 | 0.9263 | 5 |
| rgb | E5 | 0.9827 | 0.9852 | 10 |
| rgb | E5 (no pre-train) | 0.9730 | 0.9788 | 40 |
| rgb | E4 (no pre-train) | 0.9634 | 0.9667 | 65 |
| nvs | E4 | 0.8990 | 0.8891 | 35 |
| nvs | E5 | 0.9732 | 0.9739 | 75 |
| nvs | E5 (no pre-train) | 0.8804 | 0.8721 | 75 |
| nvs | E4 (no pre-train) | 0.7216 | 0.6472 | 10 |
| opticalflow | E4 | 0.9692 | 0.9633 | 50 |
| opticalflow | E5 | 0.9885 | 0.9909 | 75 |
| opticalflow | E5 (no pre-train) | 0.8962 | 0.8730 | 65 |
| opticalflow | E4 (no pre-train) | 0.9077 | 0.8878 | 85 |
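The optical-flow modality gives the strongest results in the table above. The paper does not state which flow algorithm was used to build that modality, so the sketch below uses OpenCV's Farnebäck dense flow purely as one common way to turn an RGB clip into flow frames; the output encoding (direction as hue, magnitude as value) and file naming are illustrative.

```python
# Sketch of extracting dense optical-flow frames from an RGB clip, one common way
# to build an "opticalflow" modality. Farneback (OpenCV) is used here only as an
# example; the paper does not specify its flow algorithm.
import cv2
import numpy as np

def extract_flow(video_path: str, out_pattern: str = "flow_{:05d}.png") -> None:
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        # Encode flow direction as hue and magnitude as value, then save as an image.
        mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
        hsv = np.zeros_like(frame)
        hsv[..., 0] = ang * 180 / np.pi / 2
        hsv[..., 1] = 255
        hsv[..., 2] = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX)
        cv2.imwrite(out_pattern.format(idx), cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR))
        prev_gray, idx = gray, idx + 1
    cap.release()
```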
| Experiment | Precision | Recall | mAP@0.5 | mAP@0.5:0.95 |
|---|---|---|---|---|
| E1 | 0.126 | 0.225 | 0.0687 | 0.0331 |
| E2 | 0.126 | 0.190 | 0.0597 | 0.0278 |
| E3 | 0.143 | 0.262 | 0.0770 | 0.0375 |
| E4 | 0.259 | 0.267 | 0.1410 | 0.0627 |
| E5 | 0.546 | 0.662 | 0.5790 | 0.3810 |
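All of the detector metrics above hinge on intersection over union (IoU) between predicted and ground-truth boxes: a prediction counts as a true positive for mAP@0.5 when its IoU with an unmatched ground-truth box of the same class reaches 0.5, while mAP@0.5:0.95 averages the AP over thresholds from 0.5 to 0.95 in steps of 0.05. A minimal IoU sketch with boxes in [x1, y1, x2, y2] form:

```python
# Minimal IoU computation underlying the precision/recall/mAP columns above.
def iou(box_a: list[float], box_b: list[float]) -> float:
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

# A predicted knife box overlapping a ground-truth box:
print(iou([10, 10, 60, 60], [20, 20, 70, 70]))  # ~0.47 -> a miss at the 0.5 threshold
```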
| Class | Labels (Test Split) | Precision | Recall | mAP@0.5 | mAP@0.5:0.95 |
|---|---|---|---|---|---|
| all | 1469 | 0.479 | 0.622 | 0.481 | 0.314 |
| knife | 1015 | 0.264 | 0.337 | 0.178 | 0.107 |
| weapon | 43 | 0.735 | 0.907 | 0.840 | 0.604 |
| bat | 411 | 0.439 | 0.623 | 0.426 | 0.232 |
| Class | Labels (Test Split) | Precision | Recall | mAP@0.5 | mAP@0.5:0.95 |
|---|---|---|---|---|---|
| all | 64,509 | 0.238 | 0.468 | 0.230 | 0.161 |
| person | 50,906 | 0.245 | 0.707 | 0.285 | 0.214 |
| knife | 1015 | 0.178 | 0.274 | 0.138 | 0.0929 |
| weapon | 43 | 0.734 | 0.900 | 0.773 | 0.525 |
| bat | 411 | 0.124 | 0.453 | 0.118 | 0.0725 |
| bag | 2332 | 0.0961 | 0.237 | 0.083 | 0.0549 |
| book | 2733 | 0.199 | 0.141 | 0.113 | 0.0714 |
| phone | 881 | 0.186 | 0.460 | 0.193 | 0.133 |
| laptop | 551 | 0.177 | 0.682 | 0.229 | 0.199 |
| mouse | 242 | 0.205 | 0.657 | 0.196 | 0.152 |
| keyboard | 306 | 0.215 | 0.497 | 0.186 | 0.139 |
| bottle | 2738 | 0.159 | 0.389 | 0.161 | 0.116 |
| cable | 243 | 0.616 | 0.646 | 0.617 | 0.388 |
| banana | 959 | 0.110 | 0.234 | 0.0773 | 0.0472 |
| apple | 491 | 0.191 | 0.483 | 0.184 | 0.134 |
| orange | 658 | 0.132 | 0.266 | 0.0959 | 0.071 |
| Model | Num. Classes | Inference Time (ms) |
|---|---|---|
| Yolo_S | 3 | 120 |
| Yolo_M | 3 | 185 |
| Yolo_L | 3 | 300 |
| Yolo_X | 3 | 580 |
| Model | Num. Classes | Inference Time (ms) |
|---|---|---|
| Yolo_S | 1298 | 175 |
| Yolo_S | 15 | 130 |
| Yolo_M | 15 | 270 |
| Yolo_L | 15 | 330 |
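Per-frame latencies such as those above are usually measured with a warm-up phase followed by an averaged timing loop. The sketch below shows that pattern for a pretrained small YOLOv5 loaded through torch.hub on a dummy 640 × 640 frame; it is illustrative only and not the embedded pipeline of Figure 9.

```python
# Sketch of measuring per-frame inference latency (as in the tables above):
# warm-up iterations, then an averaged timed loop. Loading YOLOv5 via torch.hub
# is illustrative; the paper's AGX Xavier pipeline (Figure 9) is a separate deployment.
import time
import numpy as np
import torch

def mean_latency_ms(model, img, warmup: int = 10, iters: int = 100) -> float:
    with torch.no_grad():
        for _ in range(warmup):              # warm-up: let caches/cuDNN autotuning settle
            model(img)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model(img)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        return (time.perf_counter() - start) * 1000 / iters

model = torch.hub.load("ultralytics/yolov5", "yolov5s")   # pretrained small variant
frame = np.zeros((640, 640, 3), dtype=np.uint8)           # dummy 640x640 frame
print(f"{mean_latency_ms(model, frame):.1f} ms per frame")
```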
| Model | FP32 (s) | FP16 (s) | INT8 (s) |
|---|---|---|---|
| Yolo_S | 0.024021 | 0.015262 | 0.010740 |
| Yolo_S | 0.023381 | 0.013463 | 0.010827 |
| Yolo_S | 0.022659 | 0.014394 | 0.012740 |
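The FP16 and INT8 columns come from reduced-precision engines; INT8 additionally requires a calibrated TensorRT engine, which is beyond a short sketch. The FP32-versus-FP16 gap, however, can be approximated in plain PyTorch by casting the model and input to half precision on a CUDA device, as below (an assumption-laden illustration, not the authors' TensorRT deployment).

```python
# Rough illustration of the FP32 vs. FP16 gap in the table above using plain PyTorch
# half precision on a CUDA device. Not the authors' TensorRT setup; the INT8 column
# would additionally need a calibrated TensorRT engine.
import time
import torch

def timed_s(model, x, iters: int = 100) -> float:
    with torch.no_grad():
        for _ in range(10):                  # warm-up
            model(x)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        torch.cuda.synchronize()
        return (time.perf_counter() - start) / iters

# autoshape=False returns the raw network, which takes a normalized NCHW tensor.
model = torch.hub.load("ultralytics/yolov5", "yolov5s", autoshape=False).to("cuda")
x = torch.rand(1, 3, 640, 640, device="cuda")

fp32 = timed_s(model, x)
fp16 = timed_s(model.half(), x.half())       # cast weights and input to FP16
print(f"FP32 {fp32:.4f} s  |  FP16 {fp16:.4f} s per image")
```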
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Citation: Rodrigues, N.R.P.; da Costa, N.M.C.; Melo, C.; Abbasi, A.; Fonseca, J.C.; Cardoso, P.; Borges, J. Fusion Object Detection and Action Recognition to Predict Violent Action. Sensors 2023, 23, 5610. https://doi.org/10.3390/s23125610