EMO-MoviNet: Enhancing Action Recognition in Videos with EvoNorm, Mish Activation, and Optimal Frame Selection for Efficient Mobile Deployment
Figure 1. Optimal frame selection using contextual information from the optical-flow stream. Some parts of a video are more discriminative than others: frame 1 carries more information than frames 2 and 3, as the histograms show. The green histogram is the most stable because its pixel variance is lower than that of the other frames.
Figure 2. FLOPs vs. accuracy on UCF101 (right) and HMDB51 (left); the red diamond marks our proposed model. On HMDB51, our model outperforms all compared methods in both accuracy and computational complexity by a sufficient margin, whereas on UCF101 it uses fewer FLOPs and parameters at the cost of 12% accuracy.
Figure 3. MoviNet-A2 with a causal stream buffer. Each block layer contains many inner convolutional operations, each equivalent to a single red block (block11, block12, …). A causal operation is applied between non-overlapping sub-clips to expand the receptive field using the stream-buffer concept. In each block layer, we use EvoNorm with the Mish activation function to mitigate the generalization error [36].
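The caption above only names the causal stream-buffer mechanism; the exact implementation is the one in MoViNets [3]. As a rough illustration, the PyTorch sketch below caches the last few frames of each sub-clip and prepends them to the next one before a temporal convolution, so that non-overlapping sub-clips share a receptive field. The class name, kernel size, and zero-initialised first buffer are our assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class CausalConvStream(nn.Module):
    """Causal temporal convolution with a stream buffer (illustrative sketch).

    Frames from earlier sub-clips are cached and prepended along the time axis,
    so consecutive sub-clips share a receptive field without recomputation.
    This is not the authors' MoviNet implementation, only the general idea.
    """
    def __init__(self, channels: int, kernel_t: int = 3):
        super().__init__()
        assert kernel_t >= 2, "a stream buffer only helps for temporal kernels > 1"
        self.kernel_t = kernel_t
        # Temporal-only convolution with no future padding, so the output is causal.
        self.conv = nn.Conv3d(channels, channels, kernel_size=(kernel_t, 1, 1))
        self.register_buffer("stream", torch.zeros(0), persistent=False)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (N, C, T, H, W) tensor for one sub-clip of the video.
        if self.stream.numel() == 0:
            # First sub-clip of a video: pad the "past" with zeros.
            n, c, _, h, w = clip.shape
            self.stream = clip.new_zeros(n, c, self.kernel_t - 1, h, w)
        x = torch.cat([self.stream, clip], dim=2)              # prepend buffered past
        self.stream = x[:, :, -(self.kernel_t - 1):].detach()  # cache for next sub-clip
        return self.conv(x)                                    # (N, C, T, H, W), same T
```

Feeding consecutive sub-clips of one video through the same instance then approximates processing a single long clip while keeping per-step memory bounded.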
Abstract
1. Introduction
2. Related Work
2.1. Computation and Memory Efficient Networks
2.2. Mobile Video Network (MoviNet)
2.2.1. Motion Emulating RGB Stream (MERS)
2.2.2. Motion-Augmented RGB Stream (MARS)
2.3. Optimal Frame Selection
3. Proposed Framework
3.1. Backbone
3.2. Stream Buffers and Causal Convolution
3.3. Generalization Gap
3.4. Optimal Frame Selection Using Optical Flow
4. Datasets and Metrics
Metrics
5. Experiments and Analysis
5.1. Training Details
5.2. Inference
6. Results and Discussion
6.1. Discussion
6.1.1. Adopting Mish Activation Function within EvoNorm: Boosting Network Performance by Replacing Swish
6.1.2. Combining TVL1 and Local Binary Patterns for Precise Key-Frame Identification in Optical Flow
6.2. Limitations
6.3. Results
Model | Accuracy (%) | GFLOPs | Parameters
---|---|---|---
EMO-MoviNet-A0 | 71.39 | 2.71 | 3.1 M
EMO-MoviNet-A0* | 74.83 | 2.73 | 4.6 M
RNxt101-R [52] | 63.8 | 76.68 | 47.6 M
RNxt101-F [52] | 71.2 | 67.88 | 47.6 M
EMO-MoviNet-A1 | 74.38 | 6.022 | 4.6 M
EMO-MoviNet-A1* | 77.42 | 6.06 | 4.6 M
MERS | 71.8 | 6.68 | 47.6 M
MERS-R | 72.9 | 76.88 | 47.6 M
MERS-F | 72.4 | 67.88 | 47.6 M
MERS-R + F | 74.5 | - | 47.6 M
EMO-MoviNet-A2 | 79.36 | 10.3 | 4.8 M
EMO-MoviNet-A2* | 80.6 | 10.4 | 4.8 M
MARS | 72.8 | 76.68 | 47.6 M
MARS-R | 73.1 | 76.88 | 47.6 M
MARS-F | 74.5 | 67.88 | 47.6 M
MARS-R + F | 75 | - | 47.6 M
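To relate the GFLOPs and Parameters columns to a concrete measurement: parameter counts can be read directly from the model, while FLOPs additionally depend on the clip shape fed at inference time and are usually taken from a profiler. A minimal PyTorch sketch follows; the function name is ours and `model` stands for any instantiated network, not the authors' code.

```python
import torch

def count_parameters_m(model: torch.nn.Module) -> float:
    """Trainable parameter count in millions, as reported in the table."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

# GFLOPs are input-dependent (clip length and resolution), so they are typically
# measured with a profiling tool (e.g., fvcore's FlopCountAnalysis) on a dummy
# clip of the evaluation shape rather than computed by hand.
```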
6.3.1. EMO-MoviNet-A2 (20 Classes-UCF101)
6.3.2. EMO-MoviNet (A0–A2)
Model | Pre-Trained | UCF101 | HMDB51
---|---|---|---
IDT [28] | - | 86.4% | 61.7%
C3D + IDT | Sports1M | 90.4% | -
TDD + IDT | - | 91.5% | 65.9%
DIN + IDTD | - | 89.1% | 65.2%
ResNeXt + RGB | Kinetics | 91.7% | 75.5%
C3D | Sports1M | 90.4% | 65.4%
Two-stream [6] | ImageNet | 88.0% | 59.4%
ActionFlowNet | - | 83.9% | 56.4%
EMO-MoviNet-A2 | Kinetics-600 | 89.7% | 79.3%
EMO-MoviNet-A2* | Kinetics-600 | 91.8% | 81.53%
7. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Brezeale, D.; Cook, D.J. Automatic video classification: A survey of the literature. IEEE Trans. Syst. Man Cybern. Part C Appl. Rev. 2008, 38, 416–430. [Google Scholar] [CrossRef]
- Nanni, L.; Ghidoni, S.; Brahnam, S. Handcrafted vs. non-handcrafted features for computer vision classification. Pattern Recognit. 2017, 71, 158–172. [Google Scholar] [CrossRef]
- Kondratyuk, D.; Yuan, L.; Li, Y.; Zhang, L.; Tan, M.; Brown, M.; Gong, B. Movinets: Mobile video networks for efficient video recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 16020–16030. [Google Scholar]
- Qiu, Z.; Yao, T.; Ngo, C.W.; Tian, X.; Mei, T. Learning spatio-temporal representation with local and global diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12056–12065. [Google Scholar]
- Ali, S.; Shah, M. Human action recognition in videos using kinematic features and multiple instance learning. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 32, 288–303. [Google Scholar] [CrossRef] [PubMed]
- Simonyan, K.; Zisserman, A. Two-stream convolutional networks for action recognition in videos. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–11 December 2014; Volume 27. [Google Scholar]
- Cheng, K.; Zhang, Y.; He, X.; Cheng, J.; Lu, H. Extremely lightweight skeleton-based action recognition with shiftgcn++. IEEE Trans. Image Process. 2021, 30, 7333–7348. [Google Scholar] [CrossRef] [PubMed]
- Fan, X.; Qureshi, R.; Shahid, A.R.; Cao, J.; Yang, L.; Yan, H. Hybrid Separable Convolutional Inception Residual Network for Human Facial Expression Recognition. In Proceedings of the 2020 International Conference on Machine Learning and Cybernetics (ICMLC), Adelaide, Australia, 2 December 2020; pp. 21–26. [Google Scholar] [CrossRef]
- Carreira, J.; Zisserman, A. Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6299–6308. [Google Scholar]
- Crasto, N.; Weinzaepfel, P.; Alahari, K.; Schmid, C. Mars: Motion-augmented rgb stream for action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7882–7891. [Google Scholar]
- Lea, C.; Flynn, M.D.; Vidal, R.; Reiter, A.; Hager, G.D. Temporal convolutional networks for action segmentation and detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 156–165. [Google Scholar]
- Keskar, N.S.; Mudigere, D.; Nocedal, J.; Smelyanskiy, M.; Tang, P.T.P. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv 2016, arXiv:1609.04836. [Google Scholar]
- Soomro, K.; Zamir, A.R.; Shah, M. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv 2012, arXiv:1212.0402. [Google Scholar]
- Kuehne, H.; Jhuang, H.; Garrote, E.; Poggio, T.; Serre, T. HMDB: A large video database for human motion recognition. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; IEEE: Piscataway, NJ, USA, 2011; pp. 2556–2563. [Google Scholar]
- Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning, PMLR, Lille, France, 7–9 July 2015; pp. 448–456. [Google Scholar]
- Kohler, J.; Daneshmand, H.; Lucchi, A.; Zhou, M.; Neymeyr, K.; Hofmann, T. Towards a theoretical understanding of batch normalization. Stat 2018, 1050, 27. [Google Scholar]
- Liu, H.; Brock, A.; Simonyan, K.; Le, Q. Evolving normalization-activation layers. Adv. Neural Inf. Process. Syst. 2020, 33, 13539–13550. [Google Scholar]
- Ramachandran, P.; Zoph, B.; Le, Q.V. Searching for activation functions. arXiv 2017, arXiv:1710.05941. [Google Scholar]
- Misra, D. Mish: A self regularized non-monotonic activation function. arXiv 2019, arXiv:1908.08681. [Google Scholar]
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
- Zou, Z.; Chen, K.; Shi, Z.; Guo, Y.; Ye, J. Object detection in 20 years: A survey. Proc. IEEE 2023, 111, 257–276. [Google Scholar] [CrossRef]
- Nawaz, M.; Qureshi, R.; Teevno, M.A.; Shahid, A.R. Object detection and segmentation by composition of fast fuzzy C-mean clustering based maps. J. Ambient Intell. Humaniz. Comput. 2023, 14, 7173–7188. [Google Scholar] [CrossRef]
- Hafiz, A.M.; Bhat, G.M. A survey on instance segmentation: State of the art. Int. J. Multimed. Inf. Retr. 2020, 9, 171–189. [Google Scholar] [CrossRef]
- Rachmadi, R.F.; Uchimura, K.; Koutaki, G. Video classification using compacted dataset based on selected keyframe. In Proceedings of the 2016 IEEE Region 10 Conference (TENCON), Singapore, 22–25 November 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 873–878. [Google Scholar]
- Brattoli, B.; Tighe, J.; Zhdanov, F.; Perona, P.; Chalupka, K. Rethinking zero-shot video classification: End-to-end training for realistic applications. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 4613–4623. [Google Scholar]
- Herath, S.; Harandi, M.; Porikli, F. Going deeper into action recognition: A survey. Image Vis. Comput. 2017, 60, 4–21. [Google Scholar] [CrossRef]
- He, J.Y.; Wu, X.; Cheng, Z.Q.; Yuan, Z.; Jiang, Y.G. DB-LSTM: Densely-connected Bi-directional LSTM for human action recognition. Neurocomputing 2021, 444, 319–331. [Google Scholar] [CrossRef]
- Wang, H.; Schmid, C. Action recognition with improved trajectories. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013; pp. 3551–3558. [Google Scholar]
- Kawaguchi, K.; Kaelbling, L.P.; Bengio, Y. Generalization in deep learning. arXiv 2017, arXiv:1710.05468. [Google Scholar]
- Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7794–7803. [Google Scholar]
- Tran, D.; Ray, J.; Shou, Z.; Chang, S.F.; Paluri, M. Convnet architecture search for spatiotemporal feature learning. arXiv 2017, arXiv:1708.05038. [Google Scholar]
- Karpathy, A.; Toderici, G.; Shetty, S.; Leung, T.; Sukthankar, R.; Fei-Fei, L. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 18–23 June 2014; pp. 1725–1732. [Google Scholar]
- Tran, D.; Wang, H.; Torresani, L.; Ray, J.; LeCun, Y.; Paluri, M. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6450–6459. [Google Scholar]
- Feichtenhofer, C. X3d: Expanding architectures for efficient video recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 203–213. [Google Scholar]
- Iandola, F.N.; Han, S.; Moskewicz, M.W.; Ashraf, K.; Dally, W.J.; Keutzer, K. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv 2016, arXiv:1602.07360. [Google Scholar]
- Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
- Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520. [Google Scholar]
- Zhang, X.; Zhou, X.; Lin, M.; Sun, J. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6848–6856. [Google Scholar]
- Ji, L.; Zhang, J.; Zhang, C.; Ma, C.; Xu, S.; Sun, K. CondenseNet with exclusive lasso regularization. Neural Comput. Appl. 2021, 33, 16197–16212. [Google Scholar] [CrossRef] [PubMed]
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
- Jiang, Y.; Krishnan, D.; Mobahi, H.; Bengio, S. Predicting the generalization gap in deep networks with margin distributions. arXiv 2018, arXiv:1810.00113. [Google Scholar]
- Ioffe, S. Batch renormalization: Towards reducing minibatch dependence in batch-normalized models. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
- Wolf, W. Key frame selection by motion analysis. In Proceedings of the 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings, Atlanta, GA, USA, 9 May 1996; IEEE: Piscataway, NJ, USA, 1996; Volume 2, pp. 1228–1231. [Google Scholar]
- Lindeberg, T. Scale invariant feature transform. Scholarpedia 2012, 7, 10491. [Google Scholar] [CrossRef]
- Yan, X.; Gilani, S.Z.; Feng, M.; Zhang, L.; Qin, H.; Mian, A. Self-supervised learning to detect key frames in videos. Sensors 2020, 20, 6941. [Google Scholar] [CrossRef] [PubMed]
- Joulin, A.; van der Maaten, L.; Jabri, A.; Vasilache, N. Learning visual features from large weakly supervised data. arXiv 2015, arXiv:1511.02251. [Google Scholar]
- Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324. [Google Scholar]
- Tan, M.; Chen, B.; Pang, R.; Vasudevan, V.; Sandler, M.; Howard, A.; Le, Q.V. Mnasnet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 2820–2828. [Google Scholar]
- Oord, A.v.d.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A.; Kavukcuoglu, K. Wavenet: A generative model for raw audio. arXiv 2016, arXiv:1609.03499. [Google Scholar]
- Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958. [Google Scholar]
- Rakêt, L.L.; Roholm, L.; Nielsen, M.; Lauze, F. TV-L1 optical flow for vector valued images. In Proceedings of the Energy Minimization Methods in Computer Vision and Pattern Recognition: 8th International Conference, EMMCVPR 2011, St. Petersburg, Russia, 25–27 July 2011; Proceedings 8; Springer: Berlin/Heidelberg, Germany, 2011; pp. 329–343. [Google Scholar]
- Hara, K.; Kataoka, H.; Satoh, Y. Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6546–6555. [Google Scholar]
- Jiang, S.; Qi, Y.; Zhang, H.; Bai, Z.; Lu, X.; Wang, P. D3d: Dual 3-d convolutional network for real-time action recognition. IEEE Trans. Ind. Inform. 2020, 17, 4584–4593. [Google Scholar] [CrossRef]
- Xie, S.; Sun, C.; Huang, J.; Tu, Z.; Murphy, K. Rethinking Spatiotemporal Feature Learning for Video Understanding. arXiv 2017, arXiv:1712.04851. [Google Scholar]
- Wang, L.; Li, W.; Li, W.; Van Gool, L. Appearance-and-relation networks for video classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1430–1439. [Google Scholar]
- Zhu, L.; Tran, D.; Sevilla-Lara, L.; Yang, Y.; Feiszli, M.; Wang, H. Faster recurrent networks for efficient video classification. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 13098–13105. [Google Scholar]
- Lin, J.; Gan, C.; Han, S. TSM: Temporal Shift Module for Efficient Video Understanding. arXiv 2019, arXiv:cs.CV/1811.08383. [Google Scholar]
- Jiang, B.; Wang, M.; Gan, W.; Wu, W.; Yan, J. Stm: Spatiotemporal and motion encoding for action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 2000–2009. [Google Scholar]
- Kwon, H.; Kim, M.; Kwak, S.; Cho, M. Motionsqueeze: Neural motion feature learning for video understanding. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XVI 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 345–362. [Google Scholar]
- Li, X.; Zhang, Y.; Liu, C.; Shuai, B.; Zhu, Y.; Brattoli, B.; Chen, H.; Marsic, I.; Tighe, J. Vidtr: Video transformer without convolutions. arXiv 2021, arXiv:2104.11746. [Google Scholar]
Model/Activation | Accuracy
---|---
MoviNet-A2* Swish | 87.52%
MoviNet-A2* Mish | 88.11%
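The two activations compared above differ only in their gating term: Swish multiplies x by sigmoid(x) [18], while Mish multiplies x by tanh(softplus(x)) [19]. Section 6.1.1 applies this swap inside EvoNorm; since the paper does not reproduce the layer here, the PyTorch sketch below is only one plausible reading of an EvoNorm-S0-style layer with a Mish numerator. The class name, group count, 4D (image) tensor shape, and the choice to apply Mish to v·x are our assumptions; MoviNet blocks would use the 5D video analogue.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EvoNormS0Mish(nn.Module):
    """EvoNorm-S0-style layer with the Swish-like numerator replaced by Mish (sketch)."""
    def __init__(self, channels: int, groups: int = 8, eps: float = 1e-5):
        super().__init__()
        self.groups = groups
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(1, channels, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.v = nn.Parameter(torch.ones(1, channels, 1, 1))

    def _group_std(self, x: torch.Tensor) -> torch.Tensor:
        # Standard deviation over each channel group, broadcast back to (N, C, H, W).
        n, c, h, w = x.shape
        g = x.view(n, self.groups, c // self.groups, h, w)
        std = torch.sqrt(g.var(dim=(2, 3, 4), keepdim=True) + self.eps)
        return std.expand_as(g).reshape(n, c, h, w)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Mish gate: (v*x) * tanh(softplus(v*x)) replaces the sigmoid-based Swish term.
        num = F.mish(self.v * x)
        return num / self._group_std(x) * self.gamma + self.beta
```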
Model/Frame Selection | Pre-Trained | Accuracy
---|---|---
MoviNet-A2* Random Frame Selection | Kinetics-700 | 87.25%
MoviNet-A2* Optimal Frame Selection | Kinetics-700 | 87.76%
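Section 6.1.2 names TV-L1 optical flow and local binary patterns as the ingredients of optimal frame selection, and Figure 1 ties frame informativeness to histogram stability, but the exact scoring rule is not spelled out in this excerpt. The sketch below is therefore only one plausible combination under stated assumptions: the score (mean flow magnitude divided by LBP-histogram variance), the helper names, the value of `k`, and the use of opencv-contrib's TV-L1 implementation are ours, not the authors'.

```python
# pip install opencv-contrib-python scikit-image numpy
import cv2
import numpy as np
from skimage.feature import local_binary_pattern

def keyframe_scores(gray_frames, points=8, radius=1):
    """Score consecutive frames of one clip by TV-L1 motion and LBP-histogram stability."""
    tvl1 = cv2.optflow.DualTVL1OpticalFlow_create()  # requires opencv-contrib
    scores = []
    for prev, curr in zip(gray_frames[:-1], gray_frames[1:]):
        flow = tvl1.calc(prev, curr, None)              # (H, W, 2) dense flow field
        magnitude = np.linalg.norm(flow, axis=2)        # per-pixel motion energy
        lbp = local_binary_pattern(curr, points, radius, "uniform")
        hist, _ = np.histogram(lbp, bins=points + 2, density=True)
        # Assumed weighting: high motion and a stable (low-variance) texture
        # histogram both raise the score, following the Figure 1 description.
        scores.append(magnitude.mean() / (np.var(hist) + 1e-8))
    return np.asarray(scores)

def select_keyframes(gray_frames, k=16):
    """Return temporally sorted indices of the k highest-scoring frames."""
    scores = keyframe_scores(gray_frames)
    return np.sort(np.argsort(scores)[-k:])
```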
Model | Pre-Trained | UCF101 | HMDB51 | GFLOPs | Parameters
---|---|---|---|---|---
STAM16 [12] | - | 97% | - | 270 | 96 M
I3D [9] | ImageNet-Kinetics | 95.6% | 74.8% | 108 | 28 M
S3D [54] | ImageNet-Kinetics | 96.8% | 75.9% | 66.4 | -
R(2 + 1)D [33] | Sports1M | 93.6% | 66.6% | 152 | -
ARTNet [55] | Kinetics-600 | 94.3% | 70.9% | - | -
FASTER32 [56] | Kinetics-600 | 96.9% | 75.7% | 67.7 | -
TSM-R50 [57] | - | 95.9% | 73.5% | 65 | 24.3 M
STM-R50 [58] | - | 96.2% | 72.2% | 67 | 24 M
MSNet-R50 [59] | - | - | 77.4% | 67.9 | 24.6 M
VidTr-L [60] | - | 96.7% | 74.4% | 59 | -
MARS-RGB [10] | Kinetics-600 | 94.6% | - | 76.68 | 47.63 M
MARS-Flow [10] | Kinetics-600 | 95.6% | 74.5% | 66.88 | -
EMO-MoviNet-A2 | Kinetics-600 | 89.7% | 79.3% | 10.3 | 4.8 M
EMO-MoviNet-A2* | Kinetics-600 | 91.8% | 81.53% | 10.4 | 4.8 M
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Hussain, T.; Memon, Z.A.; Qureshi, R.; Alam, T. EMO-MoviNet: Enhancing Action Recognition in Videos with EvoNorm, Mish Activation, and Optimal Frame Selection for Efficient Mobile Deployment. Sensors 2023, 23, 8106. https://doi.org/10.3390/s23198106