Video Activity Recognition: State-of-the-Art
Abstract
1. Introduction
- Apart from spatial information, temporal context across frames is also required.
- Huge computational cost.
- Datasets are more limited, due to the difficulty of collecting, annotating and storing videos.
2. Used Techniques
2.1. Methods Using Hand-Crafted Motion Features
- The first value indicates the presence of motion and where it occurs, by means of a binary motion-energy image (MEI). With $D(x, y, t)$ a binary image sequence indicating regions of motion and $\tau$ the value that defines the temporal extent of a movement, the MEI $E_\tau$ is defined as:

$$E_\tau(x, y, t) = \bigcup_{i=0}^{\tau - 1} D(x, y, t - i)$$
- The second value is a scalar-valued image where intensity is a function of the recency of motion, represented by a motion-history image (MHI) which indicates how the image is moving. It represents the temporal history of motion at each point, where recently moved pixels are brighter:

$$H_\tau(x, y, t) = \begin{cases} \tau & \text{if } D(x, y, t) = 1 \\ \max(0,\ H_\tau(x, y, t - 1) - 1) & \text{otherwise} \end{cases}$$
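To make the two templates concrete, here is a minimal NumPy sketch of the MEI/MHI update defined above. The frame-differencing step used to obtain the binary motion mask $D(x, y, t)$ and its threshold are illustrative assumptions; any motion-detection method could supply $D$.

```python
import numpy as np

def mei_mhi(frames, tau=20, diff_thresh=30):
    """Compute a motion-energy image (MEI) and motion-history image (MHI).

    frames: sequence of grayscale frames (uint8 arrays of shape H x W).
    tau: temporal extent of a movement, in frames.
    diff_thresh: assumed threshold for the frame-differencing motion mask.
    """
    mhi = np.zeros(frames[0].shape, dtype=np.float32)
    for prev, curr in zip(frames, frames[1:]):
        # D(x, y, t): binary motion mask, here from simple frame differencing.
        d = np.abs(curr.astype(np.int16) - prev.astype(np.int16)) > diff_thresh
        # H_tau(x, y, t) = tau where D = 1, else max(0, H_tau(x, y, t-1) - 1).
        mhi = np.where(d, float(tau), np.maximum(0.0, mhi - 1.0))
    # E_tau is the union of the last tau masks, i.e., wherever H_tau > 0.
    mei = mhi > 0
    return mei, mhi
```

Note that the MEI follows directly from thresholding the MHI above zero, which is what the last step exploits: a pixel has nonzero history exactly when it moved within the last $\tau$ frames.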
2.2. Depth Information Based Methods
2.3. Deep Learning Based Methods
3. Benchmark Datasets
3.1. UCF-101
- Human–Object Interaction: twenty categories.
- Body-Motion Only: sixteen categories.
- Human–Human Interaction: five categories.
- Playing Musical Instruments: ten categories.
- Sports: fifty categories.
3.2. HMDB51
- General facial actions: smile, laugh, chew, talk.
- Facial actions with object manipulation: smoke, eat, drink.
- General body movements: cartwheel, clap hands, climb, climb stairs, dive, fall on the floor, backhand flip, handstand, jump, pull up, push up, run, sit down, sit up, somersault, stand up, turn, walk, wave.
- Body movements with object interaction: brush hair, catch, draw sword, dribble, golf, hit something, kick ball, pick, pour, push something, ride bike, ride horse, shoot ball, shoot bow, shoot gun, swing baseball bat, sword exercise, throw.
- Body movements for human interaction: fencing, hug, kick someone, kiss, punch, shake hands, sword fight.
3.3. Weizmann
3.4. MSRAction3D
3.5. ActivityNet
3.6. Something Something
3.7. Sports-1M
3.8. AVA
4. Results
5. Discussion
6. Conclusions
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
References
- Avci, A.; Bosch, S.; Marin-Perianu, M.; Marin-Perianu, R.; Havinga, P. Activity recognition using inertial sensing for healthcare, wellbeing and sports applications: A survey. In Proceedings of the 23rd International Conference on Architecture of Computing Systems 2010, Hannover, Germany, 22–23 February 2010; pp. 1–10.
- Mulroy, S.; Gronley, J.; Weiss, W.; Newsam, C.; Perry, J. Use of cluster analysis for gait pattern classification of patients in the early and late recovery phases following stroke. Gait Posture 2003, 18, 114–125.
- Rautaray, S.S.; Agrawal, A. Vision based hand gesture recognition for human computer interaction: A survey. Artif. Intell. Rev. 2015, 43, 1–54.
- Mitra, S.; Acharya, T. Gesture recognition: A survey. IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.) 2007, 37, 311–324.
- Vishwakarma, S.; Agrawal, A. A survey on activity recognition and behavior understanding in video surveillance. Vis. Comput. 2013, 29, 983–1009.
- Leo, M.; D’Orazio, T.; Spagnolo, P. Human activity recognition for automatic visual surveillance of wide areas. In Proceedings of the ACM 2nd International Workshop on Video Surveillance & Sensor Networks, New York, NY, USA, 15 October 2004; pp. 124–130.
- Coppola, C.; Cosar, S.; Faria, D.R.; Bellotto, N. Social Activity Recognition on Continuous RGB-D Video Sequences. Int. J. Soc. Robot. 2019, 1–15.
- Coppola, C.; Faria, D.R.; Nunes, U.; Bellotto, N. Social activity recognition based on probabilistic merging of skeleton features with proximity priors from RGB-D data. In Proceedings of the 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Daejeon, Korea, 9–14 October 2016; pp. 5055–5061.
- Lin, W.; Sun, M.T.; Poovandran, R.; Zhang, Z. Human activity recognition for video surveillance. In Proceedings of the 2008 IEEE International Symposium on Circuits and Systems, Seattle, WA, USA, 18–21 May 2008; pp. 2737–2740.
- Nair, V.; Clark, J.J. Automated visual surveillance using Hidden Markov Models. In Proceedings of the International Conference on Vision Interface, 2002; pp. 88–93. Available online: https://pdfs.semanticscholar.org/8fcf/7e455419fac79d65c62a3e7f39a945fa5be0.pdf (accessed on 15 July 2019).
- Ma, M.; Meyer, B.J.; Lin, L.; Proffitt, R.; Skubic, M. VicoVR-Based Wireless Daily Activity Recognition and Assessment System for Stroke Rehabilitation. In Proceedings of the 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Madrid, Spain, 3–6 December 2018; pp. 1117–1121.
- Ke, S.R.; Thuc, H.; Lee, Y.J.; Hwang, J.N.; Yoo, J.H.; Choi, K.H. A review on video-based human activity recognition. Computers 2013, 2, 88–131.
- Dawn, D.D.; Shaikh, S.H. A comprehensive survey of human action recognition with spatio-temporal interest point (STIP) detector. Vis. Comput. 2016, 32, 289–306.
- Herath, S.; Harandi, M.; Porikli, F. Going deeper into action recognition: A survey. Image Vis. Comput. 2017, 60, 4–21.
- Kumar, S.S.; John, M. Human activity recognition using optical flow based feature set. In Proceedings of the 2016 IEEE International Carnahan Conference on Security Technology (ICCST), Orlando, FL, USA, 24–27 October 2016; pp. 1–5.
- Guo, K.; Ishwar, P.; Konrad, J. Action recognition using sparse representation on covariance manifolds of optical flow. In Proceedings of the 2010 7th IEEE International Conference on Advanced Video and Signal Based Surveillance, Boston, MA, USA, 29 August–1 September 2010; pp. 188–195.
- Niu, F.; Abdel-Mottaleb, M. HMM-based segmentation and recognition of human activities from video sequences. In Proceedings of the 2005 IEEE International Conference on Multimedia and Expo, Amsterdam, The Netherlands, 6 July 2005; pp. 804–807.
- Raman, N.; Maybank, S.J. Activity recognition using a supervised non-parametric hierarchical HMM. Neurocomputing 2016, 199, 163–177.
- Liciotti, D.; Duckett, T.; Bellotto, N.; Frontoni, E.; Zingaretti, P. HMM-based activity recognition with a ceiling RGB-D camera. In Proceedings of ICPRAM 2017, the 6th International Conference on Pattern Recognition Applications and Methods, Porto, Portugal, 24–26 February 2017.
- Ma, M.; Fan, H.; Kitani, K.M. Going deeper into first-person activity recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1894–1903.
- Nunez, J.C.; Cabido, R.; Pantrigo, J.J.; Montemayor, A.S.; Velez, J.F. Convolutional neural networks and long short-term memory for skeleton-based human activity and hand gesture recognition. Pattern Recognit. 2018, 76, 80–94.
- Sadanand, S.; Corso, J.J. Action bank: A high-level representation of activity in video. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 1234–1241.
- Ng, J.Y.H.; Davis, L.S. Temporal difference networks for video action recognition. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV); IEEE: Piscataway, NJ, USA, 2018; pp. 1587–1596.
- Lan, T.; Sigal, L.; Mori, G. Social roles in hierarchical models for human activity recognition. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 1354–1361.
- Vahora, S.; Chauhan, N. Deep neural network model for group activity recognition using contextual relationship. Eng. Sci. Technol. Int. J. 2019, 22, 47–54.
- Huang, S.C. An advanced motion detection algorithm with video quality analysis for video surveillance systems. IEEE Trans. Circuits Syst. Video Technol. 2010, 21, 1–14.
- Hu, W.; Tan, T.; Wang, L.; Maybank, S. A survey on visual surveillance of object motion and behaviors. IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.) 2004, 34, 334–352.
- Gaba, N.; Barak, N.; Aggarwal, S. Motion detection, tracking and classification for automated video surveillance. In Proceedings of the 2016 IEEE 1st International Conference on Power Electronics, Intelligent Control and Energy Systems (ICPEICES), Delhi, India, 4–6 July 2016; pp. 1–5.
- Trucco, E.; Plakas, K. Video tracking: A concise survey. IEEE J. Ocean. Eng. 2006, 31, 520–529.
- Maggio, E.; Cavallaro, A. Video Tracking: Theory and Practice; John Wiley & Sons: Hoboken, NJ, USA, 2011.
- Del Rincón, J.M.; Santofimia, M.J.; Nebel, J.C. Common-sense reasoning for human action recognition. Pattern Recognit. Lett. 2013, 34, 1849–1860.
- Santofimia, M.J.; Martinez-del Rincon, J.; Nebel, J.C. Episodic reasoning for vision-based human action recognition. Sci. World J. 2014, 2014.
- Onofri, L.; Soda, P.; Pechenizkiy, M.; Iannello, G. A survey on using domain and contextual knowledge for human activity recognition in video streams. Expert Syst. Appl. 2016, 63, 97–111.
- Wang, X.; Gao, L.; Song, J.; Zhen, X.; Sebe, N.; Shen, H.T. Deep appearance and motion learning for egocentric activity recognition. Neurocomputing 2018, 275, 438–447.
- Aggarwal, J.K.; Ryoo, M.S. Human activity analysis: A review. ACM Comput. Surv. (CSUR) 2011, 43, 16.
- Kong, Y.; Fu, Y. Human Action Recognition and Prediction: A Survey. arXiv 2018, arXiv:1806.11230.
- Raptis, M.; Sigal, L. Poselet key-framing: A model for human activity recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 2650–2657.
- Wang, Y.; Sun, S.; Ding, X. A self-adaptive weighted affinity propagation clustering for key frames extraction on human action recognition. J. Vis. Commun. Image Represent. 2015, 33, 193–202.
- Niebles, J.C.; Wang, H.; Fei-Fei, L. Unsupervised learning of human action categories using spatial-temporal words. Int. J. Comput. Vis. 2008, 79, 299–318.
- Dollár, P.; Rabaud, V.; Cottrell, G.; Belongie, S. Behavior recognition via sparse spatio-temporal features. In Proceedings of the 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, Beijing, China, 15–16 October 2005; pp. 65–72.
- Bregonzio, M.; Gong, S.; Xiang, T. Recognising action as clouds of space-time interest points. In Proceedings of CVPR 2009, Miami Beach, FL, USA, 20–25 June 2009; Volume 9, pp. 1948–1955.
- Laptev, I.; Marszalek, M.; Schmid, C.; Rozenfeld, B. Learning realistic human actions from movies. In Proceedings of the 2008 IEEE Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA, 23–28 June 2008; pp. 1–8.
- Ngo, C.W.; Pong, T.C.; Zhang, H.J. Motion-based video representation for scene change detection. Int. J. Comput. Vis. 2002, 50, 127–142.
- Sand, P.; Teller, S. Particle video: Long-range motion estimation using point trajectories. Int. J. Comput. Vis. 2008, 80, 72.
- Lertniphonphan, K.; Aramvith, S.; Chalidabhongse, T.H. Human action recognition using direction histograms of optical flow. In Proceedings of the 2011 11th International Symposium on Communications & Information Technologies (ISCIT), Hangzhou, China, 12–14 October 2011; pp. 574–579.
- Chaudhry, R.; Ravichandran, A.; Hager, G.; Vidal, R. Histograms of oriented optical flow and Binet–Cauchy kernels on nonlinear dynamical systems for the recognition of human actions. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 1932–1939.
- Bobick, A.F.; Davis, J.W. The recognition of human movement using temporal templates. IEEE Trans. Pattern Anal. Mach. Intell. 2001, 23, 257–267.
- Bobick, A.; Davis, J. An appearance-based representation of action. In Proceedings of the 1996 International Conference on Pattern Recognition (ICPR ’96), Washington, DC, USA, 25–30 August 1996; pp. 307–312.
- Schuldt, C.; Laptev, I.; Caputo, B. Recognizing human actions: A local SVM approach. In Proceedings of the 17th International Conference on Pattern Recognition (ICPR’04), Washington, DC, USA, 23–26 August 2004; pp. 32–36.
- Laptev, I. On space-time interest points. Int. J. Comput. Vis. 2005, 64, 107–123.
- Cristianini, N.; Shawe-Taylor, J. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods; Cambridge University Press: Cambridge, UK, 2000.
- Vapnik, V.N. An overview of statistical learning theory. IEEE Trans. Neural Netw. 1999, 10, 988–999.
- Wallraven, C.; Caputo, B.; Graf, A. Recognition with local features: The kernel recipe. In Proceedings of the Ninth IEEE International Conference on Computer Vision, Nice, France, 13–16 October 2003; p. 257.
- Wolf, L.; Shashua, A. Kernel principal angles for classification machines with applications to image sequence interpretation. In Proceedings of the 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Madison, WI, USA, 18–20 June 2003.
- Niebles, J.C.; Fei-Fei, L. A hierarchical model of shape and appearance for human action classification. In Proceedings of the 2007 IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, USA, 17–22 June 2007; pp. 1–8.
- Bouchard, G.; Triggs, B. Hierarchical part-based visual object categorization. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Diego, CA, USA, 20–25 June 2005; pp. 710–715.
- Bosch, A.; Zisserman, A.; Munoz, X. Representing shape with a spatial pyramid kernel. In Proceedings of the 6th ACM International Conference on Image and Video Retrieval, Amsterdam, The Netherlands, 9–11 July 2007; pp. 401–408.
- Lazebnik, S.; Schmid, C.; Ponce, J. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), New York, NY, USA, 17–22 June 2006; pp. 2169–2178.
- Marszałek, M.; Schmid, C.; Harzallah, H.; Van De Weijer, J. Learning object representations for visual object class recognition. In Proceedings of the Visual Recognition Challenge Workshop, in conjunction with ICCV, Rio de Janeiro, Brazil, October 2007. Available online: https://hal.inria.fr/inria-00548669/ (accessed on 15 July 2019).
- Zhang, J.; Marszałek, M.; Lazebnik, S.; Schmid, C. Local features and kernels for classification of texture and object categories: A comprehensive study. Int. J. Comput. Vis. 2007, 73, 213–238.
- Harris, C.; Stephens, M. A combined corner and edge detector. In Proceedings of the 4th Alvey Vision Conference, Manchester, UK, 31 August–2 September 1988; pp. 147–151.
- Horn, B.K.; Schunck, B.G. Determining optical flow. Artif. Intell. 1981, 17, 185–203.
- Chen, C.C.; Aggarwal, J. Recognizing human action from a far field of view. In Proceedings of the 2009 Workshop on Motion and Video Computing (WMVC), Snowbird, UT, USA, 8–9 December 2009; pp. 1–7.
- Blank, M.; Gorelick, L.; Shechtman, E.; Irani, M.; Basri, R. Actions as space-time shapes. In Proceedings of the Tenth IEEE International Conference on Computer Vision (ICCV’05), Beijing, China, 17–21 October 2005; pp. 1395–1402.
- Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; pp. 886–893.
- Hatun, K.; Duygulu, P. Pose sentences: A new representation for action recognition using sequence of pose words. In Proceedings of the 2008 19th International Conference on Pattern Recognition, Tampa, FL, USA, 8–11 December 2008; pp. 1–4.
- Li, X. HMM based action recognition using oriented histograms of optical flow field. Electron. Lett. 2007, 43, 560–561.
- Lu, W.L.; Little, J.J. Simultaneous tracking and action recognition using the PCA-HOG descriptor. In Proceedings of the 3rd Canadian Conference on Computer and Robot Vision (CRV’06), Quebec City, QC, Canada, 7–9 June 2006; p. 6.
- Thurau, C. Behavior histograms for action recognition and human detection. In Human Motion–Understanding, Modeling, Capture and Animation; Springer: Berlin/Heidelberg, Germany, 2007; pp. 299–312.
- Santiago-Mozos, R.; Leiva-Murillo, J.M.; Pérez-Cruz, F.; Artes-Rodriguez, A. Supervised-PCA and SVM classifiers for object detection in infrared images. In Proceedings of the IEEE Conference on Advanced Video and Signal Based Surveillance, Washington, DC, USA, 21–22 July 2003; pp. 122–127.
- Chang, C.C.; Lin, C.J. LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Technol. 2011, 2, 27.
- Vishwanathan, S.; Smola, A.J.; Vidal, R. Binet–Cauchy kernels on dynamical systems and its application to the analysis of dynamic scenes. Int. J. Comput. Vis. 2007, 73, 95–119.
- Schölkopf, B.; Smola, A.J. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond; MIT Press: Cambridge, MA, USA, 2002.
- Lucas, B.D.; Kanade, T. An Iterative Image Registration Technique with an Application to Stereo Vision. 1981. Available online: https://www.researchgate.net/publication/215458777_An_Iterative_Image_Registration_Technique_with_an_Application_to_Stereo_Vision_IJCAI (accessed on 15 July 2019).
- Lloyd, S. Least squares quantization in PCM. IEEE Trans. Inf. Theory 1982, 28, 129–137.
- Wang, H.; Schmid, C. Action Recognition with Improved Trajectories. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Sydney, Australia, 1–8 December 2013.
- Bay, H.; Tuytelaars, T.; Van Gool, L. SURF: Speeded up robust features. In Proceedings of the European Conference on Computer Vision, Graz, Austria, 7–13 May 2006; Springer: Berlin/Heidelberg, Germany; pp. 404–417.
- Farnebäck, G. Two-frame motion estimation based on polynomial expansion. In Proceedings of the Scandinavian Conference on Image Analysis, Halmstad, Sweden, 29 June–2 July 2003; Springer: Berlin/Heidelberg, Germany; pp. 363–370.
- Prest, A.; Schmid, C.; Ferrari, V. Weakly supervised learning of interactions between humans and objects. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 34, 601–614.
- Felzenszwalb, P.F.; Girshick, R.B.; McAllester, D.; Ramanan, D. Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 32, 1627–1645.
- Akpinar, S.; Alpaslan, F.N. Video action recognition using an optical flow based representation. In Proceedings of IPCV’14, the 2014 International Conference on Image Processing, Computer Vision, and Pattern Recognition, Las Vegas, NV, USA, 21–24 July 2014; p. 1.
- Shi, J.; Tomasi, C. Good Features to Track; Technical Report; Cornell University: Ithaca, NY, USA, 1993.
- Efros, A.A.; Berg, A.C.; Mori, G.; Malik, J. Recognizing action at a distance. In Proceedings of the Ninth IEEE International Conference on Computer Vision, Nice, France, 13–16 October 2003; p. 726.
- Tran, D.; Sorokin, A. Human activity recognition with metric learning. In Proceedings of the European Conference on Computer Vision, Marseille, France, 12–18 October 2008; pp. 548–561.
- Ercis, F. Comparison of Histogram of Oriented Optical Flow Based Action Recognition Methods. Ph.D. Thesis, Middle East Technical University, Ankara, Turkey, 2012.
- Li, H.; Achim, A.; Bull, D.R. GMM-based efficient foreground detection with adaptive region update. In Proceedings of the 16th IEEE International Conference on Image Processing (ICIP), Cairo, Egypt, 7–10 November 2009; pp. 3181–3184.
- Sehgal, S. Human Activity Recognition Using BPNN Classifier on HOG Features. In Proceedings of the 2018 International Conference on Intelligent Circuits and Systems (ICICS), Phagwara, India, 19–20 April 2018; pp. 286–289.
- Kuehne, H.; Jhuang, H.; Garrote, E.; Poggio, T.; Serre, T. HMDB: A large video database for human motion recognition. In Proceedings of the International Conference on Computer Vision (ICCV), Barcelona, Spain, 6–13 November 2011.
- Soomro, K.; Zamir, A.R.; Shah, M. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv 2012, arXiv:1212.0402.
- Marszałek, M.; Laptev, I.; Schmid, C. Actions in context. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami Beach, FL, USA, 20–25 June 2009; pp. 2929–2936.
- Niebles, J.C.; Chen, C.W.; Fei-Fei, L. Modeling temporal structure of decomposable motion segments for activity classification. In Proceedings of the European Conference on Computer Vision, Heraklion, Crete, Greece, 5–11 September 2010; pp. 392–405.
- Zhang, Z. Microsoft Kinect sensor and its effect. IEEE Multimed. 2012, 19, 4–10.
- Keselman, L.; Iselin Woodfill, J.; Grunnet-Jepsen, A.; Bhowmik, A. Intel RealSense stereoscopic depth cameras. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 1–10.
- Chen, J.; Wang, B.; Zeng, H.; Cai, C.; Ma, K.K. Sum-of-gradient based fast intra coding in 3D-HEVC for depth map sequence (SOG-FDIC). J. Vis. Commun. Image Represent. 2017, 48, 329–339.
- Liang, B.; Zheng, L. A survey on human action recognition using depth sensors. In Proceedings of the 2015 International Conference on Digital Image Computing: Techniques and Applications (DICTA), Adelaide, SA, Australia, 23–25 November 2015; pp. 1–8.
- Chen, C.; Liu, K.; Kehtarnavaz, N. Real-time human action recognition based on depth motion maps. J. Real-Time Image Process. 2016, 12, 155–163.
- El Madany, N.E.D.; He, Y.; Guan, L. Human action recognition via multiview discriminative analysis of canonical correlations. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; pp. 4170–4174.
- Yang, X.; Zhang, C.; Tian, Y. Recognizing actions using depth motion maps-based histograms of oriented gradients. In Proceedings of the 20th ACM International Conference on Multimedia, Nara, Japan, 29 October–2 November 2012; ACM: New York, NY, USA, 2012; pp. 1057–1060.
- Oreifej, O.; Liu, Z. HON4D: Histogram of oriented 4D normals for activity recognition from depth sequences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 716–723.
- Wang, J.; Liu, Z.; Wu, Y.; Yuan, J. Mining actionlet ensemble for action recognition with depth cameras. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 1290–1297.
- Wang, J.; Liu, Z.; Chorowski, J.; Chen, Z.; Wu, Y. Robust 3D action recognition with random occupancy patterns. In Computer Vision–ECCV 2012; Springer: Berlin/Heidelberg, Germany, 2012; pp. 872–885.
- Liu, M.; Liu, H.; Chen, C. Robust 3D action recognition through sampling local appearances and global distributions. IEEE Trans. Multimed. 2018, 20, 1932–1947.
- Seo, H.J.; Milanfar, P. Action recognition from one example. IEEE Trans. Pattern Anal. Mach. Intell. 2011, 33, 867–882.
- Satyamurthi, S.; Tian, J.; Chua, M.C.H. Action recognition using multi-directional projected depth motion maps. J. Ambient Intell. Humaniz. Comput. 2018, 1–7.
- Ojala, T.; Pietikäinen, M.; Mäenpää, T. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 971–987.
- Huang, G.B.; Zhu, Q.Y.; Siew, C.K. Extreme learning machine: Theory and applications. Neurocomputing 2006, 70, 489–501.
- Li, W.; Zhang, Z.; Liu, Z. Action recognition based on a bag of 3D points. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, San Francisco, CA, USA, 13–18 June 2010; pp. 9–14.
- Kurakin, A.; Zhang, Z.; Liu, Z. A real time system for dynamic hand gesture recognition with a depth sensor. In Proceedings of the 20th European Signal Processing Conference (EUSIPCO), Bucharest, Romania, 27–31 August 2012; pp. 1975–1979.
- Xia, L.; Chen, C.C.; Aggarwal, J.K. View invariant human action recognition using histograms of 3D joints. In Proceedings of the 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Providence, RI, USA, 16–21 June 2012; pp. 20–27.
- Karpathy, A.; Toderici, G.; Shetty, S.; Leung, T.; Sukthankar, R.; Fei-Fei, L. Large-scale video classification with convolutional neural networks. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014.
- LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324.
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; pp. 1097–1105.
- Farabet, C.; Couprie, C.; Najman, L.; LeCun, Y. Learning hierarchical features for scene labeling. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 1915–1929.
- Sermanet, P.; Eigen, D.; Zhang, X.; Mathieu, M.; Fergus, R.; LeCun, Y. OverFeat: Integrated recognition, localization and detection using convolutional networks. arXiv 2013, arXiv:1312.6229.
- Sharif Razavian, A.; Azizpour, H.; Sullivan, J.; Carlsson, S. CNN features off-the-shelf: An astounding baseline for recognition. arXiv 2014, arXiv:1403.6382.
- Dean, J.; Corrado, G.; Monga, R.; Chen, K.; Devin, M.; Mao, M.; Senior, A.; Tucker, P.; Yang, K.; Le, Q.V.; et al. Large scale distributed deep networks. In Advances in Neural Information Processing Systems; The MIT Press: Cambridge, MA, USA, 2012; pp. 1223–1231.
- Simonyan, K.; Zisserman, A. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems; The MIT Press: Cambridge, MA, USA, 2014; pp. 568–576.
- LeCun, Y.; Boser, B.; Denker, J.S.; Henderson, D.; Howard, R.E.; Hubbard, W.; Jackel, L.D. Backpropagation applied to handwritten zip code recognition. Neural Comput. 1989, 1, 541–551.
- Crammer, K.; Singer, Y. On the algorithmic implementation of multiclass kernel-based vector machines. J. Mach. Learn. Res. 2001, 2, 265–292.
- Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
- Donahue, J.; Anne Hendricks, L.; Guadarrama, S.; Rohrbach, M.; Venugopalan, S.; Saenko, K.; Darrell, T. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 2625–2634.
- Zaremba, W.; Sutskever, I. Learning to execute. arXiv 2014, arXiv:1410.4615.
- Wang, L.; Xiong, Y.; Wang, Z.; Qiao, Y. Towards good practices for very deep two-stream convNets. arXiv 2015, arXiv:1507.02159.
- Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1–9.
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
- Wang, L.; Qiao, Y.; Tang, X. Action recognition with trajectory-pooled deep-convolutional descriptors. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 4305–4314.
- Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 4489–4497.
- Ji, S.; Xu, W.; Yang, M.; Yu, K. 3D convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 221–231.
- Feichtenhofer, C.; Pinz, A.; Zisserman, A. Convolutional two-stream network fusion for video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1933–1941.
- Yue-Hei Ng, J.; Hausknecht, M.; Vijayanarasimhan, S.; Vinyals, O.; Monga, R.; Toderici, G. Beyond short snippets: Deep networks for video classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 4694–4702.
- Wang, L.; Xiong, Y.; Wang, Z.; Qiao, Y.; Lin, D.; Tang, X.; Van Gool, L. Temporal segment networks: Towards good practices for deep action recognition. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 20–36.
- Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv 2015, arXiv:1502.03167.
- Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958.
- Sun, L.; Jia, K.; Yeung, D.Y.; Shi, B.E. Human action recognition using factorized spatio-temporal convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 4597–4605.
- Bilen, H.; Fernando, B.; Gavves, E.; Vedaldi, A.; Gould, S. Dynamic image networks for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 3034–3042.
- Fernando, B.; Gavves, E.; Oramas, J.M.; Ghodrati, A.; Tuytelaars, T. Modeling video evolution for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 5378–5387.
- Jia, Y.; Shelhamer, E.; Donahue, J.; Karayev, S.; Long, J.; Girshick, R.; Guadarrama, S.; Darrell, T. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia, Orlando, FL, USA, 3–7 November 2014; ACM: New York, NY, USA, 2014; pp. 675–678.
- Carreira, J.; Zisserman, A. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 4724–4733.
- Varol, G.; Laptev, I.; Schmid, C. Long-term temporal convolutions for action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 1510–1517.
- Taylor, G.W.; Fergus, R.; LeCun, Y.; Bregler, C. Convolutional learning of spatio-temporal features. In Proceedings of the European Conference on Computer Vision, Heraklion, Crete, Greece, 5–11 September 2010; pp. 140–153.
- Ullah, A.; Ahmad, J.; Muhammad, K.; Sajjad, M.; Baik, S.W. Action Recognition in Video Sequences using Deep Bi-Directional LSTM With CNN Features. IEEE Access 2018, 6, 1155–1166.
- Graves, A.; Fernández, S.; Schmidhuber, J. Bidirectional LSTM networks for improved phoneme classification and recognition. In Proceedings of the International Conference on Artificial Neural Networks, Warsaw, Poland, 11–15 September 2005; pp. 799–804.
- Wang, J.; Cherian, A.; Porikli, F.; Gould, S. Video representation learning using discriminative pooling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1149–1158.
- Schindler, K.; Van Gool, L. Action snippets: How many frames does human action recognition require? In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2008), Anchorage, AK, USA, 23–28 June 2008.
- Wang, X.; Gao, L.; Wang, P.; Sun, X.; Liu, X. Two-stream 3D convNet fusion for action recognition in videos with arbitrary size and length. IEEE Trans. Multimed. 2018, 20, 634–644.
- Liu, J.; Luo, J.; Shah, M. Recognizing realistic actions from videos in the wild. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009.
- Wang, X.; Farhadi, A.; Gupta, A. Actions ~ Transformations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2658–2667.
- Chaquet, J.M.; Carmona, E.J.; Fernández-Caballero, A. A survey of video datasets for human action and activity recognition. Comput. Vis. Image Underst. 2013, 117, 633–659.
- UCF101: Action Recognition Data Set. Available online: https://www.crcv.ucf.edu/data/UCF101.php (accessed on 15 July 2019).
- UCF50: Action Recognition Data Set. Available online: https://www.crcv.ucf.edu/data/UCF50.php (accessed on 15 July 2019).
- HMDB: A Large Human Motion Database. Available online: http://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database/ (accessed on 15 July 2019).
- Actions as Space-Time Shapes. Available online: http://www.wisdom.weizmann.ac.il/~vision/SpaceTimeActions.html (accessed on 15 July 2019).
- MSR Action Recognition Dataset. Available online: http://research.microsoft.com/en-us/um/people/zliu/actionrecorsrc/ (accessed on 15 July 2019).
- Caba Heilbron, F.; Escorcia, V.; Ghanem, B.; Carlos Niebles, J. ActivityNet: A large-scale video benchmark for human activity understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 961–970.
- A Large-Scale Video Benchmark for Human Activity Understanding. Available online: http://activity-net.org/ (accessed on 15 July 2019).
- Goyal, R.; Kahou, S.E.; Michalski, V.; Materzynska, J.; Westphal, S.; Kim, H.; Haenel, V.; Fruend, I.; Yianilos, P.; Mueller-Freitag, M.; et al. The “Something Something” Video Database for Learning and Evaluating Visual Common Sense. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; Volume 1, p. 3.
- The 20BN-something-something Dataset V2. Available online: https://20bn.com/datasets/something-something (accessed on 15 July 2019).
- The Sports-1M Dataset. Available online: https://github.com/gtoderici/sports-1m-dataset/blob/wiki/ProjectHome.md (accessed on 15 July 2019).
- YouTube-8M: A Large and Diverse Labeled Video Dataset for Video Understanding Research. Available online: https://research.google.com/youtube8m/ (accessed on 15 July 2019).
- Gu, C.; Sun, C.; Ross, D.A.; Vondrick, C.; Pantofaru, C.; Li, Y.; Vijayanarasimhan, S.; Toderici, G.; Ricco, S.; Sukthankar, R.; et al. AVA: A video dataset of spatio-temporally localized atomic visual actions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6047–6056.
- AVA: A Video Dataset of Atomic Visual Action. Available online: https://research.google.com/ava/explore.html (accessed on 15 July 2019).
- Lan, Z.; Lin, M.; Li, X.; Hauptmann, A.G.; Raj, B. Beyond Gaussian pyramid: Multi-skip feature stacking for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 204–212.
- A Universal Labeling Tool: Sloth. Available online: https://cvhci.anthropomatik.kit.edu/~baeuml/projects/a-universal-labeling-tool-for-computer-vision-sloth/ (accessed on 15 July 2019).
- Russell, B.C.; Torralba, A.; Murphy, K.P.; Freeman, W.T. LabelMe: A Database and Web-Based Tool for Image Annotation. Int. J. Comput. Vis. 2008, 77, 157–173.
- LabelMe. Available online: http://labelme.csail.mit.edu/Release3.0/ (accessed on 15 July 2019).
- LabelBox. Available online: https://labelbox.com/ (accessed on 15 July 2019).
| Method | Year | Summary | Datasets |
|---|---|---|---|
| Bobick et al. [47] | 2001 | Use of motion-energy images (MEI) and motion-history images (MHI). | - |
| Schuldt et al. [49] | 2004 | Use of local space-time features to recognize complex motion patterns. | KTH Action [49] |
| Niebles et al. [55] | 2007 | Use of a hybrid hierarchical model, combining static and dynamic features. | Weizmann [64] |
| Laptev et al. [42] | 2008 | Use of spatio-temporal features, extending spatial pyramids to spatio-temporal pyramids. | KTH Action [49], Hollywood [42] |
| Chen et al. [63] | 2009 | Use of HOG for human pose representation and HOOF to characterize human motion. | Weizmann [64], Soccer [83], Tower [63] |
| Chaudhry et al. [46] | 2009 | Use of HOOF features, computing optical flow at every frame and binning the vectors according to their primary angles. | Weizmann [64] |
| Lertniphonphan et al. [45] | 2011 | Use of a motion descriptor based on the direction of optical flow. | Weizmann [64] |
| Wang et al. [76] | 2013 | Use of camera motion estimation to correct dense trajectories. | HMDB51 [88], UCF101 [89], Hollywood2 [90], Olympic Sports [91] |
| Akpinar et al. [81] | 2014 | Use of a generic temporal video segment representation, introducing a new velocity concept: weighted frame velocity. | Weizmann [64], Hollywood [42] |
| Kumar et al. [15] | 2016 | Use of a local descriptor built from optical flow vectors along the edges of the action performers. | Weizmann [64], KTH Action [49] |
| Sehgal, S. [87] | 2018 | Use of background subtraction, HOG features and a BPNN classifier. | Weizmann [64] |
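Several of the descriptors in the table above are built directly on optical flow histograms. As an illustration, here is a minimal sketch of a HOOF-style descriptor in the spirit of Chaudhry et al. [46], computed with OpenCV's Farnebäck optical flow. The bin count and the Farnebäck parameters are assumptions, and the original formulation additionally bins angles symmetrically about the vertical axis for mirror invariance, which this simplified version omits.

```python
import cv2
import numpy as np

def hoof(prev_gray, curr_gray, n_bins=8):
    """Histogram of (oriented) optical flow for one pair of grayscale frames.

    Flow vectors are binned by angle and weighted by magnitude; the L1
    normalization makes the histogram independent of image size and of
    the overall amount of motion.
    """
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    hist, _ = np.histogram(ang.ravel(), bins=n_bins,
                           range=(0, 2 * np.pi), weights=mag.ravel())
    return hist / (hist.sum() + 1e-8)
```

A video-level representation can then be built by computing this histogram for every consecutive frame pair and modeling the resulting time series, as done in [46] with nonlinear dynamical systems.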
| Method | Year | Summary | Datasets |
|---|---|---|---|
| Yang et al. [98] | 2012 | Use of depth motion maps (DMM), combined with HOG descriptors. | MSRAction3D [107] |
| Oreifej et al. [99] | 2013 | Use of the histogram of oriented 4D surface normals (HON4D) descriptor. | MSRAction3D [107], MSRGesture3D [108], 3D Action Pairs [99] |
| Liu et al. [102] | 2018 | Use of a two-layer BoVW model, using motion-based and shape-based STIPs to distinguish the action. | MSRAction3D [107], UTKinect-Action [109], MSRGesture3D [108], MSRDailyActivity3D [100] |
| Satyamurthi et al. [104] | 2018 | Use of multi-directional projected depth motion maps (MPDMM). | MSRAction3D [107], MSRGesture3D [108] |
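For intuition about the depth motion map (DMM) representation used by several of these methods, the following is a simplified front-view-only sketch in the spirit of Yang et al. [98]: motion energy between consecutive depth frames is accumulated over the whole sequence. The full method projects each depth frame onto three orthogonal planes (front, side and top) and builds one map per view before extracting HOG descriptors; the motion threshold here is an assumption.

```python
import numpy as np

def depth_motion_map(depth_frames, eps=10.0):
    """Accumulate thresholded depth differences into a single 2D map.

    depth_frames: sequence of depth images (H x W arrays, same shape).
    eps: assumed threshold separating sensor noise from real motion.
    """
    dmm = np.zeros_like(depth_frames[0], dtype=np.float32)
    for prev, curr in zip(depth_frames, depth_frames[1:]):
        diff = np.abs(curr.astype(np.float32) - prev.astype(np.float32))
        dmm += diff * (diff > eps)  # keep only significant depth changes
    return dmm
```

The resulting map summarizes where and how much the scene depth changed over the clip, so standard image descriptors such as HOG can be applied to it as if it were a single image.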
| Method | Year | Summary | Datasets |
|---|---|---|---|
| Karpathy et al. [110] | 2014 | Use of different connectivity patterns for CNNs: early fusion, late fusion and slow fusion. | Sports-1M [110], UCF101 [89] |
| Simonyan et al. [117] | 2014 | Use of a two-stream CNN architecture, incorporating spatial and temporal networks. | UCF101 [89], HMDB51 [88] |
| Donahue et al. [121] | 2015 | Use of a long-term recurrent convolutional network (LRCN) to learn compositional representations in space and time. | UCF101 [89] |
| Wang et al. [123] | 2015 | Use of very deep two-stream ConvNets, using stacked optical flow for the temporal network and a single frame image for the spatial network. | UCF101 [89] |
| Wang et al. [126] | 2015 | Use of the trajectory-pooled deep-convolutional descriptor (TDD). | UCF101 [89], HMDB51 [88] |
| Tran et al. [127] | 2015 | Use of deep 3D convolutional networks, which are better suited for spatio-temporal feature learning. | UCF101 [89] |
| Feichtenhofer et al. [129] | 2016 | Use of a two-stream architecture that associates the spatial feature maps of a particular region with the temporal feature maps of that region, fusing the networks at an early level. | UCF101 [89], HMDB51 [88] |
| Wang et al. [131] | 2016 | Use of a temporal segment network (TSN) to incorporate long-range temporal structure while avoiding overfitting. | UCF101 [89], HMDB51 [88] |
| Bilen et al. [135] | 2016 | Use of image classification CNNs after summarizing each video in a dynamic image. | UCF101 [89], HMDB51 [88] |
| Carreira et al. [138] | 2017 | Use of the two-stream inflated 3D ConvNet (I3D), with a separate 3D network for each stream of a two-stream architecture. | UCF101 [89], HMDB51 [88] |
| Varol et al. [139] | 2018 | Use of space-time CNNs and architectures with long-term temporal convolutions (LTC), using lower spatial resolution and longer clips. | UCF101 [89], HMDB51 [88] |
| Ullah et al. [141] | 2018 | Use of CNNs to reduce complexity and redundancy, and of a deep bidirectional LSTM (DB-LSTM) to learn sequential information among frame features. | UCF101 [89], HMDB51 [88], YouTube Actions [146] |
| Wang et al. [143] | 2018 | Use of discriminative pooling, exploiting the fact that only a few frames provide characteristic information about the action. | HMDB51 [88] |
| Wang et al. [145] | 2018 | Use of ConvNets that admit videos of arbitrary size and length, applying first a spatial temporal pyramid pooling (STPP) layer and then an LSTM (or CNN-E). | UCF101 [89], HMDB51 [88], ACT [147] |
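As a reference point for the architectures in the table above, here is a minimal PyTorch sketch of the two-stream idea of Simonyan and Zisserman [117]: a spatial network classifies a single RGB frame, a temporal network classifies a stack of optical-flow fields (x and y components of several consecutive frames), and the class scores are fused by averaging. The tiny backbone and all layer sizes are placeholders, not the networks used in the paper.

```python
import torch
import torch.nn as nn

def small_convnet(in_channels, n_classes):
    # Placeholder backbone; the original two-stream paper uses much
    # deeper CNNs for both streams.
    return nn.Sequential(
        nn.Conv2d(in_channels, 32, kernel_size=7, stride=2), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(32, n_classes),
    )

class TwoStream(nn.Module):
    def __init__(self, n_classes=101, flow_stack=10):
        super().__init__()
        self.spatial = small_convnet(3, n_classes)                 # one RGB frame
        self.temporal = small_convnet(2 * flow_stack, n_classes)  # stacked x/y flow

    def forward(self, rgb, flow):
        # Late fusion: average the per-stream class probabilities.
        s = torch.softmax(self.spatial(rgb), dim=1)
        t = torch.softmax(self.temporal(flow), dim=1)
        return (s + t) / 2

# Usage sketch: a batch of 2 clips, 224 x 224, with 10 stacked flow fields.
model = TwoStream()
scores = model(torch.randn(2, 3, 224, 224), torch.randn(2, 20, 224, 224))
```

Averaging softmax scores corresponds to the simpler of the two fusion schemes reported in [117]; the alternative trains an SVM on the stacked softmax outputs (the "Two-stream (SVM)" row in the results table below).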
| Approach | Advantages | Disadvantages |
|---|---|---|
| Hand-crafted motion features | No need for a large amount of training data. The model is simple and unambiguous, and its functions are easy to analyze and visualize. The features used to train the model are explicitly known. | The features are usually not robust. They can be computationally expensive due to their high dimensionality. Their discriminative power is usually low. |
| Depth information | The 3D structure information provided by depth sensors is used to recover postures and recognize activities. Skeletons extracted from depth maps are precise. Depth sensors can work in darkness. | Depth maps have no texture, making it difficult to apply local differential operators. Global features can be unstable because depth maps may contain occlusions. |
| Deep learning | No expert knowledge is needed to obtain suitable features, reducing the feature-engineering effort. Instead of being designed manually, features are learned automatically by the network. Deep neural networks extract high-level representations in their deeper layers, making them more suitable for complex tasks. | Massive amounts of data need to be collected, and suitable datasets are consequently scarce. Training is time-consuming. The generalization capability of the models can be a problem. |
| Dataset | # Classes | # Videos | # Actors | Resolution | Year |
|---|---|---|---|---|---|
| Weizmann | 10 | 90 | 9 | 180 × 144 | 2005 |
| MSRAction3D | 20 | 420 | 7 | 640 × 480 | 2010 |
| HMDB51 | 51 | 6849 | - | 320 × 240 | 2011 |
| UCF50 | 50 | 6676 | - | - | 2012 |
| UCF101 | 101 | 13,320 | - | 320 × 240 | 2012 |
| Sports-1M | 487 | 1,133,158 | - | - | 2014 |
| ActivityNet | 203 | 27,801 | - | 1280 × 720 | 2015 |
| Something Something | 174 | 220,847 | - | (variable width) × 240 | 2017 |
| AVA | 80 | 430 | - | - | 2018 |
| Approach | Method | UCF101 | HMDB51 | Weizmann |
|---|---|---|---|---|
| Hand-crafted | Hierarchical [55] | - | - | 72.8% |
| | Far Field of View [63] | - | - | 100% |
| | HOOF NLDS [46] | - | - | 94.4% |
| | Direction HOF [45] | - | - | 79.17% |
| | iDT [76] | - | 57.2% | - |
| | iDT+FV [76] | 85.9% | 57.2% | - |
| | OF Based [81] | - | - | 90.32% |
| | Edges OF [15] | - | - | 95.69% |
| | HOG features [87] | - | - | 99.7% |
| Deep learning | Slow Fusion CNN [110] | 65.4% | - | - |
| | Two-stream (avg) [117] | 86.9% | 58.0% | - |
| | Two-stream (SVM) [117] | 88.0% | 59.4% | - |
| | iDT+MIFS [162] | 89.1% | 65.1% | - |
| | LRCN (RGB) [121] | 68.2% | - | - |
| | LRCN (FLOW) [121] | 77.28% | - | - |
| | LRCN (avg, 1/2–1/2) [121] | 80.9% | - | - |
| | LRCN (avg, 1/3–2/3) [121] | 82.34% | - | - |
| | Very deep two-stream (VGGNet-16) [123] | 91.4% | - | - |
| | TDD [126] | 90.3% | 63.2% | - |
| | TDD + iDT [126] | 91.5% | 65.9% | - |
| | C3D [127] | 85.2% | - | - |
| | C3D + iDT [127] | 90.4% | - | - |
| | Two-stream fusion [129] | 92.5% | 65.4% | - |
| | Two-stream fusion + iDT [129] | 93.5% | 69.2% | - |
| | TSN (RGB+FLOW) [131] | 94.0% | 68.5% | - |
| | TSN (RGB+FLOW+WF) [131] | 94.2% | 69.4% | - |
| | Dynamic images + iDT [135] | 89.1% | 65.2% | - |
| | Two-stream I3D [138] | 93.4% | 66.4% | - |
| | Two-stream I3D, pre-trained [138] | 97.9% | 80.2% | - |
| | LTC (RGB) [139] | 82.4% | - | - |
| | LTC (FLOW) [139] | 85.2% | 59.0% | - |
| | LTC (FLOW+RGB) [139] | 91.7% | 64.8% | - |
| | LTC (FLOW+RGB) + iDT [139] | 92.7% | 67.2% | - |
| | DB-LSTM [141] | 91.21% | 87.64% | - |
| | Two-stream SVMP (VGGNet) [143] | - | 66.1% | - |
| | Two-stream SVMP (ResNet) [143] | - | 71.0% | - |
| | Two-stream SVMP (+iDT) [143] | - | 72.6% | - |
| | Two-stream SVMP (I3D conf) [143] | - | 83.1% | - |
| | STPP + CNN-E (RGB) [145] | 85.6% | 62.1% | - |
| | STPP + LSTM (RGB) [145] | 85.0% | 62.5% | - |
| | STPP + CNN-E (FLOW) [145] | 83.2% | 55.4% | - |
| | STPP + LSTM (FLOW) [145] | 83.8% | 54.7% | - |
| | STPP + CNN-E (RGB+FLOW) [145] | 92.4% | 70.5% | - |
| | STPP + LSTM (RGB+FLOW) [145] | 92.6% | 70.3% | - |
| Approach | Year | Paper | Code |
|---|---|---|---|
| Deep learning | 2018 | Video representation learning using discriminative pooling [143] | SVMP: https://github.com/3xWangDot/SVMP |
| Deep learning | 2018 | Action Recognition in Video Sequences using Deep Bi-Directional LSTM With CNN Features [141] | Bi-directional LSTM: https://github.com/Aminullah6264/BidirectionalLSTM |
| Deep learning | 2018 | Long-term temporal convolutions for action recognition [139] | LTC: https://github.com/gulvarol/ltc |
| Deep learning | 2017 | Quo vadis, action recognition? A new model and the Kinetics dataset [138] | Two-stream I3D: https://github.com/deepmind/kinetics-i3d |
| Deep learning | 2016 | Dynamic image networks for action recognition [135] | Dynamic images: https://github.com/hbilen/dynamic-image-nets |
| Deep learning | 2016 | Temporal segment networks: Towards good practices for deep action recognition [131] | TSN: https://github.com/yjxiong/temporal-segment-networks |
| Deep learning | 2016 | Convolutional two-stream network fusion for video action recognition [129] | Two-stream fusion: https://github.com/feichtenhofer/twostreamfusion |
| Deep learning | 2015 | Learning spatiotemporal features with 3D convolutional networks [127] | C3D: https://github.com/facebook/C3D |
| Deep learning | 2015 | Action recognition with trajectory-pooled deep-convolutional descriptors [126] | TDD: https://github.com/wanglimin/tdd/ |
| Deep learning | 2015 | Towards good practices for very deep two-stream convNets [123] | Very deep two-stream ConvNets: https://github.com/yjxiong/caffe/tree/action_recog |
| Depth information | 2013 | HON4D: Histogram of oriented 4D normals for activity recognition from depth sequences [99] | HON4D: http://www.cs.ucf.edu/~oreifej/HON4D.html |
| Hand-crafted motion features | 2013 | Action Recognition with Improved Trajectories [76] | Improved trajectories: http://lear.inrialpes.fr/~wang/improved_trajectories |