
2015 IEEE International Conference on Signal and Image Processing Applications (ICSIPA)

Action Recognition in Low Quality Videos by Jointly Using Shape, Motion and Texture Features

Saimunur Rahman∗, John See∗†, Chiung Ching Ho∗‡
∗Centre of Visual Computing, Faculty of Computing and Informatics, Multimedia University, Cyberjaya 63100, Selangor, Malaysia
Email: saimunur.rahman14@student.mmu.edu.my
†Email: johnsee@mmu.edu.my
‡Email: ccho@mmu.edu.my

Abstract—Shape, motion and texture features have recently gained much popularity in their use for human action recognition. While many of these descriptors have been shown to work well against challenging variations such as appearance, pose and illumination, the problem of low video quality is relatively unexplored. In this paper, we propose a new idea of jointly employing these three features within a standard bag-of-features framework to recognize actions in low quality videos. The performance of these features was extensively evaluated and analyzed under three spatial downsampling and three temporal downsampling modes. Experiments conducted on the KTH and Weizmann datasets with several combinations of features and settings showed the importance of all three features (HOG, HOF, LBP-TOP), and how low quality videos can benefit from the robustness of textural features.

I. INTRODUCTION

Human action recognition in video is an active area of research in computer vision, with many applications in various fields including video surveillance, content-based video archiving and browsing, and human-computer interaction. Actions in video undergo a wide range of variations such as size, appearance and view pose, while more challenging problems such as occlusion, illumination change, shadow and camera motion remain difficult and are actively studied today. One relatively under-studied problem is the quality of videos. Current research on video has focused on high-definition videos that offer tremendous detail and strong signal fidelity. However, most of these videos are not feasible for real-time video processing, streaming data and mobile applications, particularly when additional processing is required for the recognition of actions in video.

Visual recognition approaches for images have recently been extended for use in video sequences, with good measures of success. In particular, bag-of-features (or bag-of-visual-words) based methods have shown excellent results for action recognition [1]–[3]. Despite recent developments, the representation of local regions in videos is still an open field of research. For the representation of videos, different spatio-temporal features have been considered in the literature. Many popular works [1], [4], [5] prefer utilizing gradient and flow information to describe the shape and motion that lie in the video. The use of textures is less common [6], [7], though there are promising benefits that can be leveraged. Oh et al. [8], in establishing the recent large-scale VIRAT dataset for continuous surveillance, provided nine different downsampled versions of the data in the initial version (as of today, these downsampled versions are no longer available in the current VIRAT version 2.0; website: http://www.viratdata.org/), consisting of three spatial scales and three temporal frame rates. The authors note that this is a "relatively unexplored area" and that "it is important to understand how existing approaches will behave differently". Motivated by the known merits of different features and the lack of work on low quality videos, we aim to investigate and present viable approaches to this problem.
In this paper, we propose the joint utilization of shape, motion and texture features for the robust recognition of human actions in low quality, downsampled videos. This representation integrates these well-established feature methods in a new way that alleviates their individual shortcomings. We also investigate and analyze how action recognition performance reacts under two low quality conditions – spatial downsampling and temporal downsampling. We conduct an extensive set of experiments on two benchmark action datasets, KTH and Weizmann, both of which are already low in frame resolution in their original form. Finally, the viability of our proposed approach is further analyzed, providing insights into good combinations of features and the importance of using kernels to provide a balanced set of features that fit well to the data.

A. Related Work

Human action recognition has been studied extensively in recent years [9]. From recent research in activity recognition, spatio-temporal video features can be roughly categorized into three main categories based on the nature of the feature used for classification: dynamic (motion), structural (shape) and textural features, or an implicit or explicit combination of the three. Most recent works employ primarily motion and shape features [3]. Laptev [10] first proposed the extraction of shape (HOG) and motion (HOF) information from spatio-temporal interest points (STIP) to classify human actions in video. More recently, Wang et al. [5] proposed the use of dense trajectories with the same way of encoding the shape and motion information. All these methods appear to suggest that the combination of shape and motion features performs better than using either alone.

Spatio-temporal texture features such as LBP-TOP [11] have also found their way into action recognition. Kellokumpu et al. [6] proposed the use of the LBP-TOP descriptor to recognize human actions by applying it on the entire bounding volume area. Mattivi and Shao [7] applied LBP-TOP over small video patches called cuboids, extracted from each interest point, resulting in a sparser representation of video sequences. Their approach managed a promising accuracy rate of around 91% on the KTH dataset.

II. SPATIO-TEMPORAL VIDEO FEATURES

In the following sections we describe the three types of spatio-temporal features that can be extracted from action videos, namely structural (shape), dynamic (motion) and textural (texture) features. As structural and dynamic features are somewhat related, we describe them together in Section II-A, while the textural feature is elaborated in Section II-B.

A. Structural and Dynamic Features

Generally speaking, structural information in video embodies the geometrical or shape-oriented variations found spatially, while dynamic information carries important temporal information, i.e. changes of its structure across time. These two forms of information are typically taken together to exemplify spatio-temporal information in video. For each given sample point (x, y, t, σ, τ), a feature descriptor is computed for a 3-D video patch centered at (x, y, t) at spatial and temporal scales σ, τ.
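To make the role of the scale parameters concrete, the following minimal Python sketch (a hypothetical helper, not the authors' implementation) crops such a 3-D patch around a point (x, y, t) using the support sizes quoted later in this section (∆x(σ) = ∆y(σ) = 18σ, ∆t(τ) = 8τ) and splits it into the nx × ny × nt cell grid over which the HOG/HOF histograms are accumulated; the simple boundary clipping is an assumption of this sketch.

```python
import numpy as np

def spacetime_patch(video, x, y, t, sigma, tau, nx=3, ny=3, nt=2):
    """Crop a 3-D patch around (x, y, t) and split it into an nx x ny x nt cell grid.

    video : ndarray of shape (T, H, W), grayscale frames.
    Support sizes follow the paper: dx = dy = 18*sigma, dt = 8*tau.
    Boundary handling (plain clipping here) is an assumption of this sketch.
    """
    T, H, W = video.shape
    dx = dy = int(round(18 * sigma))
    dt = int(round(8 * tau))

    # Clip the cuboid to the video volume.
    x0, x1 = max(0, x - dx // 2), min(W, x + dx // 2)
    y0, y1 = max(0, y - dy // 2), min(H, y + dy // 2)
    t0, t1 = max(0, t - dt // 2), min(T, t + dt // 2)
    patch = video[t0:t1, y0:y1, x0:x1]

    # Split the patch into cells; per-cell HOG/HOF histograms would be
    # computed and concatenated into the final descriptor.
    cells = [c for slab in np.array_split(patch, nt, axis=0)
               for strip in np.array_split(slab, ny, axis=1)
               for c in np.array_split(strip, nx, axis=2)]
    return patch, cells

# Example: a patch at spatial scale sigma = 2, temporal scale tau = 2
video = np.random.rand(100, 120, 160).astype(np.float32)
patch, cells = spacetime_patch(video, x=80, y=60, t=50, sigma=2.0, tau=2.0)
print(patch.shape, len(cells))  # (16, 36, 36) 18
```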
In this work, we employ the Harris3D detector (a space-time extension of the popular Harris detector [12]) to obtain spatio-temporal interest points (STIP) [10]. Briefly, a spatio-temporal second-moment matrix is computed at each video point,

µ(·; σ, τ) = g(·; sσ, sτ) ∗ (∇L(·; σ, τ))(∇L(·; σ, τ))^T,

using a separable Gaussian smoothing function g and space-time gradients ∇L. The final locations of the detected STIPs are given by the local maxima of H = det(µ) − k·trace³(µ) [3]. We used the original implementation available online and standard parameter settings, i.e. k = 0.00005, σ² = {4, 8, 16, 32, 64, 128} and τ² = {2, 4}, for the original videos and a majority of the downsampled videos. Figure 1 shows the Harris3D detector being used to extract STIPs on the KTH dataset.

Fig. 1. Harris3D feature detector on the KTH dataset.

To characterize the shape and motion information accumulated in space-time neighborhoods of the detected STIPs, we applied the Histogram of Gradient (HOG) and Histogram of Optical Flow (HOF) descriptors as proposed by Laptev in [10]. The combination of HOG/HOF descriptors with interest point detectors produces descriptors of size ∆x(σ) = ∆y(σ) = 18σ, ∆t(τ) = 8τ. Each volume is subdivided into an nx × ny × nt grid of cells; for each cell, 4-bin histograms of gradient orientations (HOG) and 5-bin histograms of optical flow (HOF) are computed [3]. In this experiment we opted for grid parameters nx, ny = 3, nt = 2 for all videos, as suggested in the original paper.

B. Textural Features

Textures are defined as statistical regularities over both space and time, e.g. the motion of birds in a flock, and have recently been used for action recognition with good results [7]. One of the most widely-used texture descriptors, the Local Binary Pattern (LBP), produces a binary code at each pixel location by thresholding the pixels within a circular neighborhood region against its center pixel [13]. The LBP_{P,R} operator produces 2^P different output values, corresponding to the 2^P different binary patterns that can be formed by the P pixels in the neighborhood set. After computing these LBP patterns for the whole image, an occurrence histogram is constructed to provide a statistical description of the distribution of local textural patterns in the image. This descriptor has proved to be successful in face recognition [14].

In order to be applicable in the context of dynamic textures such as facial expressions, Zhao et al. [11] proposed LBP on Three Orthogonal Planes (LBP-TOP), where LBP is performed on the three orthogonal planes (XY, XT, YT) of the video volume and their respective occurrence histograms are concatenated into a single histogram. LBP-TOP is formally expressed as LBP-TOP_{P_XY,P_XT,P_YT,R_X,R_Y,R_Z}, where the subscripts denote a neighborhood of P points equally sampled on a circle of radius R on the XY, XT and YT planes respectively. The resulting feature vector is 3 · 2^P in length. Fig. 2 illustrates the construction of the LBP-TOP descriptor. As can be seen, LBP-TOP encodes the appearance and motion along three directions, incorporating spatial information in XY-LBP and spatio-temporal co-occurrence statistics in XT-LBP and YT-LBP. In this experiment we apply the parameter settings LBP-TOP_{8,8,8,2,2,2} with non-uniform patterns, as specified by Mattivi and Shao [7], which produces a feature vector of length 768.

Fig. 2. LBP-TOP feature descriptor. Image from [6].
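As a rough illustration of how the LBP-TOP histogram is assembled, the sketch below reuses scikit-image's 2-D LBP on every XY, XT and YT slice of a video volume and concatenates the per-plane histograms. It is a simplified stand-in rather than the implementation used here: a single radius R is applied to all three planes, whereas the formal operator above allows separate radii R_X, R_Y, R_Z.

```python
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_top(video, P=8, R=2):
    """Concatenate LBP occurrence histograms from the XY, XT and YT planes.

    video : ndarray of shape (T, H, W), grayscale frames.
    With P = 8 and the 'default' (non-uniform) mapping, each plane yields a
    2**P = 256-bin histogram, so the descriptor is 3 * 256 = 768-dimensional,
    matching the LBP-TOP_{8,8,8,2,2,2} setting quoted above.
    """
    T, H, W = video.shape
    n_bins = 2 ** P
    hist = np.zeros((3, n_bins))

    # XY planes: one LBP image per frame.
    for t in range(T):
        codes = local_binary_pattern(video[t], P, R, method="default")
        hist[0] += np.bincount(codes.astype(int).ravel(), minlength=n_bins)
    # XT planes: one slice per row y.
    for y in range(H):
        codes = local_binary_pattern(video[:, y, :], P, R, method="default")
        hist[1] += np.bincount(codes.astype(int).ravel(), minlength=n_bins)
    # YT planes: one slice per column x.
    for x in range(W):
        codes = local_binary_pattern(video[:, :, x], P, R, method="default")
        hist[2] += np.bincount(codes.astype(int).ravel(), minlength=n_bins)

    # Normalize each plane's histogram before concatenating.
    hist /= hist.sum(axis=1, keepdims=True)
    return hist.ravel()  # length 3 * 2**P

video = (np.random.rand(60, 120, 160) * 255).astype(np.uint8)
descriptor = lbp_top(video)
print(descriptor.shape)  # (768,)
```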
III. VIDEO DOWNSAMPLING

A video's spatial resolution and temporal sampling rate define the amount of spatial and temporal information it can convey. Spatial resolution is simply the video's horizontal pixel count by its vertical pixel count, i.e. the frame size. The temporal sampling rate defines the number of discrete frames in a unit of time, i.e. frames per second (fps) or Hertz (Hz). In this work, we investigate the performance of action recognition on low quality videos that have been downsampled spatially or temporally, proposing suitable features that are robust. We first describe the spatial and temporal downsampling modes that were employed in this work.

A. Spatial Downsampling

Spatial downsampling produces an output video with a smaller resolution than the original video. In the process, no additional data compression is applied and the frame rate remains the same. For clarity, we define a spatial downsampling factor α which indicates the factor by which the original spatial resolution is reduced. In this work, we fixed α = {2, 3, 4} for modes SDα, denoting that the original videos are downsampled to half, a third and a fourth of their original resolution respectively. Fig. 3 shows a sample video frame that undergoes SD2, SD3 and SD4. We opted not to go beyond α = 4 as the extracted features become too few and sparse to provide any meaningful representation.

Fig. 3. Spatially downsampled videos: (a) Original (SD1); (b) SD2; (c) SD3; (d) SD4.

B. Temporal Downsampling

Temporal downsampling produces an output video with a smaller temporal sampling rate (or frame rate) than the original video. In the process, the video frame resolution remains the same. Likewise, we define a temporal downsampling factor β which indicates the factor by which the original frame rate is reduced. It has been observed that high temporal resolution, together with high spatial resolution, yields a high dynamic range, i.e. rich motion information. This is based on the assumption that non-constant frame intervals would yield jerky motion, i.e. perceivable discontinuity in the optical flow field. This assumption holds for the majority of video sequences containing motion that are captured at a frame rate of 30 fps or less. Low quality videos usually exhibit this kind of motion discontinuity. In this work, we use values of β = {2, 3, 4} for modes TDβ, denoting that the original videos are downsampled to half, a third and a fourth of their original frame rate respectively. In the case of videos with slow frame rates or short video lengths (such as in the Weizmann dataset [15]), β may only take on a smaller range of values in order to extract sufficient features for representation.

Fig. 4. Temporally downsampled videos: (a) Original video; (b) TD2; (c) TD3.
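A minimal sketch of the two downsampling modes on an in-memory stack of frames is given below (hypothetical helpers; the paper does not state the resampling filter, so area interpolation is assumed for the spatial case):

```python
import cv2
import numpy as np

def spatial_downsample(frames, alpha):
    """SD_alpha: shrink each frame to 1/alpha of its width and height."""
    h, w = frames[0].shape[:2]
    new_size = (w // alpha, h // alpha)  # (width, height) order for cv2.resize
    return [cv2.resize(f, new_size, interpolation=cv2.INTER_AREA) for f in frames]

def temporal_downsample(frames, beta):
    """TD_beta: keep every beta-th frame, reducing the frame rate by a factor of beta."""
    return frames[::beta]

# Example on a synthetic 25 fps, 160x120 clip of 10 seconds
frames = [np.random.randint(0, 256, (120, 160), dtype=np.uint8) for _ in range(250)]
sd2 = spatial_downsample(frames, alpha=2)   # 80x60 frames, same frame count
td4 = temporal_downsample(frames, beta=4)   # 160x120 frames at 6.25 fps
print(sd2[0].shape, len(td4))               # (60, 80) 63
```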
IV. EXPERIMENTS

In this section, we describe a set of extensive experiments and their respective results, while analyzing and comparing the different combinations of feature descriptors discussed earlier. Experiments were conducted separately for spatial downsampling and temporal downsampling to demonstrate the strengths of specific features with respect to each condition. We also provide a detailed elaboration of the evaluation framework and settings used for the different experimented datasets.

A. Datasets

We have conducted our experiments on two notable action recognition datasets – the KTH actions dataset [16] and the Weizmann dataset [15]. Both datasets are similar in that they are captured in a controlled environment with a homogeneously uniform background.

KTH is the most popular dataset in the literature for human action recognition. It contains 6 action classes: walking, running, jogging, hand-waving, hand-clapping and boxing, performed by 25 actors in 4 different scenarios: outdoors, outdoors with scale variation, outdoors with different clothes, and indoors. There are 599 video samples in total (one clip is missing for one subject). Each clip is sampled at 25 fps and lasts between 10–15 seconds with a frame resolution of 160 × 120 pixels. We follow the original experimental setup, i.e., we divide the samples into a test set (9 subjects: 2, 3, 5, 6, 7, 8, 9, 10, and 22) and a training set (the remaining 16 subjects) [16], while reporting the average accuracy over all classes as the performance measure.

The Weizmann dataset was introduced by Blank et al. [15]. It contains 93 video clips from 9 different subjects (3 subjects have one extra clip), with each video clip containing one subject performing a single action. There are 10 different action categories: walking, running, jumping, gallop sideways, bending, one-hand waving, two-hands waving, jumping in place, jumping jack, and skipping. Each clip lasts about 2–3 seconds at 25 fps (interlaced) with a frame resolution of 180 × 144 pixels. Testing is performed by leave-one-person-out cross-validation (as suggested in [4]), i.e., for each fold, training is done on 8 subjects and testing on all videos of the remaining held-out subject.

B. Evaluation Framework

A video sequence is represented as a bag of local spatio-temporal features [16]. Spatio-temporal features are first quantized into visual words and a video is then represented as the frequency histogram over the visual words. In our experiments, vocabularies are constructed with standard k-means clustering, with the number of visual words empirically set to K = 2000 to obtain reasonably good performance across datasets. To limit the complexity, we cluster a subset of 100,000 randomly selected training features. To increase precision, we initialize k-means 8 times and keep the result with the lowest error. Features are assigned to their closest vocabulary word using the Euclidean distance. The resulting histograms of visual word occurrences are used as video sequence representations.

For classification, we use a non-linear support vector machine (SVM) [17] with a χ²-kernel,

K(H_i, H_j) = exp(−(1/2A) Σ_{n=1}^{K} (h_in − h_jn)² / (h_in + h_jn)),

which was previously found to be effective for action recognition [1]. Here, h_in and h_jn are the frequencies of the n-th word occurrences in histograms H_i and H_j, K is the vocabulary size, and A is the mean value of distances between all training samples [18]. In some parts of our experiments, we also tested with a linear kernel instead of the χ² kernel, which is known to occasionally over-fit the feature data at higher dimensionality. For multi-class classification, we apply the one-against-rest approach and select the class with the highest score.
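The evaluation pipeline described above can be sketched as follows. This is an illustrative outline using scikit-learn rather than the toolchain of the paper (which relies on VLFeat [17]); the per-video descriptor matrices and labels are hypothetical inputs.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import additive_chi2_kernel, chi2_kernel
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

K = 2000  # vocabulary size used in the paper

def build_vocabulary(train_descriptors, k=K, max_samples=100_000):
    """k-means vocabulary over a random subset of training descriptors, 8 restarts."""
    idx = np.random.choice(len(train_descriptors),
                           min(max_samples, len(train_descriptors)), replace=False)
    return KMeans(n_clusters=k, n_init=8).fit(train_descriptors[idx])

def bof_histogram(descriptors, vocab):
    """Assign each descriptor to its nearest visual word (Euclidean) and histogram it."""
    words = vocab.predict(descriptors)
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

def train_and_score(train_feats, train_labels, test_feats, test_labels):
    """train_feats / test_feats: lists of per-video descriptor matrices (e.g. HOG+HOF)."""
    vocab = build_vocabulary(np.vstack(train_feats))
    Xtr = np.array([bof_histogram(f, vocab) for f in train_feats])
    Xte = np.array([bof_histogram(f, vocab) for f in test_feats])

    # chi-square kernel K(Hi,Hj) = exp(-(1/2A) * sum_n (hin-hjn)^2/(hin+hjn)),
    # with A taken as the mean chi-square distance between training histograms.
    A = -additive_chi2_kernel(Xtr, Xtr).mean()
    gamma = 1.0 / (2.0 * A)

    # One-against-rest SVM on the precomputed kernel matrices.
    svm = OneVsRestClassifier(SVC(kernel="precomputed"))
    svm.fit(chi2_kernel(Xtr, Xtr, gamma=gamma), train_labels)
    pred = svm.predict(chi2_kernel(Xte, Xtr, gamma=gamma))
    return float((pred == np.array(test_labels)).mean())
```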
C. Experimental Results

In this subsection we present the experimental results in three parts, based on the original videos, the spatially downsampled videos and the temporally downsampled videos. For each part, we systematically compare and analyze the performance of the different feature descriptors, providing further insights into the intuition behind the different feature types. Experiments were conducted on an Intel Core-i7 3.6 GHz machine with 24 GB RAM. For ease of reporting, we will compare the following combinations of features and settings in all experiments, denoted as follows: I: STIP; II: STIP-χ²; III: STIP + LBP-TOP; IV: STIP + LBP-TOP-χ²; V: (STIP + LBP-TOP)-χ². The HOG, HOF and HOG+HOF descriptors will be used on the extracted STIPs, while LBP-TOP is applied on the entire video volume. For features III, IV and V, the STIP-based descriptors are concatenated with LBP-TOP at the histogram level.

1) Experiments on Original Videos: On the KTH dataset, we obtained the best result of 94.91% using the combination of HOG and HOF features (HOG+HOF) (see Figure 5), which constitutes a histogram-level concatenation of HOG and HOF, as opposed to the descriptor-level concatenation (HOGHOF) advocated in [1], [3]. Figure 7 shows that this clearly helps to elevate the overall accuracy by 3–8%. However, on the Weizmann dataset (see Figure 6), we observe that there is less distinction between the three tested features, with HOF holding a slight advantage in terms of performance. The best result of 94.44% was achieved using the HOF feature.

Fig. 5. Recognition rate of different combinations of features on the original KTH dataset videos.

Fig. 6. Recognition rate of different combinations of features on the original Weizmann dataset videos.

For both datasets, we also observed that the kernelization of specific features is able to strengthen results. For instance on KTH, HOF + LBP-TOP, with an already impressive 93.06% accuracy, climbs to 94.44% after kernelizing the LBP-TOP features. This is most apparent when the LBP-TOP features are kernelized (see Figure 8). The other features in consideration show a similar characteristic, except for HOF, for which the difference is negligible. In short, the dynamic feature (HOF) is notably essential for effective action recognition on the original video samples. The shape feature (HOG) is largely poor in all combinations, but improves tremendously when paired with the textural feature (LBP-TOP).

2) Experiments on Spatially Downsampled Videos: Table I shows the recognition rate of the five descriptor combinations (I–V) with different STIP descriptors on the KTH dataset. Overall, the combinations of STIP descriptors + kernelized LBP-TOP appear to dominate the best results within each mode. This clearly shows the important role of motion and textural information with respect to the deterioration of spatial quality. As expected, shape information becomes less discriminant as the spatial resolution decreases. More promisingly, LBP-TOP contributes significantly more (comparing IV to I and II) as the resolution quality decreases. Nevertheless, when used entirely alone, it performed well on the Weizmann dataset but not on the KTH dataset (see Figures 9 and 10). Combinations IV and V are the two most robust methods, where the STIP descriptors (particularly the HOF feature) are combined with LBP-TOP to great effect; the kernelized LBP-TOP achieves an 87.5% accuracy rate at α = 4. STIPs were extracted with k = 0.0001, 0.000075 and 0.00005 for SD2, SD3 and SD4 respectively to ensure the maximum number of interest points with respect to the spatial size.

3) Experiments on Temporally Downsampled Videos: Both the KTH and Weizmann datasets have a frame rate of 25 fps; upon downsampling, TD2: 12.5 fps, TD3: 8.33 fps and TD4: 6.25 fps. Table II summarizes the recognition rate of the five descriptor combinations (I–V) with different STIP descriptors on the KTH dataset.
Similarly, we see a strong showing when LBP-TOP is incorporated, with around a 3–6% improvement in accuracy. Again, this demonstrates the importance of textural information when the temporal sampling rate is poor. On the KTH dataset, the use of all three features (shape, motion and texture) promotes robustness against the deterioration of temporal quality (Figure 11). Method IV commands a respectable 82.41% accuracy rate at β = 4. It is also worth mentioning that shape information becomes increasingly useful as the frame rate is reduced (particularly for TD4 on KTH), since dynamic information becomes sparser and more disjointed. Figures 11 and 12 show the performance of selected feature combinations across the different downsampling modes on the KTH and Weizmann datasets respectively.

Fig. 7. Comparison between the recognition performance of HOG+HOF (histogram-level concatenation) and HOGHOF (descriptor-level concatenation).

Fig. 8. Recognition accuracy with and without the χ²-kernel on the original KTH videos.

TABLE I. RECOGNITION RATE (%) OF VARIOUS DESCRIPTOR COMBINATIONS FOR SPATIALLY DOWNSAMPLED KTH VIDEOS

Mode  Combination  HOG    HOF    HOG+HOF
SD2   I            68.06  91.67  94.91
SD2   II           77.22  92.13  92.13
SD2   III          67.59  92.59  93.52
SD2   IV           81.48  93.06  94.44
SD2   V            75.46  90.74  91.67
SD3   I            62.50  87.04  87.50
SD3   II           62.50  85.65  85.19
SD3   III          62.50  87.04  87.50
SD3   IV           77.31  88.43  89.81
SD3   V            71.76  86.57  85.19
SD4   I            56.94  81.94  81.20
SD4   II           57.94  80.56  82.87
SD4   III          62.50  82.87  81.02
SD4   IV           78.24  87.50  86.11
SD4   V            69.44  83.80  84.26

Fig. 9. Performance of selected combinations of features across spatial downsampling modes for the KTH dataset.

Fig. 10. Performance of selected combinations of features across spatial downsampling modes for the Weizmann dataset.

TABLE II. RECOGNITION RATE (%) OF VARIOUS DESCRIPTOR COMBINATIONS FOR TEMPORALLY DOWNSAMPLED KTH VIDEOS

Mode  Combination  HOG    HOF    HOG+HOF
TD2   I            76.39  87.04  91.20
TD2   II           80.56  86.11  89.81
TD2   III          75.00  88.89  91.20
TD2   IV           80.09  89.81  92.59
TD2   V            79.17  87.04  91.20
TD3   I            68.06  76.85  82.41
TD3   II           75.46  77.31  84.26
TD3   III          74.07  78.24  86.11
TD3   IV           75.46  82.87  85.19
TD3   V            73.15  79.63  82.87
TD4   I            66.67  71.76  82.41
TD4   II           73.15  73.15  81.94
TD4   III          69.44  73.61  77.78
TD4   IV           74.04  75.46  82.41
TD4   V            72.69  69.44  81.48

Fig. 11. Performance of selected combinations of features across temporal downsampling modes for the KTH dataset.

Fig. 12. Performance of selected combinations of features across temporal downsampling modes for the Weizmann dataset.

4) Future Directions: Based on this preliminary work and the analysis of the results obtained, there are several possible directions for future work. We intend to extend our evaluation to videos from more complex and uncontrolled environments [1], [8].
While our experiments already point towards the sensitivity of different features (shape information is sensitive to resolution, motion information is sensitive to the sampling/frame rate), it will be interesting to investigate the simultaneous effects of both spatial and temporal downsampling. How well can textural features prop up the recognition capability? Also, the use of LBP-TOP in this work merely illustrates the potential benefits of spatio-temporal texture descriptors in general. We intend to explore other spatio-temporal textural features that might exhibit more robustness towards video quality.

V. CONCLUSION

In this paper, we explore a new notion of jointly using shape, motion and texture features for action recognition in low quality videos. To the best of our knowledge, there are no existing systematic attempts to investigate the problem of video quality, which is most relevant in many consumer applications and real-life scenarios. This preliminary work draws interesting conclusions on how spatially and temporally downsampled videos can particularly benefit from textural information, considering that the most common approaches involve only structural and dynamic information. The combined usage of all three features (HOG+HOF+LBP-TOP) outperforms the other competing methods across a majority of cases. Our best method is able to limit the drop in accuracy to around 8–10% when the video resolutions and frame rates deteriorate to a fourth of their original values.

VI. ACKNOWLEDGMENT

This work is supported, in part, by the Ministry of Education, Malaysia under the Fundamental Research Grant Scheme (FRGS) project FRGS/2/2013/ICT07/MMU/03/4.

REFERENCES

[1] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld, "Learning realistic human actions from movies," in IEEE CVPR, 2008, pp. 1–8.
[2] J. C. Niebles, H. Wang, and L. Fei-Fei, "Unsupervised learning of human action categories using spatial-temporal words," IJCV, vol. 79, no. 3, pp. 299–318, 2008.
[3] H. Wang, M. M. Ullah, A. Klaser, I. Laptev, and C. Schmid, "Evaluation of local spatio-temporal features for action recognition," in BMVC, 2009, pp. 124.1–124.11.
[4] P. Scovanner, S. Ali, and M. Shah, "A 3-dimensional SIFT descriptor and its application to action recognition," in Proc. of the 15th Int. Conf. on Multimedia. ACM, 2007, pp. 357–360.
[5] H. Wang, A. Klaser, C. Schmid, and C.-L. Liu, "Action recognition by dense trajectories," in IEEE CVPR, 2011, pp. 3169–3176.
[6] V. Kellokumpu, G. Zhao, and M. Pietikäinen, "Human activity recognition using a dynamic texture based method," in BMVC, 2008.
[7] R. Mattivi and L. Shao, "Human action recognition using LBP-TOP as sparse spatio-temporal feature descriptor," in Computer Analysis of Images and Patterns. Springer, 2009, pp. 740–747.
[8] S. Oh, A. Hoogs, A. Perera, N. Cuntoor, C.-C. Chen et al., "A large-scale benchmark dataset for event recognition in surveillance video," in IEEE CVPR, 2011, pp. 3153–3160.
[9] J. K. Aggarwal and M. S. Ryoo, "Human activity analysis: A review," ACM Computing Surveys (CSUR), vol. 43, no. 3, p. 16, 2011.
[10] I. Laptev, "On space-time interest points," International Journal of Computer Vision, vol. 64, no. 2-3, pp. 107–123, 2005.
[11] G. Zhao and M. Pietikainen, "Dynamic texture recognition using local binary patterns with an application to facial expressions," IEEE Trans. PAMI, vol. 29, no. 6, pp. 915–928, 2007.
[12] C. Harris and M. Stephens, "A combined corner and edge detector," in Proc. of 4th Alvey Vision Conference, vol. 15, 1988, p. 50.
[13] T. Ojala, M. Pietikainen, and T. Maenpaa, "Multiresolution gray-scale and rotation invariant texture classification with local binary patterns," IEEE Trans. PAMI, vol. 24, no. 7, pp. 971–987, 2002.
[14] T. Ahonen, A. Hadid, and M. Pietikainen, "Face description with local binary patterns: Application to face recognition," IEEE Trans. PAMI, vol. 28, no. 12, pp. 2037–2041, 2006.
[15] M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri, "Actions as space-time shapes," in IEEE ICCV, vol. 2, 2005, pp. 1395–1402.
[16] C. Schuldt, I. Laptev, and B. Caputo, "Recognizing human actions: A local SVM approach," in Int. Conf. on Pattern Recognition, vol. 3, 2004, pp. 32–36.
[17] A. Vedaldi and B. Fulkerson, "VLFeat: An open and portable library of computer vision algorithms," in Proc. of the Int. Conf. on Multimedia. ACM, 2010, pp. 1469–1472.
[18] J. Zhang, M. Marszałek, S. Lazebnik, and C. Schmid, "Local features and kernels for classification of texture and object categories: A comprehensive study," IJCV, vol. 73, no. 2, pp. 213–238, 2007.