2011 International Conference on Electrical Engineering and Informatics
17-19 July 2011, Bandung, Indonesia

Human Action Recognition Using Dynamic Time Warping

Samsu Sempena 1, Dr. Nur Ulfa Maulidevi, S.T, M.Sc 2, Peb Ruswono Aryan, M.T 3
Institut Teknologi Bandung, School of Electrical Engineering and Informatics
Jl. Ganesha 10, Bandung, Indonesia
1 samsu@students.itb.ac.id, 2 ulfa@stei.itb.ac.id, 3 peb@stei.itb.ac.id

Abstract— Human action recognition is gaining interest from many computer vision researchers because of its wide variety of potential applications, for instance surveillance, advanced human-computer interaction, content-based video retrieval, and athletic performance analysis. In this research, we focus on recognizing human actions such as waving, punching, and clapping. We choose an exemplar-based, sequential, single-layered approach using Dynamic Time Warping (DTW) because of its robustness against variation in the speed or style with which an action is performed. To improve the recognition rate, we perform body-part tracking with a depth camera to recover the body joints in a 3D real-world coordinate system. We build our feature vector from joint orientations along the time series, which makes it invariant to human body size. Dynamic Time Warping is then applied to the resulting feature vectors. We examine our approach on several actions and confirm through several experiments that the method works well. Further experiments to benchmark the results will be held in the near future.

Keywords— human action recognition, exemplar-based approach, Dynamic Time Warping, depth camera

I. INTRODUCTION

Human action recognition has gained interest from many computer vision researchers over the past two decades. It is motivated by a wide variety of potential applications, such as surveillance in public spaces to detect abnormal or suspicious activities, medical monitoring of children or elderly persons, athletic performance monitoring, context-aware pervasive systems, and other advanced human-computer interaction [1, 2, 5]. However, there are a number of reasons why human activity recognition is a very challenging problem. Firstly, the human body is non-rigid and has many degrees of freedom, so it can generate infinite variations of every basic movement. Secondly, no two persons are identical in terms of body shape, volume, and gesture style.
These problems are compounded by uncertainties such as variation in viewpoint, illumination, shadow, self-occlusion, deformation, noise, clothing, and so on [8]. Since the problem is very broad, researchers usually make a set of assumptions to keep it tractable. On the other hand, recent research has shown that combining color, depth, and motion can improve segmentation results [11]. This direction looks all the more promising today, since depth cameras are becoming ubiquitous, easy to use, and reasonably priced. Additionally, computer vision algorithms for tracking human poses accurately, given a reliable segmentation result, are maturing. These two developments form the baseline for our research: combining a depth camera with mature pose-tracking algorithms, we restrict our task to recognizing human actions given a human pose estimate in a 3D coordinate system.

Methods for the related low-level processing can be found in the vast literature available. Aggarwal and Ryoo have provided a tidy overview of the state of the art in human activity recognition methodologies; according to their taxonomy, our work belongs to the sequential, single-layered, exemplar-based approaches. We choose an exemplar-based, sequential, single-layered approach because of its robustness against variation in the speed or style of performing an action; moreover, it requires less training data than state-based approaches [2]. Dynamic Time Warping has also been used by other authors recently: Zaw (2011) uses multiple cameras to acquire depth information [8], and Ho (2010) uses inertial sensors to obtain body-joint information for recognizing actions generated by the upper body [13].

In our approach, each frame from the depth camera is treated as an observation (feature vector), and we deduce that an activity has occurred in the video if we observe a particular sequence representing that activity. The sequential approach first converts a sequence of images into a sequence of feature vectors by feature extraction. Exemplar-based recognition describes classes of human actions directly from training samples, maintaining a representative sequence per activity and matching it against a new sequence to recognize its activity. Our feature vector is built from the joint orientations of the body parts describing the action of a person in each frame; joint orientation was selected because it is invariant to human body size. To recognize an action, we compare the video input against a list of defined actions using Dynamic Time Warping.

The remainder of this paper is organized as follows: section 2 describes the methods used in this research, section 3 presents experiments on human action recognition, and section 4 concludes with future work.

II. ACTION REPRESENTATION

2.1 Pose Estimation

Human motion is often considered a continuous evolution of the spatial configuration of the body segments, i.e. the body posture. If body joints can be reliably extracted and tracked from a sequence of depth maps, action recognition can be achieved using the tracked joint positions [7]. Therefore, we first estimate the human pose from the depth maps. Although pose estimation is an ill-posed problem in computer vision, owing to the complexity caused by the many variations between humans, recent research has shown that from a single depth map, human body parts can be recognized quite well in real time [12]. We therefore use this pose estimation result as the input data for our research.

We represent the human pose by a stick figure consisting of 15 body-part joints, as shown in figure 1.

Figure 1. Stick figure representation with 15 body-part joints

For rotation and scale invariance, we use the orientation of each joint relative to the world coordinate frame, rather than its position, to describe human motion, as illustrated in the sketch below.
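As a concrete illustration of this representation, the minimal Python sketch below lists one plausible set of 15 joints and a per-frame pose container. The joint names are an assumption based on the OpenNI/NITE skeleton used in our setup (Section IV), since figure 1 is not reproduced here; each joint carries a world-frame orientation stored in the quaternion form introduced in section 2.2 below.

    # The 15 joint names below are an assumption (OpenNI/NITE-style skeleton).
    JOINTS = [
        "head", "neck", "torso",
        "left_shoulder", "left_elbow", "left_hand",
        "right_shoulder", "right_elbow", "right_hand",
        "left_hip", "left_knee", "left_foot",
        "right_hip", "right_knee", "right_foot",
    ]

    # A pose maps each joint name to a world-frame orientation, stored as a
    # unit quaternion (w, x, y, z); see section 2.2.
    def neutral_pose():
        """A neutral pose in which every joint carries the identity rotation."""
        return {joint: (1.0, 0.0, 0.0, 0.0) for joint in JOINTS}

Listing 1. Illustrative per-frame pose container with 15 joints.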
2.2 Rotational Representation

We represent orientation using quaternions. Quaternions are a compact and complete representation of rotations in 3D space compared with Euler angles. A quaternion is a 4-dimensional tuple $(w, x, y, z)$ that typically represents a rotation about the axis $(x, y, z)$ by an angle $\theta$, with $w = \cos(\theta/2)$. In a quaternion representation of rotation, singularities are avoided, giving a more efficient and accurate representation of rotational transformations. Compared to 3-by-3 rotation matrices, quaternions are also more compact, requiring only 4 storage units instead of 9. These properties make quaternions favourable for representing rotations.

A unit quaternion has norm 1 and is typically written with one real dimension and three imaginary dimensions. The three imaginary units $i$, $j$, and $k$ are of unit length and mutually orthogonal:

$q = w + xi + yj + zk$, with $w^2 + x^2 + y^2 + z^2 = 1$ and $i^2 = j^2 = k^2 = ijk = -1$, so that $ij = k$ and $ji = -k$.

Figure 2. Graphical representation of the product of quaternion units as a 90-degree rotation in 4D space

We first convert each orientation from Euler angles to a quaternion using the method described in [15]. The feature vector is then formed by concatenating the 15 quaternions of the respective body parts into a column vector of 60 elements.

Figure 3. Sequences of depth maps overlaid with the segmented human region and skeleton tracking result for some human actions: clap, wave, smash
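The construction of this feature vector can be made concrete with a short sketch. Note that the conversion below uses the standard closed-form roll-pitch-yaw formula rather than the matrix-based method of [15], and the axis convention is an assumption; it is meant as an illustration, not as our exact implementation.

    import math

    def euler_to_quaternion(roll, pitch, yaw):
        # Standard closed-form conversion from roll-pitch-yaw Euler angles
        # (radians) to a unit quaternion (w, x, y, z).
        cr, sr = math.cos(roll / 2), math.sin(roll / 2)
        cp, sp = math.cos(pitch / 2), math.sin(pitch / 2)
        cy, sy = math.cos(yaw / 2), math.sin(yaw / 2)
        return (cr * cp * cy + sr * sp * sy,
                sr * cp * cy - cr * sp * sy,
                cr * sp * cy + sr * cp * sy,
                cr * cp * sy - sr * sp * cy)

    def frame_feature_vector(joint_eulers):
        # Concatenate the quaternions of the 15 joints into a single
        # 60-element feature vector for one frame.
        assert len(joint_eulers) == 15
        vector = []
        for roll, pitch, yaw in joint_eulers:
            vector.extend(euler_to_quaternion(roll, pitch, yaw))
        return vector

Listing 2. Building the 60-element per-frame feature vector (illustrative sketch).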
III. DYNAMIC TIME WARPING

Activity recognition here is the process of classifying multivariate time series. We classify activities using a nearest-neighbour algorithm with dynamic time warping (DTW) as the distance measure. DTW is a well-known algorithm in many areas, especially speech recognition; since gesture and speech share many characteristics, such as variation in duration and features, techniques used for speech recognition have often been adapted to gesture recognition. DTW is popular because it is an extremely efficient time-series similarity measure that minimizes the effects of shifting and distortion in time, creating a warping path that matches similar shapes even when they differ in phase [9]. DTW yields the optimal alignment in $O(T_1 T_2)$ time, which can be improved further through techniques such as multi-scaling.

In our action recognition system, we express the feature vectors of the two gestures to be compared as two time series X and Y:

$X = (x_1, x_2, \dots, x_{T_1})$
$Y = (y_1, y_2, \dots, y_{T_2})$

As multivariate series, these two sequences form much larger feature vectors for comparison, and it is evidently impossible to compute a distance metric between two vectors of unequal dimensions. Instead, a local cost measure is defined on pairs of observations:

$c : \mathcal{F} \times \mathcal{F} \to \mathbb{R}_{\geq 0}$

The cost measure should be low if two observations are similar and high if they are very different. Evaluating the cost for all pairs of elements of X and Y yields a $T_1 \times T_2$ local cost matrix. From this matrix, we wish to obtain a correspondence mapping elements of X to elements of Y that results in the lowest distance measure. We define this correspondence as a warping path

$F = (c(1), c(2), \dots, c(K))$, where $c(k) = (i_k, j_k)$ with $i_k \in [1, T_1]$ and $j_k \in [1, T_2]$.

The warping path has to follow the time order of the respective gestures, so we impose several conditions on it:

1. Boundary condition: the starting and ending observations are aligned to each other for both gestures, i.e. $c(1) = (1, 1)$ and $c(K) = (T_1, T_2)$.
2. Monotonicity condition: the observations are aligned in order of time, $i_1 \leq i_2 \leq \dots \leq i_K$ and $j_1 \leq j_2 \leq \dots \leq j_K$. This is intuitive, as the order of observations in a gesture should not be reversed.
3. Step size condition: no observation is to be skipped, i.e. $i_{k+1} - i_k \leq 1$ and $j_{k+1} - j_k \leq 1$.

We then arrive at an overall cost function

$C(F) = \sum_{k=1}^{K} d(x_{i_k}, y_{j_k})$

which gives the total cost/distance between two gestures according to a warping path F. Since C(F) can be evaluated for every possible warping path between the observation sequences X and Y, the dynamic time warping algorithm finds the warping path that gives the lowest cost/distance between the two gestures. In our scenario, we apply dynamic programming and define D as the accumulated cost matrix:

1. Initialise $D(1, 1) = d(x_1, y_1)$.
2. Initialise all other entries of D to 32767 (an arbitrarily large number).
3. Calculate $D(t_1, t_2) = \min\{D(t_1 - 1, t_2 - 1),\ D(t_1 - 1, t_2),\ D(t_1, t_2 - 1)\} + d(x_{t_1}, y_{t_2})$.

Although each observation consists of 15 serialized quaternions, it is split into its individual quaternions for the metric calculation; the final distance is the sum of the distances between the 15 pairs of quaternions. The number of joints used in the distance calculation can be adapted here. However, it is not trivial to simply compute the Euclidean distance between two quaternions, because unit quaternions have two representations for each orientation: in the rotation group SO(3), the negative of a quaternion q is equivalent to q. Hence the usual Euclidean distance has to be modified to take this non-uniqueness into account [13]. So instead of

$d = \sqrt{(w_1 - w_2)^2 + (x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2}$

we use

$d = \min\Big(\sqrt{(w_1 - w_2)^2 + (x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2},\ \sqrt{(w_1 + w_2)^2 + (x_1 + x_2)^2 + (y_1 + y_2)^2 + (z_1 + z_2)^2}\Big)$

In the figure below, a warping plane is shown: the indexes of the two time sequences are placed on the x and y axes, and the graph shows the mapping function from the index of one sequence to the index of the other.

Figure 4. Matching of similar points on two signals
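The distance computation and the dynamic-programming recursion above can be summarized in a minimal sketch, together with the one-nearest-neighbour search used for recognition (described in the next paragraph). Here a series is a list of frames and each frame a list of 15 (w, x, y, z) tuples; the helper names are our own, and the lower-bounding speed-up mentioned below is omitted for brevity.

    import math

    def quaternion_distance(q1, q2):
        # Euclidean distance between unit quaternions, taking the minimum
        # over q2 and -q2 because q and -q encode the same rotation.
        d_plus = math.sqrt(sum((a - b) ** 2 for a, b in zip(q1, q2)))
        d_minus = math.sqrt(sum((a + b) ** 2 for a, b in zip(q1, q2)))
        return min(d_plus, d_minus)

    def frame_cost(x, y):
        # Local cost d(x, y): sum of the 15 per-joint quaternion distances.
        return sum(quaternion_distance(qx, qy) for qx, qy in zip(x, y))

    def dtw_distance(X, Y):
        # Accumulated cost matrix D with step sizes (1,1), (1,0) and (0,1).
        t1, t2 = len(X), len(Y)
        INF = float("inf")
        D = [[INF] * (t2 + 1) for _ in range(t1 + 1)]
        D[0][0] = 0.0
        for i in range(1, t1 + 1):
            for j in range(1, t2 + 1):
                cost = frame_cost(X[i - 1], Y[j - 1])
                D[i][j] = cost + min(D[i - 1][j - 1],   # diagonal step
                                     D[i - 1][j],       # advance X only
                                     D[i][j - 1])       # advance Y only
        return D[t1][t2]

    def classify(query, exemplars):
        # exemplars: list of (label, series) pairs; 1-NN by DTW distance.
        return min(exemplars, key=lambda ex: dtw_distance(query, ex[1]))[0]

Listing 3. Sketch of DTW with the sign-corrected quaternion metric and 1-NN classification.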
As is typical of nearest-neighbour algorithms, there is no specific learning phase. Our system stores a list of multivariate time series of known activities and their corresponding labels in a database. When an unknown action is presented to the system, it takes the unknown time series and performs a sequential search over the database using lower-bounding DTW.

IV. EXPERIMENT

A. Equipment Setup

Motion capture was done using a Kinect camera, and pose estimation was performed using OpenNI with the PrimeSense NITE library [14]. The Kinect camera is shown in the figure below.

Figure 5. Microsoft Kinect

TABLE I
MICROSOFT KINECT SPECIFICATION

Sensor item                            Specification range
Viewing angle                          43° vertical by 57° horizontal field of view
Mechanized tilt range (vertical)       ±28°
Frame rate (depth and color stream)    30 frames per second (FPS)
Resolution, depth stream               QVGA (320 × 240)
Resolution, color stream               VGA (640 × 480)

B. Dataset

Since no public benchmark dataset supplying sequences of depth maps is available, we collected a dataset containing six actions: two-hand wave, hand clap, tennis smash, boxing, side kick, and jogging. Each action was performed five times, with the actor facing the camera during the performance. The depth maps were captured at about 15 frames per second by a Kinect camera, which acquires depth through structured infra-red light. Note that the six actions were chosen to reasonably cover the various movements of the arms, legs, and torso, and their combinations.

C. Result

Below are the comparison paths for the clap, punch, smash, wave, run, and kick actions; each was performed five times. In the current results, actions generated by the upper body (clap, punch, smash, and wave) are recognized quite well, but we have to collect more data for benchmarking. Recognition of actions generated by the lower body still has to be improved.

Figure 6. Motion path of the action "clap"
Figure 7. Motion path of the action "punch"
Figure 8. Motion path of the action "smash"
Figure 9. Motion path of the action "wave"
Figure 10. Motion path of the action "run"
Figure 11. Motion path of the action "kick"

V. CONCLUSION AND FUTURE WORK

Recognition of human actions is still in its infancy compared to other intensively studied topics like human detection and tracking. This paper has presented a dynamic time warping approach to recognizing human actions, using a depth camera to capture human motion. Dynamic time warping is good at recognizing simple actions; however, for analysing more complex human actions performed by multiple people, basic Dynamic Time Warping is not the best approach to cover all possible scenarios. For future work, we plan to experiment with different distance metric calculations using weighted dynamic time warping and to involve more subjects in the experiments.

ACKNOWLEDGMENT

Samsu Sempena thanks Dr. Nur Ulfa Maulidevi S.T., M.Sc and Peb Ruswono Aryan M.T for their continuous support and constructive feedback during this research.

REFERENCES

[1] J.K. Aggarwal and Q. Cai, "Human Motion Analysis: A Review", Computer Vision and Image Understanding, 73(3):428-440, 1999.
[2] J.K. Aggarwal and M.S. Ryoo, "Human Activity Analysis: A Review", ACM Computing Surveys, 2011.
[3] D.M. Gavrila and L.S. Davis, "Towards 3-D Model-Based Tracking and Recognition of Human Movement: A Multi-View Approach", 1995.
[4] K. Baker, "Singular Value Decomposition Tutorial", 2005.
[5] T.B. Moeslund, A. Hilton, and V. Kruger, "A Survey of Advances in Vision-Based Human Motion Capture and Analysis", Computer Vision and Image Understanding, 104(2-3):90-126, 2006.
[6] M. Muller, Information Retrieval for Music and Motion, Springer, 2007.
[7] W. Li, Z. Zhang, and Z. Liu, "Action Recognition Based on a Bag of 3D Points", IEEE International Workshop on CVPR for Human Communicative Behavior Analysis, 2010.
[8] Z.Z. Htike, S. Egerton, and K.Y. Chow, "Model-free Viewpoint Invariant Human Activity Recognition", IMECS, 2011.
[9] P. Senin, "Dynamic Time Warping Algorithm Review", Honolulu, USA, December 2008.
[10] M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri, "Actions as Space-Time Shapes", in ICCV, pages 1395-1402, Beijing, 2005.
[11] J. Leens, S. Pierard, O. Barnich, M. Van Droogenbroeck, and J.M. Wagner, "Combining Color, Depth, and Motion for Video Segmentation", in ICVS 2009, Liege, Belgium.
[12] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake, "Real-Time Human Pose Recognition in Parts from Single Depth Images", Microsoft Research Cambridge & Xbox Incubation, 2011.
[13] Ho Chun Jian, "Gesture Recognition Using Windowed Dynamic Time Warping", M.Eng. thesis, National University of Singapore, Singapore, 2010.
[14] "Prime Sensor NITE 1.3 Algorithms Notes", 2010. [Online]. Available: http://www.primesense.com
[15] M. Baker, "Matrix to Quaternion". [Online]. Available: http://www.euclideanspace.com/maths/geometry/rotations/conversions/matrixToQuaternion/index.htm