2011 International Conference on Electrical Engineering and Informatics
17-19 July 2011, Bandung, Indonesia
Human Action Recognition Using Dynamic Time Warping
Samsu Sempena 1, Dr. Nur Ulfa Maulidevi, S.T., M.Sc. 2, Peb Ruswono Aryan, M.T. 3
Institut Teknologi Bandung, School of Electrical Engineering and Informatics
Jl. Ganesha 10, Bandung, Indonesia
1 samsu@students.itb.ac.id
2 ulfa@stei.itb.ac.id
3 peb@stei.itb.ac.id
Abstract— Human action recognition is gaining interest from many computer vision researchers because of its wide variety of potential applications, for instance surveillance, advanced human-computer interaction, content-based video retrieval, and athletic performance analysis. In this research, we focus on recognizing human actions such as waving, punching, and clapping. We choose an exemplar-based sequential single-layered approach using Dynamic Time Warping (DTW) because of its robustness against variation in the speed or style with which an action is performed. To improve the recognition rate, we perform body-part tracking using a depth camera to recover human joint information in a 3D real-world coordinate system. We build our feature vector from joint orientations over the time series, which are invariant to human body size. Dynamic Time Warping is then applied to the resulting feature vectors. We examine our approach on several actions and confirm through several experiments that our method works well. Further experiments to benchmark the results will be conducted in the near future.
Keywords— human action recognition, exemplar-based approach,
Dynamic Time Warping, depth camera
I. INTRODUCTION
Human action recognition has gained interest from many computer vision researchers over the past two decades. It is motivated by a wide variety of potential applications, such as surveillance in public spaces to detect abnormal or suspicious activities, medical monitoring (of children or elderly persons), athletic performance monitoring, context-aware pervasive systems, and other advanced human-computer interaction [1, 2, 5].
However, there are a number of reasons why human activity recognition is a very challenging problem. First, the human body is non-rigid and has many degrees of freedom, so it can generate infinite variations of every basic movement. Second, no two persons are identical in terms of body shape, volume, and gesture style. These problems are further compounded by uncertainties such as variation in viewpoint, illumination, shadow, self-occlusion, deformation, noise, clothing, and so on [8]. Since the problem space is so vast, researchers usually make a set of assumptions to keep the problem tractable.
On the other hand, recent research has shown that combining color, depth, and motion can improve segmentation results [11]. This direction looks ever more promising, since depth cameras are becoming ubiquitous, easy to use, and reasonably priced. Additionally, computer vision algorithms are becoming mature enough to track human poses accurately given reliable segmentation results. These two improvements are the baseline for our research. Combining a depth camera with mature computer vision algorithms, we restrict our research to recognizing human actions given human pose estimates in a 3D coordinate system. Related methods for the low-level processing can be found in the vast literature available.
Aggarwal and Ryoo have provided a tidy overview of the state of the art in human activity recognition methodologies. Based on their overview, our research can be classified as a sequential, single-layered, exemplar-based approach. We choose this approach to recognize human actions because of its robustness against variation in the speed or style of performing an action. Moreover, it requires less training data than state-based approaches [2].
Several recent works by other authors also use Dynamic Time Warping: Zaw (2011) used multiple cameras to acquire depth information [8], and Ho (2010) used inertial sensors to obtain human joint information for recognizing upper-body actions [13].
In our approach, each frame from the depth camera is treated as an observation (feature vector), and we deduce that an activity has occurred in the video if we observe a particular sequence representing that activity. The sequential approach first converts a sequence of images into a sequence of feature vectors by extracting features. Exemplar-based recognition describes classes of human actions using training samples directly, maintaining a representative sequence per activity and matching it against a new sequence to recognize its activity.
The feature vector is built from the orientations of the body joints, describing the pose of the person in each frame. Joint orientation was selected because it is invariant to human body size. To recognize an action, we compare the input video against a list of predefined actions using the Dynamic Time Warping method.
The remainder of this paper is organized as follows: Section II describes the action representation used in this research, Section III describes the Dynamic Time Warping method, Section IV presents experiments on human action recognition, and Section V concludes with future work.
II. ACTION REPRESENTATION
2.1 Pose Estimation
Human motion is often considered a continuous evolution of the spatial configuration of the body segments, i.e., the body posture. If body joints can be reliably extracted and tracked from a sequence of depth maps, action recognition can be achieved using the tracked joint positions [7]. Therefore, we first estimate the human pose from the depth maps.
Although pose estimation is an ill-posed problem in computer vision, owing to its complexity and the many variations between humans, recent research has shown that from a single depth map, human body parts can be recognized quite well in real time [12]. We therefore use this pose estimation result as input data for our research.
We represent the human pose using a stick figure consisting of 15 body joints, as shown in figure 1.

Figure 1. Stick-figure representation with 15 body joints

For rotation and scale invariance, we use the orientation of each joint relative to the world coordinate system, rather than the joint positions, to describe human motion.

2.2 Rotational Representation
We represent orientation using quaternions. Quaternions are a compact and complete representation of rotations in 3D space compared with Euler angles. A quaternion is built from a 4-dimensional tuple $(w, x, y, z)$ with unit norm, and is typically represented by one real dimension and three imaginary dimensions. The three imaginary units $i$, $j$, and $k$ are unit length and orthogonal to one another:

$$q = w + xi + yj + zk, \qquad w^2 + x^2 + y^2 + z^2 = 1,$$
$$i^2 = j^2 = k^2 = ijk = -1.$$

A quaternion $(w, x, y, z)$ typically represents a rotation about the axis $(x, y, z)$ by an angle of $2\cos^{-1}(w)$. In a quaternion representation of rotation, singularities are avoided, giving a more efficient and accurate representation of rotational transformations. Compared to 3-by-3 rotation matrices, quaternions are also more compact, requiring only 4 storage units instead of 9. These properties make quaternions favourable for representing rotations.

Figure 2. Graphical representation of the product of quaternion units as a 90-degree rotation in 4D space

First, we convert each joint orientation from Euler angles to a quaternion using the method described in [15]. The feature vector is then formed by concatenating the 15 quaternions of the respective body parts into a column vector of 60 elements.

Figure 3. Sequences of depth maps overlaid with the segmented human region and skeleton tracking results for some human actions: clap, wave, smash
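To make the feature construction concrete, the following is a minimal sketch, not the authors' implementation: the joint list, the Euler-angle convention, and all function names are our assumptions, and [15] describes an equivalent matrix-to-quaternion conversion.

```python
import numpy as np

# Hypothetical list of the 15 tracked joints (the paper does not name them).
JOINTS = ["head", "neck", "torso",
          "l_shoulder", "l_elbow", "l_hand",
          "r_shoulder", "r_elbow", "r_hand",
          "l_hip", "l_knee", "l_foot",
          "r_hip", "r_knee", "r_foot"]

def euler_to_quaternion(roll, pitch, yaw):
    """Convert Z-Y-X Euler angles (radians) to a unit quaternion (w, x, y, z).

    The rotation-order convention is an assumption on our part.
    """
    cr, sr = np.cos(roll / 2), np.sin(roll / 2)
    cp, sp = np.cos(pitch / 2), np.sin(pitch / 2)
    cy, sy = np.cos(yaw / 2), np.sin(yaw / 2)
    w = cr * cp * cy + sr * sp * sy
    x = sr * cp * cy - cr * sp * sy
    y = cr * sp * cy + sr * cp * sy
    z = cr * cp * sy - sr * sp * cy
    return np.array([w, x, y, z])

def frame_feature_vector(euler_angles_per_joint):
    """Concatenate the 15 joint quaternions into a 60-element column vector.

    euler_angles_per_joint maps each joint name to (roll, pitch, yaw).
    """
    quats = [euler_to_quaternion(*euler_angles_per_joint[j]) for j in JOINTS]
    return np.concatenate(quats)  # shape (60,)
```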
III. DYNAMIC TIME WARPING
Activity recognition here is the process of classifying multivariate time series. We classify activities using the nearest-neighbour algorithm with dynamic time warping (DTW) as the distance measure. DTW is a well-known algorithm in many areas, especially speech recognition; since gesture and speech share many characteristics, such as variation in duration and features, techniques used for speech recognition have often been adapted for gesture recognition. The DTW algorithm is popular because it is an extremely efficient time-series similarity measure that minimizes the effects of shifting and distortion in time by creating a warping path to match similar shapes that are out of phase [9]. DTW yields the optimal alignment in O(MN) polynomial time, which can be improved further through techniques such as multi-scaling.

In our action recognition system, we express the feature vectors of the two gestures to be compared as two time series $X$ and $Y$:

$$X = (x_1, x_2, \ldots, x_{T_1}), \qquad Y = (y_1, y_2, \ldots, y_{T_2})$$

Using multivariate series, these two sequences form a much larger feature space for comparison, and it is evidently impossible to compute a distance metric between two vectors of unequal dimensions. A local cost measure is therefore defined:

$$d : \mathcal{F} \times \mathcal{F} \to \mathbb{R}_{\ge 0}$$

Accordingly, the cost measure should be low if two observations are similar and high if they are very different. Upon evaluating the cost measure for all pairs of elements in $X$ and $Y$, we obtain the local cost matrix $C \in \mathbb{R}^{T_1 \times T_2}$. From this local cost matrix, we wish to obtain a correspondence mapping elements of $X$ to elements of $Y$ that results in the lowest distance measure. We define this correspondence as a warping path

$$F = (c(1), c(2), \ldots, c(K)), \qquad c(k) = (t_1(k), t_2(k)), \quad t_1(k) \in [1, T_1], \; t_2(k) \in [1, T_2]$$

The mapping function has to follow the time-sequence order of the respective gestures. Hence, we impose several conditions on the warping path:

1. Boundary conditions: the starting and ending observation symbols are aligned to each other for both gestures:
$$c(1) = (1, 1), \qquad c(K) = (T_1, T_2)$$
2. Monotonic condition: the observation symbols are aligned in order of time. This is intuitive, as the order of observation signals in a gesture should not be reversed:
$$t_i(1) \le t_i(2) \le \cdots \le t_i(K), \quad i = 1, 2$$
3. Step-size condition: no observation symbols are to be skipped:
$$t_i(k+1) - t_i(k) \le 1$$

We then arrive at an overall cost function defined as

$$C(F) = \sum_{k=1}^{K} d(x_{t_1(k)}, y_{t_2(k)})$$

which gives the overall cost/distance between two gestures according to the warping path $F$. Since $C(F)$ ranges over all possible warping paths between the two observation sequences $X$ and $Y$, the dynamic time warping algorithm finds the warping path that gives the lowest cost/distance between the two gestures.

We apply dynamic programming to compute this minimum. We define $D$ as the accumulated cost matrix:

1. Initialise $D(1, 1) = d(x_1, y_1)$.
2. Initialise the remaining entries of $D$ to 32767 (an arbitrary large number).
3. Calculate
$$D(t_1, t_2) = \min\{ D(t_1 - 1, t_2 - 1),\; D(t_1 - 1, t_2),\; D(t_1, t_2 - 1) \} + d(x_{t_1}, y_{t_2})$$
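The recurrence above can be implemented directly with dynamic programming. Below is a minimal sketch under our own naming, assuming a generic local cost function d; it illustrates the accumulated cost matrix and is not the authors' implementation.

```python
import numpy as np

def dtw_distance(X, Y, d):
    """Accumulated-cost DTW between two sequences of feature vectors.

    X, Y : sequences (lists/arrays) of per-frame feature vectors.
    d    : local cost function, d(x, y) >= 0.
    Returns D(T1, T2), the minimal overall cost.
    """
    T1, T2 = len(X), len(Y)
    D = np.full((T1, T2), np.inf)   # large initial values (step 2)
    D[0, 0] = d(X[0], Y[0])         # boundary condition (step 1)
    for i in range(T1):
        for j in range(T2):
            if i == 0 and j == 0:
                continue
            best_prev = min(
                D[i - 1, j - 1] if i > 0 and j > 0 else np.inf,
                D[i - 1, j] if i > 0 else np.inf,
                D[i, j - 1] if j > 0 else np.inf,
            )
            D[i, j] = best_prev + d(X[i], Y[j])  # recurrence (step 3)
    return D[-1, -1]
```

The run time is O(T1 · T2), matching the O(MN) bound stated above.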
In the figure below, a warping plane is shown: the time-sequence indices are placed on the x and y axes, and the graph shows the mapping function from the indices of $X$ to the indices of $Y$.

Figure 4. Matching of similar points on two signals

Although each observation consists of 15 serialized quaternions, it is split back into its individual quaternions for the metric calculation, and the final distance is the sum of the distances between the 15 pairs of quaternions. Here we can adapt the number of joints used in the calculation of the metric distance. However, it is not trivial to simply calculate the Euclidean distance between two quaternions, as unit quaternions have two representations for each orientation: in the rotational SO(3) space, the negative of a quaternion $q$ is equivalent to $q$. Hence the usual equation for the Euclidean distance has to be modified to take this non-uniqueness of the rotational representation into account [13]. So instead of

$$d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2 + (w_1 - w_2)^2}$$

we use

$$d = \min\left( \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2 + (w_1 - w_2)^2},\; \sqrt{(x_1 + x_2)^2 + (y_1 + y_2)^2 + (z_1 + z_2)^2 + (w_1 + w_2)^2} \right)$$

where $q_1 = (x_1, y_1, z_1, w_1)$ and $q_2 = (x_2, y_2, z_2, w_2)$.
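As a concrete illustration of the sign-corrected distance above, here is a minimal sketch of the per-frame local cost summed over the 15 joint quaternions; the function names and the 4-elements-per-joint layout follow our feature-vector sketch in Section II and are assumptions, not the authors' code.

```python
import numpy as np

def quaternion_distance(q1, q2):
    """Euclidean distance between unit quaternions, accounting for the
    double representation in SO(3): q and -q encode the same rotation."""
    return min(np.linalg.norm(q1 - q2), np.linalg.norm(q1 + q2))

def frame_distance(f1, f2, num_joints=15):
    """Local cost d(x, y): sum of quaternion distances over the joints.

    f1, f2 : 60-element numpy arrays of 15 concatenated quaternions.
    """
    total = 0.0
    for j in range(num_joints):
        q1 = f1[4 * j : 4 * j + 4]
        q2 = f2[4 * j : 4 * j + 4]
        total += quaternion_distance(q1, q2)
    return total
```

This frame_distance can be passed as the local cost d to the dtw_distance sketch given earlier.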
As with a typical nearest-neighbour algorithm, there is no specific learning phase. Our system stores a list of multivariate time series of known activities and their corresponding labels in a database. When an unknown action is presented to the system, it takes the unknown time series and performs a sequential search over the database with lower-bounding DTW.
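A sketch of this nearest-neighbour search is shown below. The paper does not specify which lower bound is used, so this version performs the plain sequential scan; in practice a lower bound such as LB_Keogh would be used to skip candidates whose bound already exceeds the best distance found so far.

```python
def classify(query, database, dtw, d):
    """Label an unknown series by its nearest neighbour under DTW.

    query    : multivariate time series of an unknown action.
    database : list of (label, series) pairs of known activities.
    dtw, d   : e.g. dtw_distance and frame_distance from the sketches above.
    """
    best_label, best_dist = None, float("inf")
    for label, series in database:
        dist = dtw(query, series, d)
        if dist < best_dist:  # keep the closest exemplar so far
            best_label, best_dist = label, dist
    return best_label
```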
IV. EXPERIMENT
A. Equipment Setup
Motion capture was done using a Kinect camera, and pose estimation was performed using OpenNI with the PrimeSense NUI library [14]. The Kinect camera is shown in the figure below.

Figure 5. Microsoft Kinect

TABLE I
MICROSOFT KINECT SPECIFICATION

Sensor item                           Specification range
Viewing angle                         43° vertical by 57° horizontal field of view
Mechanized tilt range (vertical)      ±28°
Frame rate (depth and color stream)   30 frames per second (FPS)
Resolution, depth stream              QVGA (320 × 240)
Resolution, color stream              VGA (640 × 480)

B. Dataset
Since no public benchmark datasets supplying sequences of depth maps are available, we collected a dataset containing six actions: two-hand wave, hand clap, tennis smash, boxing, side kick, and jogging. Each action was performed five times, with the actor facing the camera during the performance. The depth maps were captured at about 15 frames per second by a Kinect camera, which acquires depth through structured infrared light. Note that the six actions were chosen to reasonably cover the various movements of the arms, legs, and torso, and their combinations.

C. Result
Below are the comparison paths for the clap, punch, smash, wave, run, and kick actions; each was performed five times. In our current results, actions generated by the upper body (clap, punch, smash, and wave) can be recognized quite well, but we still have to collect more data for benchmarking. Recognition of actions generated by the lower body still has to be improved.

Figure 6. Motion path of the action “clap”

Figure 7. Motion path of the action “punch”

Figure 8. Motion path of the action “smash”
Figure 9. Motion path of the action “wave”

Figure 10. Motion path of the action “run”

Figure 11. Motion path of the action “kick”
V. CONCLUSION AND FUTURE WORK
Recognition of human actions is still in its infancy compared with other intensively studied topics such as human detection and tracking. This paper has presented a dynamic time warping approach to recognizing human actions, with a depth camera used to capture the human motion. Dynamic time warping is good for recognizing simple actions; however, to analyse more complex human actions performed by multiple people, basic Dynamic Time Warping is not considered the best approach to cover all possible scenarios.

For future work, we plan to experiment with different distance metric calculations using weighted dynamic time warping and to involve more subjects in the actions for the experiments.

ACKNOWLEDGMENT
Samsu Sempena thanks Dr. Nur Ulfa Maulidevi, S.T., M.Sc. and Peb Ruswono Aryan, M.T. for their continuous support and constructive feedback during this research.

REFERENCES
[1] J.K. Aggarwal, Q. Cai. “Human Motion Analysis: A Review”. Computer Vision and Image Understanding, 73(3):428-440, 1999.
[2] J.K. Aggarwal, M.S. Ryoo. “Human Activity Analysis: A Review”. ACM Computing Surveys, 2011.
[3] D.M. Gavrila, L.S. Davis. “Towards 3-D model-based tracking and recognition of human movement: a multi-view approach”. 1995.
[4] Kirk Baker. “Singular Value Decomposition Tutorial”. 2005.
[5] T.B. Moeslund, A. Hilton, V. Kruger. “A Survey of Advances in Vision-Based Human Motion Capture and Analysis”. Computer Vision and Image Understanding, 104(2-3):90-126, 2006.
[6] M. Muller. Information Retrieval for Music and Motion. Springer, 2007.
[7] Wanqing Li, Zhengyou Zhang, Zicheng Liu. “Action Recognition Based on A Bag of 3D Points”. IEEE International Workshop on CVPR for Human Communicative Behavior Analysis, 2010.
[8] Zaw Zaw Htike, Simon Egerton, Kuang Ye Chow. “Model-free Viewpoint Invariant Human Activity”. IMECS, 2011.
[9] P. Senin. “Dynamic Time Warping Algorithm Review”. Honolulu, USA, December 2008.
[10] M. Blank, L. Gorelick, E. Shechtman, M. Irani, R. Basri. “Actions as space-time shapes”. In ICCV, pages 1395-1402, Beijing, 2005.
[11] J. Leens, S. Pierard, O. Barnich, M. Van Droogenbroeck, J.M. Wagner. “Combining Color, Depth, and Motion for Video Segmentation”. In ICVS 2009, Liege, Belgium.
[12] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, A. Blake. “Real-Time Human Pose Recognition in Parts from Single Depth Images”. Microsoft Research Cambridge & Xbox Incubation, 2011.
[13] Ho Chun Jian. “Gesture Recognition Using Windowed Dynamic Time Warping”. M.Eng. thesis, National University of Singapore, Singapore, 2010.
[14] (2010) Prime Sensor NITE 1.3 Algorithms notes. [Online]. Available: http://www.primesense.com.
[15] Martin Baker. “Matrix to Quaternion”. [Online]. Available: http://www.euclideanspace.com/maths/geometry/rotations/conversions/matrixToQuaternion/index.htm