Journal of Theoretical and Applied Information Technology
15th October 2019. Vol.97. No 19
© 2005 – ongoing JATIT & LLS
ISSN: 1992-8645
www.jatit.org
E-ISSN: 1817-3195
GLOBAL DOMINANT SIFT FOR VIDEO INDEXING AND
RETRIEVAL
KAMAL ELDAHSHAN1, HESHAM FAROUK2, AMR ABOZEID3, M. HAMZA EISSA4
1,3,4 Dept. of Mathematics, Computer Science Division, Faculty of Science, Al-Azhar University, Cairo, Egypt.
2 Computers and Systems Dept., Electronics Research Institute, Cairo, Egypt.
E-mail: 1dahshan@gmail.com, 2hesham@eri.sci.eg, 3aabozeid@azhar.edu.eg, 4mohammed_essa2001@yahoo.com
ABSTRACT
The massive volume of video data creates a strong demand for efficient and effective video indexing and retrieval frameworks. The extraction and representation of visual features play a significant role in video/image retrieval and computer vision. This paper proposes a new compact descriptor named Global Dominant Scale Invariant Feature Transform (GD-SIFT). GD-SIFT requires fewer bits (16 bits) to represent each visual feature. Importantly, the proposed descriptor is vocabulary-free, training-free and suitable for online and real-time applications. This paper also proposes a new video indexing and retrieval framework based on the GD-SIFT descriptor. The proposed framework performs content-based video indexing and retrieval, which allows videos to be retrieved by text (e.g. video name or metadata), image (video frame) or video clip. The experiments were carried out on the standard Stanford I2V dataset. They demonstrate that the GD-SIFT descriptor is more efficient (in terms of speed and storage) and achieves high accuracy (about 78%) with respect to the related works. Moreover, the results indicate that the proposed descriptor is more robust to variations (e.g. scale, rotation, etc.).
Keywords: Video Indexing, Video Search, SIFT, Descriptor, Query-By-Image
1. INTRODUCTION
The availability of communication networks, video recording devices and low-cost storage technologies allows users to record videos and build huge video databases. There is an increasing demand for efficient and effective indexing and retrieval frameworks to manage these databases. Indexing and retrieving videos from a huge video database is a challenging task [1, 2]. As a result, many content-based video indexing and retrieval (CBVR) frameworks have been proposed in the literature.
CBVR can be defined as "the automatic process of content-based classification of video data for fast access and retrieval" [3]. In other words, information is extracted from the video content in order to answer specific queries [4]. Content-based refers to the actual video contents, which might be local visual features (like colors, texture, motion or objects) and audio features [5].
Video is complex and contains a large number of visual features, which should be extracted, analyzed and stored in an efficient way. During the last decades, the number of different visual features proposed in the literature has increased significantly [6]. Many local features were designed to be faster, more distinctive and robust under many different variations (e.g. scale, rotation, etc.) [7].
Some popular and successful local features
developed during the recent decade are Scale
Invariant Feature Transform (SIFT) [8], Principal
Component Analysis (PCA)-SIFT [9], Speeded Up
Robust Features (SURF) [10] and Histogram of
Oriented Gradients (HOG) [11]. Traditional local features have limitations in ubiquitous and real-time applications because of their large size (e.g. 128 bytes for a SIFT key-point) [12, 13].
Recently, binary features, such as Binary Robust Independent Elementary Features (BRIEF) [14], Binary Robust Invariant Scalable
Key-points (BRISK) [7] and Fast Retina Key-point (FREAK) [15], have been proposed to represent the local feature in a more distinctive way. However, these features are still large (e.g. ≥ 16 bytes per key-point), while some low bit-rate image/video retrieval applications aim for much smaller descriptors (e.g. ≤ 100 bits per feature) [16, 17].
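Such binary descriptors are compared with the Hamming distance (a bitwise XOR followed by a bit count), which can be sketched in a few lines; the 256-bit length below is an illustrative choice matching BRIEF-256, not a value from this paper:

```python
def hamming_distance(d1: int, d2: int) -> int:
    """Hamming distance between two binary descriptors stored as ints:
    XOR the bit strings and count the differing bits."""
    return bin(d1 ^ d2).count("1")

# Two illustrative 256-bit (32-byte) binary descriptors.
a = (1 << 256) - 1        # all 256 bits set
b = a ^ (0b1111 << 8)     # the same descriptor with four bits flipped
print(hamming_distance(a, b))  # -> 4
```

This is why binary descriptors match quickly in practice: the whole comparison is one XOR and a population count per descriptor pair.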
The authors of [18] presented a compact SIFT descriptor (called dominant SIFT) which uses only 48 bits to describe each local feature (key-point) of the image. The main advantages of this descriptor are that it is training-free, vocabulary-free and suitable for ubiquitous and real-time applications.
This paper extends the dominant SIFT and proposes GD-SIFT for video features. GD-SIFT is more compact than the dominant SIFT and uses 16 bits (instead of 128 bits for SIFT [8] and 48 bits for dominant SIFT [19]) to describe each key-point. We also propose a framework for video indexing and retrieval based on GD-SIFT and the time-constraint cluster algorithm. The proposed framework is a web application which helps users upload and search for videos. The user can retrieve videos by text (e.g. video name or metadata), image (video frame) or video clip. The experimental results show that GD-SIFT is more efficient (in terms of speed and storage) and achieves high accuracy (an average of 78%) with respect to the original SIFT [8] and the dominant SIFT [19] descriptors. Moreover, the proposed descriptor is more robust to variations (e.g. scale, rotation, etc.). Importantly, the proposed descriptor is vocabulary-free and training-free. Therefore, GD-SIFT is suitable for online and real-time applications.
The paper is organized as follows: section 2 reviews the related work. Section 3 explains the proposed GD-SIFT methodology. Section 4 discusses the experimental results. Finally, section 5 concludes the paper and suggests future work.
2. RELATED WORK
The SIFT feature includes two main parts:
key-point detector and SIFT descriptor [8]. The
key-point detector scans the input image to detect
the interest points. First, Gaussian filters of different scales are applied to the input image, which is then re-sized to produce a Gaussian scale-space.
Neighboring images with the same resolution in
this scale-space are subtracted to get the Difference
of Gaussian (DoG) pyramid. The key-point is taken
if and only if it is a local extremum in the DoG
pyramid. The key-point localization is the last step
applied to get the most stable key-points.
The standard key-point descriptor used by
SIFT is created by sampling the magnitudes and
orientations of the image gradient in the patch
around the key-point, and building smoothed
orientation histograms to capture the important
aspects of the patch. A 4×4 array of histograms,
each with 8 orientation bins, captures the rough
spatial structure of the patch. This 128-element
vector is then normalized to unit length and
thresholded to remove elements with small values.
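The descriptor construction just described (a 4×4 grid of 8-bin orientation histograms flattened to 128 values, normalized to unit length, thresholded and re-normalized) can be sketched as follows; the input histograms here are illustrative placeholders rather than gradients sampled from a real image:

```python
import math

def build_sift_descriptor(histograms, clip=0.2):
    """Flatten a 4x4 grid of 8-bin orientation histograms into a
    128-element SIFT-style descriptor: normalize to unit length,
    clip large values (Lowe uses a 0.2 threshold), re-normalize."""
    assert len(histograms) == 16 and all(len(h) == 8 for h in histograms)
    v = [x for h in histograms for x in h]           # 16 x 8 = 128 values
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    v = [min(x / norm, clip) for x in v]             # unit length + clipping
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]                     # re-normalized

# Illustrative input: 16 identical 8-bin histograms.
desc = build_sift_descriptor([[1, 2, 3, 4, 5, 6, 7, 8]] * 16)
print(len(desc))  # -> 128
```

The 4×4 × 8 layout is why every SIFT descriptor has exactly 128 elements, the figure referred to throughout this paper.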
The SIFT key-points are particularly
useful due to their distinctiveness, which enables
the correct match for a key-point to be selected
from a large database of other key-points. This
distinctiveness is achieved by assembling a high-dimensional vector representing the image gradients within a local region of the image. The
key-points have been shown to be invariant to
image rotation and scale and robust across a
substantial range of affine distortion, addition of
noise, and change in illumination. Large numbers
of key-points can be extracted from typical images,
which leads to robustness in extracting small
objects among clutter.
The fact that key-points are detected over
a complete range of scales means that small local
features are available for matching small and highly
occluded objects, while large key-points perform
well for images subject to noise and blur. Their
computation is efficient, so that several thousand
key-points can be extracted from a typical image
with near real-time performance on standard PC
hardware.
Ke et al. [9] introduced an alternate
representation for local image descriptors for the
SIFT algorithm. Compared to the standard
representation, PCA-SIFT is both more distinctive
and more compact leading to significant
improvements in matching accuracy (and speed) for
both controlled and real-world conditions. Each 41-pixel-by-41-pixel image patch centered at a key-point is extracted and rotated to line up with its dominant orientation. Gradient values in the x-direction and the y-direction for all pixels in the image patch are calculated to form a 2×39×39 = 3042-dimension vector.
Despite its name, PCA-SIFT does not reduce the SIFT feature vector itself; instead, it builds its own representation of the detected interest points. Each 3042-dimension feature vector is then projected onto a low-dimensional space. To execute this last task, a projection kernel is pre-computed using PCA over 21000 patches collected from diverse images that are not used later. This lower-dimensional feature vector speeds up the applications that use it; however, it may lead to less accurate results than those obtained with SIFT descriptors. PCA-SIFT is demonstrated to achieve better results when it reduces its descriptor to a 36-dimensional feature vector.
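The projection step reduces a patch vector to a few dimensions with a single matrix-vector product. In the sketch below a random matrix stands in for the PCA kernel that PCA-SIFT pre-learns from training patches; the dimensions 3042 and 36 follow the text, but the random kernel is only a placeholder:

```python
import random

random.seed(0)
IN_DIM, OUT_DIM = 3042, 36   # 2 x 39 x 39 gradient values -> PCA-SIFT size

# Stand-in projection kernel; PCA-SIFT would learn this with PCA
# over ~21000 training patches, not draw it at random.
kernel = [[random.gauss(0, 1) for _ in range(IN_DIM)] for _ in range(OUT_DIM)]

def project(vec):
    """Project a 3042-dimension patch vector to 36 dimensions."""
    return [sum(k * x for k, x in zip(row, vec)) for row in kernel]

patch_vector = [random.random() for _ in range(IN_DIM)]
print(len(project(patch_vector)))  # -> 36
```

The same matrix-vector product is all a mobile client needs at query time once the kernel has been learned offline.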
The dimension of the SIFT vector can also be reduced directly with a PCA transform. Similarly to PCA-SIFT, a PCA transform matrix is pre-learned from an image database. On mobile devices, the PCA transform is applied to the SIFT features extracted from query images to achieve a more compact descriptor, called Reduced SIFT [20].
SIFT uses only grayscale information to detect key-points; therefore, a lot of information is discarded for color images. Abdel-Hakim and Farag [21] proposed a color SIFT (CSIFT), which combines color invariant characteristics with the basis of SIFT and aims to overcome this limitation of SIFT for color images.
Table 1: Comparisons Between SIFT and Its Variants.

SIFT [8]
  Key-point detection. Scale space: multi-scale images convoluted by a Gaussian function. Selection: detect extrema in the Difference of Gaussian (DoG) space.
  Key-point description. Main direction: compute the gradient amplitude of a square area (16×16); select the direction with the maximum gradient amplitude as the main direction. Feature extraction: divide the 16×16 region into 4×4 sub-regions; compute a gradient histogram for each sub-region.
  Size (bits): 128

PCA-SIFT [9]
  Key-point detection: similar to SIFT.
  Key-point description. Main direction: similar to SIFT. Feature extraction: extract a 41×41 patch; construct a 3042-dimensional vector; use a projection matrix to reduce the dimensionality.
  Size (bits): <= 20

CSIFT [21]
  Key-point detection. Scale space: combine grayscale information with color information; convolute by a Gaussian function. Selection: similar to SIFT.
  Key-point description: similar to SIFT.
  Size (bits): 384

Dominant SIFT [19]
  Key-point detection: similar to SIFT.
  Key-point description. Main direction: similar to SIFT. Feature extraction: divide the 16×16 region into 4×4 sub-regions; compute a gradient histogram for each sub-region; compute a dominant gradient histogram.
  Size (bits): 48

GD-SIFT (our proposed)
  Key-point detection: similar to SIFT.
  Key-point description. Main direction: similar to SIFT. Feature extraction: divide the 16×16 region into 4×4 sub-regions; compute a gradient histogram for each sub-region; encode the global dominant gradient histogram using the time-constraint cluster algorithm.
  Size (bits): 16
To achieve a more compact descriptor,
hashing, vector quantization (VQ) and transform
coding (TC) are also considered [22]. Hashing is an
effective way to represent the local feature by using
a few bits [16], but it depends a lot on its hash
functions. VQ technique represents each local
feature by a code-word of a pre-trained vocabulary [23], but the large size of the vocabulary becomes a problem for devices with small memory [17]. The TC framework maps the local feature from the original feature space into a transform space using the PCA technique, which produces a small reconstruction error when reducing feature dimensions [17].
Tra et al. [18] presented a compact SIFT descriptor (called dominant SIFT) which uses only 48 bits to describe each local feature (key-point) of the image. The main advantages of this descriptor are that it is training-free, vocabulary-free and suitable for ubiquitous and real-time applications.
SIFT and its variants follow a methodology consisting of two main steps: key-point detection and description. Based on this methodology, Table 1 summarizes the comparison between SIFT and its variants, including the proposed GD-SIFT.
3. METHODOLOGY
Given a video as a new input to the database, the descriptors of the video key-frames are extracted and stored in the system database. Once the system receives a query image, the similarity between the query image descriptor and the descriptors already stored in the database is measured by the Brute-Force matcher [24]. The retrieved videos are ranked based on the measured similarity between the input descriptor and the stored descriptors. A global overview of the proposed framework is shown in Fig. 1. The whole process is divided into two main stages: the index stage (offline) and the retrieval stage (online), which are explained as follows.
Figure 1: The Proposed Framework Architecture
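The matching and ranking step described above can be sketched in pure Python; the toy two-dimensional descriptors and video names are illustrative only, and a real system would use the 128-value SIFT (or GD-SIFT) descriptors:

```python
import math

def l2(d1, d2):
    """L2-norm distance between two descriptors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(d1, d2)))

def rank_videos(query, database):
    """Brute-force match: compare the query descriptor against every
    stored descriptor and rank videos by their best (smallest) distance."""
    best = {}
    for video, desc in database:
        dist = l2(query, desc)
        if video not in best or dist < best[video]:
            best[video] = dist
    return sorted(best, key=best.get)  # most similar first

db = [("Economist_V1", [0.1, 0.9]), ("Time_V2", [0.8, 0.2]),
      ("Economist_V1", [0.5, 0.5])]
print(rank_videos([0.7, 0.3], db))  # -> ['Time_V2', 'Economist_V1']
```

This exhaustive comparison is what OpenCV's Brute-Force matcher does internally; the framework relies on the small GD-SIFT descriptors to keep it fast.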
3.1 Index Stage: An Offline Stage
The goal of this stage is to analyze the input video in order to obtain a compact descriptor of the entire set of video frames. The index stage consists of three main steps, which are explained as follows.
Step-A1: Video segmentation
The input video is considered as a
collection of representative key-frames which will
be processed to extract feature descriptors that
represent its content. Since the computational cost
is proportional to the amount of data (video frames)
being processed, two steps are performed to reduce
the quantity of data in both temporal and spatial
domains: key-frame pre-sampling and resizing.
The objective of pre-sampling and resizing is to reduce the computational complexity. The key-frame sampling method relies on the visual redundancy between consecutive frames within each second. Consequently, instead of processing all the video frames, a subset of frames is processed based on a predefined sampling rate. The sampling rate can be defined per second or per frame number [25]. In this step, we select one frame per second as the key-frame. Then, each selected key-frame is re-scaled to CIF (352 × 240) resolution.
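The pre-sampling step can be sketched as an index-selection routine; actual frame decoding and rescaling would be done with OpenCV/FFmpeg, and the 30 fps input is an example value:

```python
def keyframe_indices(total_frames: int, fps: float) -> list:
    """Pick one frame per second: the first frame of every second."""
    step = max(int(round(fps)), 1)
    return list(range(0, total_frames, step))

# Target (width, height) each selected key-frame is rescaled to.
CIF = (352, 240)

# Example: a 30 fps video with 90 frames yields 3 key-frames.
print(keyframe_indices(90, 30))  # -> [0, 30, 60]
```

Sampling one frame per second reduces a 30 fps video to 1/30 of its frames before any descriptor is computed, which is where most of the index-stage savings come from.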
Step-A2: Global Dominant (GD) SIFT descriptor
The SIFT algorithm constructs a description for each key-point from a 16×16 patch of pixels around the key-point. The final SIFT descriptor is constructed from 16 sub-histograms corresponding to a 4×4 array of sub-regions of this patch. In each sub-region, sixteen gradients are quantized into the 8 bins of its sub-histogram. Based on the statistical experiment in [19], there is a stronger correlation between bins in the same sub-histogram than between bins in different sub-histograms of a SIFT descriptor. Moreover, the sub-histogram values often concentrate on two or three adjacent bins after a circular shift.
For each SIFT vector a = (a^0, a^1, …, a^15), each sub-vector a^j = (a_0, a_1, …, a_7), with a_i ∈ Z ∩ [0, 256), is an 8-bin sub-histogram. Let CS_n(a^j, i) denote the consecutive sum-n at index i, which is defined as:

CS_n(a^j, i) = a_i + a_(i+1) + … + a_(i+n-1)    (1)

where a_m = a_(m mod 8) for all m ∈ Z ∩ [8, ∞) and n ∈ {1, 2, 3, 4}. Let MCS(a^j) be the maximum of CS_n(a^j, i) over i ∈ Z ∩ [0, 7].
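A minimal sketch of the consecutive sum-n and its maximum, following the definitions above (indices wrap circularly over the 8 bins):

```python
def consecutive_sum(a, n, i):
    """CS_n(a, i): sum of n consecutive bins of the 8-bin sub-histogram a,
    starting at index i and wrapping circularly (a_m = a_{m mod 8})."""
    return sum(a[(i + t) % 8] for t in range(n))

def max_consecutive_sum(a, n):
    """MCS(a): the maximum CS_n(a, i) over i in 0..7."""
    return max(consecutive_sum(a, n, i) for i in range(8))

a = [0, 1, 5, 9, 4, 1, 0, 0]       # illustrative 8-bin sub-histogram
print(consecutive_sum(a, 3, 2))    # bins 2, 3, 4 -> 18
print(max_consecutive_sum(a, 3))   # -> 18
```

The illustrative histogram concentrates its mass on adjacent bins 2-4, which is exactly the property (values clustered on two or three adjacent bins) that the dominant-SIFT encoding exploits.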
Algorithm 1 describes the GD-SIFT descriptor generation. Only 8 positions are possible for the maximum consecutive sum-n. Therefore, in the experiments we compute sum-3 and sum-1 to represent the whole SIFT descriptor of each key-point with 48 bits and 16 bits, respectively.
Algorithm 1: GD-SIFT descriptor generation
Input: F_k, k = 1, 2, …, m // the set of key-frames
Output: GD(F_k), k = 1, 2, …, m // a GD-SIFT for the key-frames
Start
1. For each F_k, k = 1, 2, …, m
   1.1. Compute D(F_k) // the 128-value SIFT descriptor for F_k
   1.2. Separate D(F_k) into 16 sub-vectors a^j = (a_(8j), …, a_(8j+7)), j ∈ Z ∩ [0, 15].
   1.3. Find the position of the maximum consecutive sum-n of a^j: p_j = argmax_(i ∈ Z ∩ [0, 7]) CS_n(a^j, i).
   1.4. Encode the SIFT feature by n × 16 bits
2. End loop
End
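Algorithm 1's core step can be sketched as follows; the function returns the 16 argmax positions (step 1.3), while the exact packing of these positions into n × 16 bits is left out since it is not spelled out here:

```python
def consecutive_sum(a, n, i):
    """CS_n over an 8-bin sub-histogram with circular wrapping."""
    return sum(a[(i + t) % 8] for t in range(n))

def gd_sift_positions(sift_descriptor, n=1):
    """For a 128-value SIFT descriptor, split it into 16 eight-bin
    sub-vectors and return, per sub-vector, the start position of the
    maximum consecutive sum-n (step 1.3 of Algorithm 1)."""
    assert len(sift_descriptor) == 128
    positions = []
    for j in range(16):
        sub = sift_descriptor[8 * j: 8 * j + 8]   # sub-vector a^j
        positions.append(max(range(8),
                             key=lambda i: consecutive_sum(sub, n, i)))
    return positions  # 16 positions; the paper encodes these in n*16 bits

desc = list(range(128))  # illustrative stand-in for a real SIFT vector
print(len(gd_sift_positions(desc, n=1)))  # -> 16
```

With n = 1 the encoding keeps only where each sub-histogram peaks, which is how the descriptor shrinks from 128 values to 16 bits per key-point.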
Step-A3: Clustering (time-constraint cluster algorithm)
The objective of this step is to group similar descriptors together and then select the most representative, global descriptor for each group. The resulting representative descriptors should reflect the content and the structure of the video [26]. Therefore, we adopt the time-constraint cluster algorithm, as shown in Algorithm 2. Its advantage is that it groups similar video frames together while preserving their natural time ordering.
The time-constraint cluster algorithm has O(n) complexity, where n is the number of video key-frames. In Algorithm 2, BF is the Brute-Force matcher [24] and the threshold ε is used to control the similarity between descriptors. In the experiments, we examined different threshold values and found that values between 0.04 and 0.2 are often good choices. For each cluster, the representative (global) descriptor is constructed by selecting the key-points that appear in all descriptors within the cluster.
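A minimal sketch of the time-constraint clustering, assuming a scalar distance function in place of the Brute-Force matcher score and a toy one-dimensional "descriptor"; the centroid update rule is an assumption, since it is not specified here:

```python
def time_constraint_cluster(descriptors, dist, eps):
    """Group consecutive key-frame descriptors: a frame joins the current
    cluster while its distance to the cluster centroid stays within eps,
    otherwise it starts a new cluster. One pass, so O(n) in the number of
    key-frames, and clusters respect the natural time ordering."""
    clusters = [[descriptors[0]]]
    centroid = descriptors[0]
    for d in descriptors[1:]:
        if dist(d, centroid) <= eps:
            clusters[-1].append(d)   # extend the current cluster
        else:
            clusters.append([d])     # time boundary: start a new cluster
        centroid = d                 # simple centroid update (assumption)
    return clusters

dist = lambda a, b: abs(a - b)       # toy 1-D "descriptor" distance
print(len(time_constraint_cluster([1, 1.1, 1.2, 9, 9.1], dist, 0.5)))  # -> 2
```

Because frames are only compared with the current centroid, shots that look alike but occur at different times stay in separate clusters, which preserves the video's temporal structure.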
Algorithm 2: The time-constraint cluster algorithm
Input: GD(F_k), k = 1, 2, …, m // a GD-SIFT for the key-frames
Output: C_i, i = 1, 2, …, s; s ≤ m // a set of clusters
Start
1. Initialize i = 1
2. Add F_1 to the cluster C_i whose cluster centroid is o_i = GD(F_1)
3. Loop for each k: 2 → m
4.   If BF(GD(F_k), o_i) ≤ ε then
5.     Add F_k to the cluster C_i
6.     Update o_i
7.   Else
8.     i = i + 1
9.     Add F_k to the cluster C_i whose cluster centroid is o_i = GD(F_k)
10.  End if
11.  k = k + 1
12. End loop
End

3.2 Query Stage: An Online Stage
The goal of this stage is to analyze the input query image in order to obtain the best matching with the stored video descriptors. The query stage consists of three main steps. The first step computes the global dominant SIFT descriptor for the input query image (as in step-A2). The second step is descriptor matching, to find the best matches within the descriptors stored in the database. We adopt the Brute-Force matcher with the L2-norm distance for finding the best matches [24, 27]. Finally, we display the retrieved videos ordered by their matching rates.

4. EXPERIMENTS AND RESULTS
This section presents the experimental settings, including the dataset and the evaluation criteria for video matching, and the experimental evaluations.

A prototype was implemented to test the proposed framework using the OpenCV [28] and FFmpeg [29] libraries. All the experiments were performed on a computer equipped with an Intel Core i7 CPU and 8 GB of RAM.

4.1 Dataset
The experiments were carried out on 30 videos from the standard dataset developed by Stanford [30]. The descriptions of these videos are listed in Table 2. All videos are in H.264/MP4 format with different properties. To evaluate the proposed framework, we built a video database of about 2 hours, i.e. about 201270 video frames.

4.2 Quality Evaluation
The quality of the retrieved videos is measured by computing the accuracy and comparing it with the traditional SIFT [8] and the dominant SIFT [19]. Given a query image, we retrieve the top n (e.g. 20-50) relevant videos and then compute the accuracy as follows:

Accuracy = (number of correctly retrieved videos) / (total number of query videos)    (2)

On the considered video database, four query groups with a total of 40 queries are fired. The first group consists of 10 original query images, see Table 3. Then some modifications (e.g. crop, rotate left and rotate right) are applied to the original images. The objective is to measure the accuracy of the proposed methodology and to assert its effectiveness in different cases.
Table 2: Description of Test Videos

No.  Video name      Duration  Size (MB)  Frames  Aspect  FPS  Resolution (W × H)
1    Economist_V1    00:01:30   9.17       2160   16:9    24   854 × 480
2    Economist_V21   00:04:37  26.1        8310   16:9    30   854 × 480
3    Economist_V22   00:02:20  13.1        2400   4:3     30   720 × 480
4    Economist_V31   00:00:29   2.11        870   16:9    30   640 × 360
5    Economist_V32   00:02:47  15.9        5010   4:3     30   528 × 480
6    Economist_V33   00:00:29  11           870   16:9    30   640 × 360
7    Economist_V41   00:05:30  98          9900   16:9    30   280 × 720
8    Economist_V42   00:04:04  62.5        7320   16:9    30   280 × 720
10   Economist_V51   00:01:27  20.9        2610   16:9    30   280 × 720
11   Economist_V52   00:01:00   5.48       1800   16:9    30   280 × 720
12   Economist_V53   00:02:21  14.4        4230   16:9    30   854 × 480
13   Economist_V61   00:02:52  17.4        5160   4:3     30   640 × 360
14   Economist_V62   00:01:19  21.8        2370   16:9    30   720 × 480
15   Economist_V63   00:01:20   8.88       2400   4:3     30   280 × 720
16   Economist_V71   00:14:09  84.7       25470   4:3     30   720 × 480
17   Economist_V72   00:03:46  22.6        6780   4:3     30   720 × 480
18   Economist_V81   00:19:47  118        35610   4:3     30   720 × 480
19   Time_V1_1       00:01:15   7.6        2250   4:3     30   720 × 480
20   Time_V1_2       00:04:17  25.6        7710   4:3     30   720 × 470
21   Time_V1_3       00:02:52  17          5160   16:9    30   720 × 470
22   Time_V2         00:04:00  22.5        7200   16:9    30   280 × 720
23   Time_V3_1       00:01:30   9.33       2250   16:9    25   854 × 480
24   Time_V3_2       00:01:30   9.33       2250   16:9    25   854 × 480
25   Time_V4_1       00:04:37  27.8        8220   4:3     30   720 × 470
26   Time_V4_2       00:12:20  73.3       22200   4:3     30   720 × 470
27   Time_V4_3       00:02:55  21.2        5250   16:9    30   280 × 720
28   Time_V5         00:02:29  15.8        4470   4:3     30   720 × 480
29   Time_V6_1       00:03:02  18.2        5460   4:3     30   720 × 470
30   Time_V6_2       00:03:06  44.2        5580   16:9    30   280 × 720
Table 3: Examples of the Query Images
[The table shows sample query images for the four groups: Group 1 (original images), Group 2 (cropped images), Group 3 (rotated-right images) and Group 4 (rotated-left images); the images themselves are omitted here.]
Table 4 shows the comparative results. The results demonstrate that the proposed descriptor achieved an average accuracy of 0.775, higher than the other compared descriptors. Moreover, the results indicate that the proposed descriptor maintains high accuracy in the left- and right-rotation cases, as shown in Figure 2. This is explained by the reduction of the false matches issued by SIFT.
Table 4: The Accuracy of Different Descriptors

Query Groups              SIFT [8]  Dominant SIFT [19]  GD-SIFT (proposed)
Group 1 (Original Image)  1         1                   1
Group 2 (Cropping)        0.7       0.5                 0.7
Group 3 (Rotate Left)     0.6       0.3                 0.8
Group 4 (Rotate Right)    0.5       0.4                 0.6
Average                   0.7       0.55                0.775
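The per-descriptor averages in Table 4 follow directly from the four group accuracies; a quick check:

```python
accuracy = {
    "SIFT":          [1.0, 0.7, 0.6, 0.5],   # original, crop, left, right
    "Dominant SIFT": [1.0, 0.5, 0.3, 0.4],
    "GD-SIFT":       [1.0, 0.7, 0.8, 0.6],
}
for name, acc in accuracy.items():
    print(name, round(sum(acc) / len(acc), 3))
# -> SIFT 0.7, Dominant SIFT 0.55, GD-SIFT 0.775
```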
Figure 2: Descriptor Accuracy Evaluations for Original SIFT, Dominant SIFT and Our Proposed Global Dominant SIFT (bar chart of the per-group accuracies in Table 4).
4.3 Efficiency Evaluation
Reducing the storage space and increasing the retrieval speed are very important criteria for any video retrieval system. Therefore, the efficiency of the proposed descriptor is evaluated by computing the Average Retrieved Time (ART) and the required storage space. Table 5 shows the ART of the different descriptors. The results demonstrate that the proposed descriptor achieved a lower ART (10.2 seconds on average) than the other compared descriptors. Therefore, the proposed descriptor can be considered a promising solution for online applications. It is important to note that these results depend on the computational power of the target mobile device.
Table 5: The Average Retrieved Time (ART) of Different Descriptors (in seconds)

Query Groups              SIFT [8]  Dominant SIFT (3) [19]  GD-SIFT (proposed)
Group 1 (Original Image)  47.87     19.77                   11.05
Group 2 (Cropping)        52.16     18.23                    9.73
Group 3 (Rotate Left)     48.29     16.34                   10.41
Group 4 (Rotate Right)    48.17     18.19                    9.59
Average                   49.12     18.13                   10.20
Figure 3: Descriptor ART Evaluations for Original SIFT, Dominant SIFT and Our Proposed Global Dominant SIFT (bar chart of the per-group ART values in Table 5).
The global dominant SIFT uses 16 bits to represent each key-point, which is 8 times and 3 times more compact than the original SIFT [8] and the dominant SIFT (3) [19], respectively.

As shown in Figure 4, the actual space required to store all the video attributes and descriptors is 212 MB for the global dominant SIFT, 610 MB for the dominant SIFT (3) and 1620 MB for the original SIFT. The video attributes include the code, name, resolution, size and location on the disc.
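The compactness claim is simple bit arithmetic, and the measured database sizes from Figure 4 can be compared against it (they need not match the ratios exactly, since the stored data also includes video attributes):

```python
BITS = {"SIFT": 128, "Dominant SIFT": 48, "GD-SIFT": 16}  # bits per key-point

ratio_vs_sift = BITS["SIFT"] // BITS["GD-SIFT"]               # 8x more compact
ratio_vs_dominant = BITS["Dominant SIFT"] // BITS["GD-SIFT"]  # 3x more compact
print(ratio_vs_sift, ratio_vs_dominant)  # -> 8 3

# Measured database sizes (MB) reported above, which also include
# video attributes, so they track the bit counts only approximately.
measured = {"SIFT": 1620, "Dominant SIFT": 610, "GD-SIFT": 212}
print(round(measured["SIFT"] / measured["GD-SIFT"], 1))  # -> 7.6
```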
Figure 4: Descriptor Memory Evaluations for Original SIFT, Dominant SIFT and Our Proposed GD-SIFT (bar chart of the database sizes in MB).
4.4 Discussion
The proposed GD-SIFT requires only 16 bits to represent each key-point. Moreover, the time-constraint cluster algorithm was adopted to group similar descriptors. For each cluster, the representative (global) descriptor was constructed by preserving the key-points that appear in all descriptors within the cluster. As shown in Table 1, GD-SIFT differs from the other methods in the key-point extraction and description steps. It is worth mentioning that GD-SIFT is very suitable for video indexing and retrieval applications.

Although the proposed descriptor requires less storage, the required storage should be decreased further to be more suitable for real applications. For better accuracy, motion features should also be considered, since they are important for video indexing and retrieval: extracting moving objects and distinguishing between camera motion, foreground motion and background motion. Combining motion features with static features is important for video indexing and retrieval.
5. CONCLUSIONS
In this paper, we proposed an efficient and effective video indexing and retrieval framework. This framework is based on a new compact descriptor called GD-SIFT. The GD-SIFT descriptor uses 16 bits to represent each key-point. Our experimental results show that the GD-SIFT descriptor achieved a high accuracy (an average of 78%) and is more efficient (in terms of speed and storage) with respect to the related works. Moreover, the results indicate that the proposed descriptor is more robust to variations (e.g. scale, rotation, etc.). Importantly, the proposed descriptor is suitable for online and real-time applications and needs neither a vocabulary nor training.

REFERENCES:
[1] M. Ravinder and T. Venugopal, "Content Based Video Indexing and Retrieval Using Key Frames Discrete Wavelet Center Symmetric Local Binary Patterns (DWCSLBP)," International Journal of Computer Science and Information Security, vol. 14, no. 5, p. 699, 2016.
[2] K. Uma, B. Shekar, and M. Smitha, "Video clip retrieval: An integrated approach based on KDM and LBPV," in Advances in Computing, Communications and Informatics (ICACCI), 2017 International Conference on, 2017, pp. 1613-1618: IEEE.
[3] M.-H. Park and R.-H. Park, "Efficient video indexing for fast-motion video," International Journal of Computer Graphics & Animation, vol. 4, no. 2, p. 39, 2014.
[4] S. Kaavya and G. LakshmiPriya, "Multimedia Indexing and Retrieval: Recent research work and their challenges," in Signal Processing, Communication and Networking (ICSCN), 2015 3rd International Conference on, 2015, pp. 1-5: IEEE.
[5] M. P. Chivadshetti, M. K. Sadafale, and M. K. Thakare, "Content Based Video Retrieval Using Integrated Feature Extraction."
[6] I. Ihrke, K. N. Kutulakos, H. P. Lensch, M. Magnor, and W. Heidrich, "State of the art in transparent and specular object reconstruction," in EUROGRAPHICS 2008 STAR - State of the Art Report, 2008: Citeseer.
[7] G. M. Farinella, S. Battiato, and R. Cipolla, Advanced Topics in Computer Vision. Springer, 2013.
[8] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91-110, 2004.
[9] Y. Ke and R. Sukthankar, "PCA-SIFT: A more distinctive representation for local image descriptors," in Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, 2004, vol. 2, pp. II-II: IEEE.
[10] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, "Speeded-up robust features (SURF)," Computer Vision and Image Understanding, vol. 110, no. 3, pp. 346-359, 2008.
[11] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, 2005, vol. 1, pp. 886-893: IEEE.
[12] J. He, S.-F. Chang, R. Radhakrishnan, and C. Bauer, "Compact hashing with joint optimization of search accuracy and time," in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, 2011, pp. 753-760: IEEE.
[13] B. Girod, V. Chandrasekhar, R. Grzeszczuk, and Y. A. Reznik, "Mobile visual search: Architectures, technologies, and the emerging
MPEG standard," IEEE MultiMedia, no. 3, pp. 86-94, 2011.
[14] M. Calonder, V. Lepetit, C. Strecha, and P. Fua, "BRIEF: Binary robust independent elementary features," in European Conference on Computer Vision, 2010, pp. 778-792: Springer.
[15] A. Alahi, R. Ortiz, and P. Vandergheynst, "FREAK: Fast retina keypoint," in 2012 IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 510-517: IEEE.
[16] V. Chandrasekhar, G. Takacs, D. Chen, S. Tsai, R. Grzeszczuk, and B. Girod, "CHoG: Compressed histogram of gradients a low bit-rate feature descriptor," in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, 2009, pp. 2504-2511: IEEE.
[17] J. Chen, L.-Y. Duan, R. Ji, and Z. Wang, "Multi-stage vector quantization towards low bit rate visual search," in Image Processing (ICIP), 2012 19th IEEE International Conference on, 2012, pp. 2445-2448: IEEE.
[18] A. T. Tra, W. Lin, and A. Kot, "Dominant SIFT: A novel compact descriptor," in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, 2015, pp. 1344-1348: IEEE.
[19] A. T. Tra, W. Lin, and A. Kot, "Dominant SIFT: A novel compact descriptor," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 1344-1348: IEEE.
[20] R. E. G. Valenzuela, W. R. Schwartz, and H. Pedrini, "Dimensionality reduction through PCA over SIFT and SURF descriptors," in Cybernetic Intelligent Systems (CIS), 2012 IEEE 11th International Conference on, 2012, pp. 58-63: IEEE.
[21] A. E. Abdel-Hakim and A. A. Farag, "CSIFT: A SIFT descriptor with color invariant characteristics," in 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), 2006, vol. 2, pp. 1978-1983: IEEE.
[22] V. Chandrasekhar et al., "Survey of SIFT compression schemes," in Proc. Int. Workshop Mobile Multimedia Processing, 2010, pp. 35-40.
[23] D. Nister and H. Stewenius, "Scalable recognition with a vocabulary tree," in Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, 2006, vol. 2, pp. 2161-2168: IEEE.
[24] Z. Pusztai and L. Hajder, "Quantitative comparison of feature matchers implemented in OpenCV3," 2016.
[25] H. Farouk, K. ElDahshan, and A. A. E. Abozeid, "Effective and Efficient Video Summarization Approach for Mobile Devices," International Journal of Interactive Mobile Technologies (iJIM), vol. 10, no. 1, pp. 19-26, 2016.
[26] H. Karray, M. Ellouze, and A. Alimi, "Indexing video summaries for quick video browsing," in Pervasive Computing, Springer, 2010, pp. 77-95.
[27] J. T. Arnfred and S. Winkler, "Fast-Match: Fast and robust feature matching on large images," in 2015 IEEE International Conference on Image Processing (ICIP), 2015, pp. 3000-3003: IEEE.
[28] OpenCV. (3/2019). OpenCV library. Available: https://opencv.org/
[29] FFmpeg. (3/2019). FFmpeg Library. Available: https://ffmpeg.org/
[30] A. Araujo, J. Chaves, D. Chen, R. Angst, and B. Girod, "Stanford I2V: a news video dataset for query-by-image experiments," in Proceedings of the 6th ACM Multimedia Systems Conference, 2015, pp. 237-242: ACM.