S-SIFT: A Simple SIFT Algorithm with High Efficiency
Yixue Wang1,2, Yujie Huang*1,2, Liyuan Peng1,2, Mingyu Wang*1, Wenhong Li1,
Minge Jing1, Xiaoyang Zeng*1
1 State Key Lab of ASIC & System, Fudan University, 825 Zhangheng Rd, 200433, Shanghai, China
2 Shanghai ExploreX Technology Co., Ltd., 188 Shengrong Rd, 200120, Shanghai, China
* Email: {mywang, huangyj19, xyzeng}@fudan.edu.cn
Abstract—Finding distinctions and connections between multiple visual targets through the detection of keypoints has become one of the research hot spots in the field of computer vision. SIFT has received wide recognition and attention for its powerful performance in image matching. However, the original SIFT suffers from high complexity and long running times. In this paper, we propose Simple SIFT (S-SIFT), an improved SIFT algorithm in which the construction of the Gaussian pyramid and the order of the detection steps have been changed. Through the collaborative improvement and optimization of the scale space and the algorithm framework, a more efficient local keypoint detection method is achieved. Experiments demonstrate that S-SIFT guarantees satisfactory accuracy with a significant reduction in computational overhead. Meanwhile, S-SIFT is robust to common image deformations such as rotation, scale change, and affine transformation.

Keywords—SIFT, S-SIFT, keypoints detection, image match, computer vision

I. INTRODUCTION

Keypoint detection is a basic and critical process in the field of vision: it perceives and connects two image targets with the same or similar attributes. Important applications such as simultaneous localization and mapping, image stitching, and object recognition are all built on keypoints being reliably extracted and matched across images.

The SIFT algorithm [1] was proposed by Canadian professor David G. Lowe in 1999 and refined in 2004 [2]; it maintains good stability under image transformations. Following SIFT, Bay first proposed the SURF algorithm in 2006 [3]. SURF alleviates SIFT's high computational complexity and long running time, but the increase in speed is accompanied by a decline in performance, especially under scale and rotation changes. ORB, proposed in 2011 [4], locates corner points with the FAST algorithm and obtains their orientation with the grayscale centroid method. Its descriptor is a 128-bit binary string, more concise than SIFT's. These choices greatly speed up keypoint description and matching, but the false match rate increases at the same time. Keypoint features of images have been widely used due to their excellent properties, and many improved algorithms [5-8] have subsequently emerged, such as the GLOH algorithm [5], the Daisy algorithm [6], the AB-SIFT algorithm [7], and the PCA-SIFT algorithm [8].

Methods based on deep learning, such as D2-Net [9] and Patch2Pix [10], show excellent performance in keypoint detection. Deep learning largely avoids the need for human experience and prior knowledge. However, the image information involved in deep learning, such as gradients, is still low level, and these methods depend heavily on the framework and the dataset. At the same time, their computational overhead is high; most deep learning solutions have poor real-time performance and require GPU acceleration.

SIFT-based keypoint detection has proven successful in many applications and remains one of the most popular and powerful approaches today. Due to the complexity of the algorithm, however, its speed is limited. In our experiments, some processes in the SIFT algorithm were found to be unnecessary, and the framework itself is still worth investigating. This paper examines the necessity of these processes and proposes a new framework and algorithm based on SIFT. The aim is to simplify the computation as much as possible without compromising performance, and thereby to improve the execution efficiency of the algorithm.

II. REVIEW OF THE SIFT ALGORITHM

SIFT (Scale-Invariant Feature Transform) transforms image data into scale-invariant coordinates relative to local features. The computational stages for generating image keypoints are as follows.

A. Scale-space extrema detection

Detecting scale-space extrema involves first searching image locations over all scales by constructing a Gaussian pyramid and a difference-of-Gaussian (DOG) pyramid.

The Gaussian pyramid of an image is defined by the function L(x, y, σ), which is generated by the convolution of a variable-scale Gaussian function G(x, y, σ) with the input image I(x, y):

    L(x, y, σ) = G(x, y, σ) * I(x, y)    (1)

For each octave of the Gaussian pyramid, images are repeatedly convolved with the Gaussian function to produce the set of grayscale scale-space images shown in Fig. 1(a). After the last layer of the current octave is generated, the Gaussian image is downsampled by a factor of 2 to create a new octave, and the process is repeated until the pyramid is complete.

A point D(x, y, σ) in the DOG pyramid is calculated as the difference between two neighboring Gaussian images:

    D(x, y, σ) = (G(x, y, kσ) − G(x, y, σ)) * I(x, y)
               = L(x, y, kσ) − L(x, y, σ)    (2)

As shown in Fig. 1(b), neighboring Gaussian images are subtracted to obtain the DOG image on the right.

Fig. 1. The construction of scale space: (a) Gaussian pyramid and DOG pyramid; (b) Gaussian pyramid construction.

Each candidate point is then compared with the other 3×3×3−1 = 26 points in its neighborhood at the same scale and in the corresponding neighborhoods at the two adjacent scales, yielding the extreme points in both the 2D image space and the scale space; the keypoints detected through this initial localization therefore carry both positional and scale coordinates.
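To make this stage concrete, here is a minimal Python sketch of the pyramid construction and the 26-neighbor extremum test, using OpenCV and NumPy; the octave and layer counts and the base sigma of 1.6 are illustrative assumptions, not the paper's exact settings.

```python
import cv2
import numpy as np

def build_dog_pyramid(img, n_octaves=4, n_layers=5, sigma=1.6):
    """Build the Gaussian pyramid (Eq. 1) and the DOG pyramid (Eq. 2)."""
    k = 2 ** (1.0 / (n_layers - 2))          # scale step between layers
    gauss, dog = [], []
    base = img.astype(np.float32)
    for _ in range(n_octaves):
        octave = [cv2.GaussianBlur(base, (0, 0), sigma * k ** i)
                  for i in range(n_layers)]
        gauss.append(octave)
        # Eq. (2): each DOG layer is the difference of neighboring Gaussian layers
        dog.append([octave[i + 1] - octave[i] for i in range(n_layers - 1)])
        # downsample the last layer by a factor of 2 to start the next octave
        base = cv2.resize(octave[-1], (base.shape[1] // 2, base.shape[0] // 2))
    return gauss, dog

def is_extremum(dog_octave, s, y, x):
    """D(x, y, sigma) must beat all 3x3x3 - 1 = 26 neighbors in scale space."""
    cube = np.stack([dog_octave[s + d][y - 1:y + 2, x - 1:x + 2]
                     for d in (-1, 0, 1)])
    center = cube[1, 1, 1]
    return center == cube.max() or center == cube.min()
```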
B. Keypoint localization

Once the extreme points are detected, a problem becomes apparent: they are not accurate enough, because the difference-of-Gaussian pyramid is discrete. The true extremum is likely to lie nearby, so a Taylor expansion has to be used to locate the extremum with higher, sub-pixel positional accuracy. Points with low contrast and points lying on edges are regarded as unstable extrema and should be suppressed.
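As a concrete illustration of the edge suppression mentioned above, the sketch below applies the principal-curvature ratio test from Lowe's paper [2] to a DOG layer; the threshold r = 10 is the value suggested there, and the helper name is our own.

```python
import numpy as np

def passes_edge_test(dog, y, x, r=10.0):
    """Reject edge responses via the 2x2 Hessian of the DOG layer.
    A point is kept only if the ratio of its principal curvatures is below r."""
    dxx = dog[y, x + 1] + dog[y, x - 1] - 2.0 * dog[y, x]
    dyy = dog[y + 1, x] + dog[y - 1, x] - 2.0 * dog[y, x]
    dxy = (dog[y + 1, x + 1] - dog[y + 1, x - 1]
           - dog[y - 1, x + 1] + dog[y - 1, x - 1]) / 4.0
    tr, det = dxx + dyy, dxx * dyy - dxy * dxy
    if det <= 0:                      # curvatures of opposite sign: edge-like
        return False
    return tr * tr / det < (r + 1.0) ** 2 / r
```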
C. Orientation assignment

Orientations are assigned to each keypoint location based on the local image gradient directions. All subsequent operations are performed on image data transformed relative to the orientation, scale, and position assigned to each feature.

D. Keypoint descriptor

The keypoint descriptor is created by calculating the gradient magnitude and direction at each image sample point in the region surrounding the keypoint location.
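To make the orientation step concrete, here is a minimal sketch of the standard 36-bin gradient-orientation histogram from which the dominant direction is taken; the square window and the absence of Gaussian weighting are simplifications of the actual procedure.

```python
import numpy as np

def dominant_orientation(L, y, x, radius=8, n_bins=36):
    """Histogram of gradient directions around (x, y) in Gaussian image L;
    the peak bin gives the keypoint's dominant orientation in degrees."""
    hist = np.zeros(n_bins)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            yy, xx = y + dy, x + dx
            if not (0 < yy < L.shape[0] - 1 and 0 < xx < L.shape[1] - 1):
                continue
            gx = L[yy, xx + 1] - L[yy, xx - 1]
            gy = L[yy + 1, xx] - L[yy - 1, xx]
            mag = np.hypot(gx, gy)                       # gradient magnitude
            ang = np.degrees(np.arctan2(gy, gx)) % 360.0  # direction in [0, 360)
            hist[int(ang // (360.0 / n_bins)) % n_bins] += mag
    return hist.argmax() * (360.0 / n_bins)
```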
III. ALGORITHM

A. Construction of Scale Space

The construction of the scale space in the SIFT algorithm can be summarized as Gaussian kernel convolution and downsampling. Its purpose is to simulate the degrees of blur and the sizes of objects as seen in vision, so as to mitigate the effect of scale on keypoint extraction. David G. Lowe upsamples the original image to obtain better results. However, the upsampling process is not essential for ordinary images. Upsampling with bilinear interpolation expands the width and height of the image, producing an image that is not precise enough for keypoint extraction. At the same time, it creates many keypoints that duplicate those at the original image scale, which leads to a dramatic increase in the number of keypoints in the upsampled image; this repetition of keypoints causes a sharp drop in the final match rates. For computer vision, higher matching accuracy and a more uniform distribution of keypoints are far more important than the sheer number of keypoints. Upsampling brings a huge amount of computation and increased time consumption, but it does not bring better matching accuracy or efficiency.

For the top levels of the pyramid, the image resolution is too small to extract matchable keypoints, so it is clearly unnecessary to spend computation on producing more pyramid layers. Thus, the number of Gaussian pyramid octaves O is set as:

    O = log2 min(M, N) − 4    (3)

where M and N are the numbers of rows and columns of the original image and the logarithm is rounded down. This guarantees that the width and height of the top-level image are at least 32 pixels.
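As a quick worked example of Eq. (3) (the helper name is ours):

```python
import math

def num_octaves(rows, cols):
    """Eq. (3): O = log2(min(M, N)) - 4, with the logarithm rounded down."""
    return int(math.log2(min(rows, cols))) - 4

print(num_octaves(1080, 1920))  # -> 6 octaves; top level is 1080 / 2**5 ~ 34 px high
```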
B. Algorithm Framework

Refining the extreme points with a Taylor expansion leads to more accurate keypoint locations, but it also incurs significant computation. The stable extreme points obtained in the Gaussian pyramid are in fact largely sufficient for aligning two images. To obtain stable keypoints, the extreme points still need further refinement, which mainly consists of suppressing low-contrast points and removing edge-response points. Since there is no need to position the keypoints with a Taylor expansion, points with low contrast can be removed in advance to reduce the subsequent calculations, and this can be done before the extrema are searched for in the DOG pyramid.

Unlike the original SIFT, S-SIFT changes the order of the keypoint detection steps; Fig. 2 shows the S-SIFT block diagram. After the DOG pyramid is constructed, the points D(x, y, σ) below the contrast threshold are filtered away. Since the low-contrast points have already been suppressed, unnecessary calculations in the subsequent stages are effectively eliminated. Changing the order of the algorithm steps not only improves the speed of the algorithm in software, but also suppresses redundant signal activity in hardware designs. Extrema detection and edge-effect detection are then carried out on the remaining points. Orientation assignment and the keypoint descriptor are consistent with the original SIFT.

Fig. 2. S-SIFT block diagram.
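A minimal sketch of this reordering, assuming the contrast test is a simple absolute-value threshold on the DOG response (the function names are ours; is_extremum and passes_edge_test are the helpers sketched in Section II):

```python
import numpy as np

def s_sift_candidates(dog_octave, contrast_thresh=1.0):
    """S-SIFT order: threshold the DOG response first, then run the
    extremum and edge tests only on the surviving pixels."""
    keypoints = []
    for s in range(1, len(dog_octave) - 1):
        layer = dog_octave[s]
        # Step 1: suppress low-contrast points before any extremum test.
        ys, xs = np.nonzero(np.abs(layer) >= contrast_thresh)
        for y, x in zip(ys, xs):
            if 0 < y < layer.shape[0] - 1 and 0 < x < layer.shape[1] - 1:
                # Step 2: extremum and edge-response tests on the remainder.
                if is_extremum(dog_octave, s, y, x) and passes_edge_test(layer, y, x):
                    keypoints.append((x, y, s))
    return keypoints
```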
IV. EXPERIMENTS

In our experiments, the compared methods are SIFT and SURF, using the implementations and the same parameter sets provided by OpenCV to obtain the comparison results. In S-SIFT, we set the contrast threshold to 1, and the remaining parameters are the same as in the original SIFT. Three groups of images (Fig. 3) are used for our tests, where the right reference image in each group is obtained by transforming the left one; the transformations include rotation, scaling, and affine warps. The resolutions of the images are 1920×1080, 3840×2160, and 2000×1500, respectively. For matching we use KNN [11], the most widely used matching algorithm. Since the transformation matrix of each image is known, the correct matching points can be computed; a match is considered correct if the detected point lies within 5 pixels of the correct match.

The proposed S-SIFT, the original SIFT, and SURF are adopted in turn to conduct the matching experiments. The results are shown in TABLE I and TABLE II. Match correctness is calculated as the number of matched points divided by the average number of keypoints in the two original images.
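Under this protocol, a minimal evaluation sketch might look as follows; cv2.SIFT_create as the detector and the 0.75 ratio-test value are assumptions on our part, while the 5-pixel tolerance and the correctness definition follow the text above.

```python
import cv2
import numpy as np

def match_and_score(img1, img2, H_true):
    """KNN matching with a ratio test, scored against a known ground-truth
    homography H_true (3x3) that maps img1 coordinates into img2."""
    det = cv2.SIFT_create()
    kp1, des1 = det.detectAndCompute(img1, None)
    kp2, des2 = det.detectAndCompute(img2, None)
    knn = cv2.BFMatcher().knnMatch(des1, des2, k=2)
    good = [p[0] for p in knn
            if len(p) == 2 and p[0].distance < 0.75 * p[1].distance]

    correct = 0
    for m in good:
        # project the img1 keypoint by the known transformation
        p = H_true @ np.array([*kp1[m.queryIdx].pt, 1.0])
        if np.linalg.norm(p[:2] / p[2] - kp2[m.trainIdx].pt) <= 5.0:
            correct += 1
    # match correctness = matched points / average keypoint count (Table I)
    return correct / ((len(kp1) + len(kp2)) / 2.0)
```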
Although SURF obtains a large number of keypoints in a short time, its accuracy is much lower than that of the other algorithms; in Group3 it is less than a third of SIFT's accuracy. At the same time, its KNN matching time increases dramatically because of the large number of keypoints. In applications such as image stitching and object recognition, a few hundred keypoints are enough to achieve good results. While the number of keypoints is reduced, S-SIFT's computation and matching times drop significantly, and its accuracy rises by a few percentage points.
Fig. 3. Images used for testing.

TABLE I. THE PERFORMANCE COMPARISON
(the two values in the Time and Keypoints columns correspond to the two images of each group)

Method           Time                Keypoints        Matches num   Match correctness
Group1+SIFT      4.566s / 7.108s     7731 / 11052     4234          45.08%
Group1+SURF      1.610s / 2.464s     13713 / 18189    4959          34.95%
Group1+S-SIFT    1.283s / 1.974s     3202 / 4657      1961          49.90%
Group2+SIFT      12.554s / 17.849s   10606 / 9903     3210          31.46%
Group2+SURF      4.957s / 6.504s     30829 / 33823    6902          21.35%
Group2+S-SIFT    3.094s / 3.935s     5710 / 5964      1996          34.20%
Group3+SIFT      10.223s / 13.269s   24505 / 29973    9833          36.10%
Group3+SURF      5.049s / 5.485s     57397 / 52222    5587          10.19%
Group3+S-SIFT    2.891s / 3.255s     8429 / 8901      3802          43.88%

TABLE II. KNN TIME CONSUMPTION

          SIFT       SURF       S-SIFT
Group1    2.732s     3.464s     0.373s
Group2    2.863s     14.190s    0.783s
Group3    21.687s    43.146s    1.704s

V. CONCLUSION

Aiming at the high computational complexity and consequent difficulty of practical use of the original SIFT, we propose S-SIFT, a simpler and more efficient keypoint detection algorithm. Through experiments we find that the upsampling process is actually unnecessary for images of typical resolutions. Meanwhile, S-SIFT removes the Taylor expansion step and changes the sequence of the detection steps. Through the collaborative improvement and optimization of the scale-space construction and the algorithm framework, the performance of the algorithm is improved significantly. Experiments demonstrate that the matching accuracy of S-SIFT is a few percentage points higher than that of SIFT, and that S-SIFT is more robust to image transformations. While ensuring satisfactory accuracy, the computation time for extracting keypoints is reduced to less than a quarter of the original, and the KNN matching time even drops to about 10% of the original in Group3. However, this work still leaves room for improvement: generating the descriptors requires information from the neighborhood of each keypoint, which inevitably demands considerable time and storage. Further optimization can be applied to the generation of keypoint descriptors.
REFERENCES

[1] Lowe D G. Object Recognition from Local Scale-Invariant Features[C]. Proceedings of the International Conference on Computer Vision, 1999.
[2] Lowe D G. Distinctive Image Features from Scale-Invariant Keypoints[J]. International Journal of Computer Vision, 2004, 60(2):91-110.
[3] Bay H, Ess A, Tuytelaars T and Van Gool L. Speeded-Up Robust Features (SURF)[J]. Computer Vision and Image Understanding, 2008, 110(3):346-359.
[4] Rublee E, Rabaud V, Konolige K and Bradski G. ORB: An Efficient Alternative to SIFT or SURF[C]. International Conference on Computer Vision (ICCV), 2011:2564-2571.
[5] Mikolajczyk K, Schmid C. A Performance Evaluation of Local Descriptors[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2005, 27(10):1615-1630.
[6] Tola E, Lepetit V, Fua P. DAISY: An Efficient Dense Descriptor Applied to Wide-Baseline Stereo[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010, 32(5):815-830.
[7] Sedaghat A, Ebadi H. Remote Sensing Image Matching Based on Adaptive Binning SIFT Descriptor[J]. IEEE Transactions on Geoscience and Remote Sensing, 2015, 53(10):5283-5293.
[8] Ke Y, Sukthankar R. PCA-SIFT: A More Distinctive Representation for Local Image Descriptors[C]. Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004.
[9] Dusmanu M, et al. D2-Net: A Trainable CNN for Joint Detection and Description of Local Features[C]. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[10] Zhou Q, Sattler T and Leal-Taixé L. Patch2Pix: Epipolar-Guided Pixel-Level Correspondences[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021:4669-4678.
[11] Abeywickrama T, Cheema M A and Taniar D. k-Nearest Neighbors on Road Networks: A Journey in Experimentation and In-Memory Implementation[J]. Proceedings of the VLDB Endowment, 2016, 9:492-503.