Learning to Detect and Match Keypoints
with Deep Architectures
Hani Altwaijry¹,² (haltwaijry@cs.cornell.edu)
Andreas Veit¹,² (aveit@cs.cornell.edu)
Serge Belongie¹,² (sjb344@cornell.edu)
¹ Cornell University, Ithaca, NY, USA
² Cornell Tech, New York, NY, USA
Abstract
Feature detection and description is a pivotal step in many computer vision pipelines.
Traditionally, human engineered features have been the main workhorse in this domain.
In this paper, we present a novel approach for learning to detect and describe keypoints
from images leveraging deep architectures. To allow for a learning based approach, we
collect a large-scale dataset of patches with matching multiscale keypoints. The pro-
posed model learns from this vast dataset to identify and describe meaningful keypoints.
We evaluate our model for the effectiveness of its learned representations for detecting
multiscale keypoints and describing their respective support regions.
1 Introduction
The extraction of effective features is a key step in many machine learning and computer
vision algorithms and their applications. In computer vision, one form of feature extraction is
concerned with the detection and description of important image regions. Traditionally, these
features are extracted using hand engineered detectors and descriptors. Approaches adopting
this paradigm are generally referred to as keypoint-based or feature-based approaches.
Recently, the reintroduction of neural networks into many computer vision tasks broadly
replaced hand-engineered feature-based approaches. Neural network based approaches gen-
erally learn the feature extraction as part of an end-to-end pipeline. While these approaches
have shown great success in tasks such as scene recognition, object detection and classifica-
tion, other tasks such as structure-from-motion still depend on purely engineered features,
e.g. SIFT [18], to detect and describe keypoints.
In this paper, we propose a model that learns what constitutes a good keypoint, is capable
of capturing keypoints at multiple scales and learns to decide whether two keypoints match.
We achieve multiscale keypoint detection with a fully-convolutional network that recursively
applies convolutions to regress keypoint scores. With each successive convolution, the
network evaluates image patches, i.e., candidate keypoints, at a larger scale. By extracting the
keypoint feature map after each convolution, we obtain responses that resemble a keypoint
scale-space. To learn descriptors for keypoint matching, we leverage a triplet network to learn
an embedding in which patches of matching keypoints are closer to each other than
non-matching patches. Figure 1 provides an overview of our proposed model.
Figure 1: Proposed architecture for learning to detect and describe keypoints at multiple-
scales. Given an image, a fully-convolutional recursive network outputs a scale-pyramid of
keypoint responses, which are used to extract patches. Then, the patches are described by a
patch descriptor network.
There is currently no large-scale dataset for learning both keypoint detectors and de-
scriptors from image patches. Furthermore, finding training examples to train deep neural
networks for this task poses a serious challenge, as collecting human annotated examples
would be prohibitively expensive. Therefore, we create our own dataset by following a self-
supervised approach, where we utilize structure-from-motion to build a large database of
keypoints and matching image patches. Although those feature matches were determined
originally with engineered features, structure-from-motion also factors in the underlying ge-
ometry. We only consider those keypoints that went through rigorous geometric filtering,
which allows the learning of features that extend upon their engineered counterparts.
To create our supervisory examples, we collect a dataset of aerial oblique imagery and
construct a large-scale model of 1.3 million 3D points using VisualSFM [28, 29]. We use
those 3D points to extract matching patches exhibiting varying photometric and geometric
differences, including scale, illumination, and perspective. Those patches form the basis from
which our deep neural network model learns to detect and match keypoints.
We evaluate the proposed model both quantitatively and qualitatively and show that it is
capable of identifying keypoints at multiple scales as well as matching them.
Our main contributions are:
1. We propose a novel approach capable of learning to detect multiscale keypoints and
to describe them for effective correspondence matching.
2. We introduce a large-scale dataset composed of over 2.5 million matching image
patches at varying scales.
2 Related Work
2.1 Feature Extraction and Description
The computer vision literature has served up a large number of engineered feature extrac-
tors and descriptors, such as SIFT [18], HOG [9], SURF [4], BRIEF [7], and BRISK [16].
These extractors and descriptors were designed with multiple goals in mind, such as opti-
mizing for matching accuracy or extraction and matching speeds. In general, they have been
demonstrated to perform well in various applications of computer vision. Furthermore, the
literature has seen approaches that learn keypoint detectors [13, 27] and descriptors [3, 6, 24].
In contrast to these works, we strive to learn both the keypoint detector and the descriptor.
In correspondence matching problems, descriptors are used to find putative correspondences
between two or more sets of keypoints, which are then filtered by imposing geometric
constraints through model-fitting techniques such as RANSAC [10]. Structure-from-motion
solutions, e.g., VisualSFM [28, 29], start with correspondence matching and extend the
computed relationships across many images, building a global model that governs all of them. In this work,
we leverage the compounded effect of geometry on engineered features to provide our su-
pervisory signal.
2.2 Deep-learning and Matching Images
In recent years, the computer vision literature has seen a surge of state-of-the-art results, on
all fronts, surfacing from research on Deep Convolutional Neural Networks [11, 15, 25, 26].
In [12, 23, 30], deep architectures were proposed to learn feature descriptors. The
siamese architecture [5] forms the basis for these approaches, with the neural networks learn-
ing to embed 64×64 patches in a feature space where matching patches are closer to each
other than non-matching patches. Their supervisory signal is based on structure-from-motion
patches originally used in [6]. However, they do not learn keypoint detection, and do not han-
dle various scales natively. We build on these approaches by showing how to create a model
that learns to predict the keypoints and their respective descriptors at various scales.
The detection of salient regions with deep architectures has been mostly discussed within
the object detection and recognition literature. In [17], features in later layers were shown
to correspond to fine details in the receptive fields covered by those features. One approach
to generating salient region proposals is to use visual attention models, e.g., [2, 20], in which a
recurrent network is trained to examine and propose regions of the image sequentially.
Attention mechanisms typically learn the salient features in an unsupervised manner. One
particular approach is the Spatial Transformer Network [14] which describes a region pro-
posal scheme capable of highlighting regions with associated transformations to a canonical
pose that is learned automatically. In [1], spatial transformer networks are used to detect a
fixed number of probable patch matches. In essence, there the network attempts to detect
and match patches simultaneously, with only weak-supervision from match/no-match labels
on the image level. Another approach to generating region proposals is to use Region Proposal
Networks [11], whose sole purpose is to identify regions of the image that contain
objects. Region proposal networks are generally trained in a fully supervised manner. This
approach has been recently extended [21] by coupling the proposal network with the classi-
fication network for faster performance. We draw inspiration from these works for modeling
a network capable of proposing keypoints.
3 Learning Model
The goal of this model is to learn to detect and match keypoints in images. We achieve this
by using two models, one for each task. The first is a keypoint detection network, and the
second is a keypoint description network.
Figure 2: Generating multiscale matching patches using structure-from-motion.
3.1 Training Data
To train the keypoint detection network and the keypoint matching network, we require a
large set of patches with high quality keypoints that are also annotated with pairwise match
information. Since no such large-scale dataset currently exists, a key aspect of our data-
driven learning approach is the collection of a large-scale training dataset. For that purpose,
we follow a self-supervised data collection scheme.
Our data generation approach is similar to that of [6]. We rely on structure-from-motion
techniques to identify good keypoints and to generate matching patches. However, our ap-
proach differs in that we keep the original patch sizes, without rescaling to a canonical size.
This allows us to train a multiscale detector.
We use aerial imagery covering an area of 15 × 15 km² around the city of Boston, Mas-
sachusetts, to construct a 3D model using VisualSFM [28, 29]. The model contains 1.3
million 3D points where each is observed from at least two cameras, i.e. images. For a single
3D point with k associated keypoints, there are k(k − 1)/2 unique keypoint pairs that we can
extract as matching patch pairs. To generate patches that do not constitute good keypoints,
we randomly sample image patches that do not belong to any keypoints. Figure 2 gives an
overview of the approach.
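For illustration, a minimal sketch of this pair-generation step is given below. It assumes a hypothetical in-memory representation in which each 3D point stores the patches that observe it and each image exposes simple `crop`/`overlaps_keypoint` helpers; these names are ours and not part of any released pipeline.

```python
import itertools
import random

def build_match_set(points_3d, images, num_negatives, patch_size=64):
    """Hypothetical sketch: enumerate matching patch pairs from an SfM model.

    `points_3d` maps each 3D point to the list of patches (one per
    observing camera) extracted around its 2D keypoints.
    """
    matches = []
    for patches in points_3d.values():
        # every pair of observations of the same 3D point is a positive match
        for p_a, p_b in itertools.combinations(patches, 2):
            matches.append((p_a, p_b, 1))

    negatives = []
    while len(negatives) < num_negatives:
        img = random.choice(images)
        x = random.randrange(img.width - patch_size)
        y = random.randrange(img.height - patch_size)
        # keep only patches that do not coincide with any SfM keypoint
        if not img.overlaps_keypoint(x, y, patch_size):
            negatives.append(img.crop(x, y, patch_size))
    return matches, negatives
```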
The keypoint scales extracted from the 3D model are continuously valued. For our ap-
proach, we discretize the scale values into five scales: S = {64, 96, 128, 192, 256}. We deter-
mined the set of scales by clustering the scale ranges in the extracted dataset. As we show
later, this discretization does not limit the model. The fixed scale range affects our design
in only one way: the smallest scale the model handles is 64 × 64. There is,
however, no limit on the largest scale. This is a result of the recursive architecture, which we
will discuss in the following subsection.
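As a concrete example, one simple way to snap a continuous keypoint scale to the nearest element of $S$ is shown in the sketch below; the clustering that produced the five scales is assumed to have happened offline.

```python
SCALES = (64, 96, 128, 192, 256)

def discretize_scale(scale: float) -> int:
    """Snap a continuous keypoint scale to the nearest scale in S."""
    return min(SCALES, key=lambda s: abs(s - scale))

assert discretize_scale(75.0) == 64
assert discretize_scale(210.0) == 192
```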
We denote the generated set of patches P as follows:
$P = \{ p_i : (x_i, s_i, k_i);\; k_i \in \{-1, 1\} \}$  (1)
where $p_i$ is a patch with raw pixels $x_i$, scale $s_i$, and keypoint label $k_i$. Further, we
denote the generated set of matches $M$ as
$M = \{ m_i : (p_j, p_k, y_i);\; p_{j,k} \in P,\; y_i \in \{-1, 1\} \}$  (2)
where each match $m_i$ is a tuple that references two patches $p_j$ and $p_k$, with $y_i$ being
the binary match label.
3.2 Detection Network
The goal of the detection network is to identify the regions of the input image that constitute
good keypoints. Identifying keypoint regions includes finding both the optimal keypoint locations and their scales.
Figure 3: Training architecture for multiscale keypoint detection network. First, patches pass
through a set of convolutions and pooling layers. Then, a recursive convolution is applied
until the feature map dimension is 1 × 1. Since a batch can contain patches of different
scales, a scale-dependent branch is chosen for each patch determining the number of recur-
sive convolutions. Finally, two fully connected layers lead into a binary keypoint classifier.
In particular, we learn a nonlinear function $f(X)$ from images
$X$ into a feature space $\mathbb{R}^{w \times h}$, where high activations correspond to image regions
that constitute good keypoints. The architecture used for training the detection network dif-
fers slightly from the architecture used during inference, since it is trained on image patches,
but inference is performed on whole images.
3.2.1 Training Procedure
Figure 3 illustrates the training architecture. The inputs to the network are batches of patches
$\{p_i\} \subset P$ and associated binary labels indicating whether the patches represent good key-
points. In essence, the detection network is a binary classification CNN that learns to decide
whether a given patch constitutes a good keypoint or not. As such, it consists of a sequence
of convolutional and pooling layers followed by two fully-connected layers for classification.
Keypoints vary widely in scale, and thus the patches come in many
different sizes. To address this, we propose a scale-dependent branching mechanism, shown
in Figure 3 by blue arrows. There, a scale-dependent branch is chosen for each patch. Within
each branch convolutional filters are applied recursively until the output feature is of dimen-
sion (d × 1 × 1). This allows for encoding keypoints of varying scales in a common feature
space of fixed size. This is essential for efficient multiscale inference. All convolutions
across all scale-dependent branches share the same weights, allowing for inference over ar-
bitrarily large input images. In essence, the recursive application of the same convolutional
filters resembles a rolled-out recurrent neural network for handling multiscale inputs. The d-
dimensional output from the recursive branches is then used to determine whether the patch
is centered around a good keypoint.
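A rough PyTorch sketch of this recursive, weight-shared branch is given below. The stem sizes are illustrative rather than the exact configuration of Table 1, and the one-pixel padding guard is our addition so that arbitrary map sizes collapse cleanly to 1×1.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RecursiveBranch(nn.Module):
    """Sketch: one 3x3 convolution whose weights are shared across all
    scale-dependent branches and applied recursively until the feature
    map collapses to 1x1 (layer sizes are illustrative)."""

    def __init__(self, d=256):
        super().__init__()
        self.stem = nn.Sequential(               # shared convolution/pooling stem
            nn.Conv2d(3, 128, 3, stride=2), nn.BatchNorm2d(128), nn.ReLU(),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(128, d, 3), nn.BatchNorm2d(d), nn.ReLU(),
        )
        self.shared = nn.Conv2d(d, d, 3)         # reused at every recursion step
        self.bn = nn.BatchNorm2d(d)

    def forward(self, patch):
        x = self.stem(patch)
        # larger patches yield larger maps and therefore more recursion steps;
        # pad by one pixel when needed so the map shrinks cleanly to 1x1
        while min(x.shape[-2:]) > 1:
            if min(x.shape[-2:]) < 3:
                x = F.pad(x, (0, 1, 0, 1))
            x = torch.relu(self.bn(self.shared(x)))
        return x.flatten(1)                      # fixed d-dim code for any scale
```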
To sample training patches, we use hard-negative mining to improve the performance of
the keypoint detector. For each training batch, we randomly sample the dataset, searching for
patches with high loss, to construct batches of difficult examples. Each batch is chosen to
have a mix of positives and negatives with a 1:1 ratio.
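One possible realization of this mining step, sketched under the assumption that the current network returns a scalar keypoint score per patch (higher meaning "more keypoint-like"):

```python
import random
import torch

def sample_hard_batch(model, positives, negatives, batch_size, pool_size=4096):
    """Sketch: build a 1:1 batch of currently difficult (high-loss) patches."""
    half = batch_size // 2

    def hardest(pool, want_high_score):
        # score a random candidate pool with the current network
        cand = random.sample(pool, min(pool_size, len(pool)))
        with torch.no_grad():
            scores = model(torch.stack(cand)).view(-1)  # keypoint score per patch
        order = scores.argsort(descending=want_high_score)
        return [cand[i] for i in order[:half].tolist()]

    hard_pos = hardest(positives, want_high_score=False)  # positives scored low
    hard_neg = hardest(negatives, want_high_score=True)   # negatives scored high
    labels = torch.tensor([1.0] * half + [-1.0] * half)
    return torch.stack(hard_pos + hard_neg), labels
```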
3.2.2 Training Objective
We define a loss function $L_{KP}$ for training the keypoint detection network. The loss function
comprises two terms. First, as we model keypoint detection as a binary classification problem,
we make use of the hinge loss to define the first term.
Figure 4: Inference architecture for multiscale keypoint detection network. First, an input
image is passed through a set of convolutions and pooling layers. Then, a recursive convo-
lution is applied until the feature map dimension is 1 × 1. After each recursive convolution
we compute the keypoint feature map. Since convolutions later in the network have larger
receptive fields the output feature maps resemble a keypoint scale-space.
Second, we use a squared-difference
loss to penalize network responses on non-centered patches. This encourages a Gaussian-
like response around the center of the patch and is inspired by the response shape penalty
used in [27]:
$h_j = e^{-\|v_j\|^2 / (2\sigma^2)}$  (3)
where $v_j$ is the vector from the keypoint to the center of the patch. During training,
non-centered patches are generated by extracting patches jittered around the keypoints. This
serves as data augmentation and encourages maximal responses at the centers of informative
regions. Putting the two terms together, the joint loss function is given as
$L_{KP} = \frac{1}{N} \sum_j \left[ \lambda \max\left(0,\, 1 - y_j x_j\right) + (1 - \lambda)\left(x_j - h_j\right)^2 \right]$  (4)
with $x_j$ as the network output, $y_j \in \{-1, 1\}$ as the training label, and $\lambda$ as a mixing weight.
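Equations (3) and (4) translate almost directly into code; the following sketch assumes batched tensors and treats σ and λ as unspecified hyperparameters.

```python
import torch

def keypoint_loss(x, y, v, lam=0.5, sigma=8.0):
    """Sketch of Eq. (3)-(4): hinge term plus Gaussian response-shape term.

    x -- network outputs, shape (N,)
    y -- labels in {-1, +1}, shape (N,)
    v -- offsets from the keypoint to the patch center, shape (N, 2)
    """
    h = torch.exp(-(v ** 2).sum(dim=1) / (2 * sigma ** 2))  # Eq. (3)
    hinge = torch.clamp(1 - y * x, min=0)                   # classification term
    shape = (x - h) ** 2                                     # response-shape term
    return (lam * hinge + (1 - lam) * shape).mean()          # Eq. (4)
```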
3.2.3 Inference
Figure 4 depicts the inference architecture, which differs slightly from the training archi-
tecture. During inference the network processes whole images, as opposed to patch-sized
inputs. However, we assume that input images are at least of size 64 × 64.
Instead of outputting a single value describing the keypoint quality of a single patch, the network is
converted to be fully convolutional so as to output a feature map in which each value corresponds
to the keypoint score of a specific image region. In particular, we compute the keypoint
feature map after each recursive convolution. As inputs progress deeper into the network,
the receptive field of individual neurons increases so that larger patches in the input image
are considered. As a result, the output feature maps resemble a keypoint scale-space. We
illustrate this in Figure 5. This allows us to select the best scale for each patch by finding the
scale with the highest keypoint response score. Finally, the best keypoints are extracted with
non-maximum suppression.
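The inference logic can be sketched as follows. We assume the per-scale response maps have already been resampled to a common resolution; the greedy window-based NMS is a simplification of whatever suppression scheme is used in practice.

```python
import numpy as np

def extract_keypoints(score_maps, scales, threshold=0.5, nms_radius=8):
    """Sketch: pick the best scale per location, then apply a greedy NMS.

    `score_maps` is a list of HxW response maps, one per recursion step,
    assumed here to be already resampled to a common resolution.
    """
    stack = np.stack(score_maps)                     # (num_scales, H, W)
    best_scale = stack.argmax(axis=0)                # scale with highest response
    best_score = stack.max(axis=0)

    ys, xs = np.where(best_score > threshold)
    order = np.argsort(-best_score[ys, xs])          # strongest responses first
    keypoints, taken = [], np.zeros(best_score.shape, dtype=bool)
    for i in order:
        y, x = ys[i], xs[i]
        if taken[y, x]:
            continue
        keypoints.append((x, y, scales[best_scale[y, x]], best_score[y, x]))
        taken[max(0, y - nms_radius): y + nms_radius + 1,
              max(0, x - nms_radius): x + nms_radius + 1] = True
    return keypoints                                  # (x, y, scale, score) tuples
```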
3.3 Description Network
The goal of the descriptor network is to learn a nonlinear feature embedding $f(p)$ from
patches $p$ into a feature space $\mathbb{R}^d$, such that for a pair of patches $p_1$ and $p_2$ the Euclidean
distance between $f(p_1)$ and $f(p_2)$ is small if the patches match and large if they do not.
Figure 5: Convolutions later in the detection network correspond to larger patch sizes. The keypoint feature map with the highest response indicates the best keypoint scale.
Figure 6: Training architecture of the keypoint description triplet network. Three patches are passed through channels which share weights to rank their Euclidean distances in the feature space.
The training follows an approach similar to the triplet network proposed in [22].
In particular, the nonlinear embedding should ensure that a patch $p_1$ (anchor) is closer to
all patches depicting the same keypoint $p_2$ (positive) than it is to any other patch $p_3$ (nega-
tive). Given the feature embedding and a set of keypoints with respective patches, the best
matching keypoint can be found by retrieving nearest neighbors in the embedding space.
3.3.1 Training Procedure
Figure 6 illustrates the training architecture. The inputs to the network are batches of patch
triplets $\{p_1, p_2, p_3\}$. First, each patch is fed through a convolutional neural network to com-
pute its embedding feature vector, where the three networks share the same weights. The feature
vectors are then normalized to lie on the $d$-dimensional unit hypersphere. Afterwards, the
pairwise Euclidean distances between the feature vectors of the anchor and the two other patches
are computed. The network is then supervised with the triplet ranking loss shown in Equation 5,
which projects matching patches closer together in the feature space than non-matching patches.
The patch triplets are sampled online. The anchor and the positive match are drawn
from the match set M. The negative patches are sampled at random. In order to ensure
good convergence, it is important to sample triplets that induce loss, i.e., triplets that violate the
triplet constraint. To obtain such triplets, we perform online hard-negative
mining. In particular, for each matching pair (anchor and positive) in the training batch,
we choose the negative patch in the batch that violates the triplet constraint the most. All
matching pairs within a batch can choose from the same set of negative patches.
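A sketch of the in-batch hardest-negative selection, assuming the embeddings of the batch have already been computed and L2-normalized (a real implementation would also mask out candidates depicting the same keypoint as the anchor):

```python
import torch

def hardest_in_batch_negatives(anchor, positive, candidates):
    """Sketch: pick, for each (anchor, positive) pair, the candidate embedding
    that violates the triplet constraint the most (smallest anchor distance).

    anchor, positive -- (B, d) embeddings;  candidates -- (M, d) embeddings
    """
    d_an = torch.cdist(anchor, candidates)   # (B, M) anchor-negative distances
    hardest = d_an.argmin(dim=1)             # closest negative = largest violation
    return candidates[hardest]               # (B, d) selected negatives
```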
3.3.2 Training Objective
We define a loss function $L_T$ for training the keypoint description network as follows. Given
triplets $T = \{t_j : (p_j^1, p_j^2, p_j^3)\}$ and a scalar margin $h$, the loss function is given by:
$L_T = \frac{1}{N} \sum_j \max\left[0,\; D(p_j^1, p_j^2) - D(p_j^1, p_j^3) + h\right]$  (5)
where $h$ is chosen as 0.2 and $D$ is the Euclidean distance function defined on the embedding
feature vectors, which are computed from the image patches:
$D(p_a, p_b) = \| f(p_a) - f(p_b) \|_2$  (6)
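Equations (5) and (6) correspond to the following short sketch:

```python
import torch

def triplet_loss(f_anchor, f_pos, f_neg, margin=0.2):
    """Sketch of Eq. (5)-(6): hinge on the gap between matching and
    non-matching Euclidean distances in the embedding space."""
    d_pos = torch.norm(f_anchor - f_pos, p=2, dim=1)  # D(p1, p2), Eq. (6)
    d_neg = torch.norm(f_anchor - f_neg, p=2, dim=1)  # D(p1, p3)
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()  # Eq. (5)
```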
Model Component Structure
Feature Detection C3/128/2-BN-P3/2-C3/128/1-BN-P3/2-C3/d/1-Repeat{C3/d/1-BN}
Keypoint Scoring C3/64/1-BN-C1/1/1
Patch Matching C3/128/2-BN-P3/2-C3/256/1-BN-P3/2-C3/256/1-BN-P3/2-L2Normalize
Table 1: Network structure parameters. Convolution is denoted with Ck/f/s, where k is the
kernel size, f is the number of filters or outputs, and s is the stride. Similarly, max pooling is
denoted with Pk/s, batch normalization with BN, and fully-connected layers with FC. The parameter
d denotes the number of filters in the convolutional layers, which varies among experiments.
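For concreteness, the Patch Matching row of Table 1 could be instantiated roughly as below. The notation fixes kernel sizes, filter counts, and strides; the ReLU nonlinearities and the three-channel input are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchDescriptor(nn.Module):
    """Sketch of the Table 1 'Patch Matching' column:
    C3/128/2-BN-P3/2-C3/256/1-BN-P3/2-C3/256/1-BN-P3/2-L2Normalize."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 128, kernel_size=3, stride=2), nn.BatchNorm2d(128), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(128, 256, kernel_size=3, stride=1), nn.BatchNorm2d(256), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(256, 256, kernel_size=3, stride=1), nn.BatchNorm2d(256), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )

    def forward(self, patch64):                 # patch64: (B, 3, 64, 64)
        x = self.features(patch64).flatten(1)   # collapses to a 256-dim vector
        return F.normalize(x, p=2, dim=1)       # unit-norm descriptor
```

With 64 × 64 inputs and no padding, the spatial size shrinks 64 → 31 → 15 → 13 → 6 → 4 → 1, yielding one 256-dimensional descriptor per patch.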
3.4 Full Model
After both networks are trained, keypoint detection and matching can be performed. The
process is similar to the traditional keypoint extraction and description pipeline.
First, a whole image is fed through the fully convolutional detection network. A sample
output is shown in Figure 8. From the output feature map, a set of keypoints are extracted by
filtering with non-maximum suppression. Then, for each keypoint, we crop a patch accord-
ing to the detected scale and rescale it to 64 × 64, the canonical patch size of the description
network. Subsequently, the keypoint descriptors are computed with the triplet network. Fi-
nally, given the keypoint descriptions for two images, corresponding keypoints are found
using nearest neighbor search.
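The complete pipeline can be summarized in a short sketch that ties the previous pieces together; `crop_and_resize` is a hypothetical helper, and `extract_keypoints` refers to the inference sketch above.

```python
import torch

def match_images(img_a, img_b, detector, descriptor,
                 scales=(64, 96, 128, 192, 256)):
    """Sketch of the full pipeline: detect, describe, and match keypoints."""
    detections = []
    for img in (img_a, img_b):
        maps = detector(img)                               # keypoint scale-space
        kps = extract_keypoints(maps, scales)              # NMS over responses
        # crop_and_resize is a hypothetical helper producing 64x64 tensors
        patches = torch.stack([crop_and_resize(img, x, y, s, out_size=64)
                               for x, y, s, _ in kps])
        with torch.no_grad():
            detections.append((kps, descriptor(patches)))  # unit-norm descriptors

    (kps_a, desc_a), (kps_b, desc_b) = detections
    nn_idx = torch.cdist(desc_a, desc_b).argmin(dim=1)     # nearest-neighbor search
    return [(kps_a[i], kps_b[j]) for i, j in enumerate(nn_idx.tolist())]
```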
4 Experiments
4.1 Experimental Setup and Model Parameters
We verify the effectiveness of our model by testing on a separate held-out test set, which
was formed by removing cameras (images) from the structure-from-motion 3D model prior
to training. The held-out set comprises about 800K patches of varying scale, with
matching information.
The specific parameters of the networks used in the experiments are described in Ta-
ble 1. All convolutions are without padding. For optimization, we used Stochastic Gradient
Descent with a learning rate of 0.01, momentum of 0.9, and a weight decay of 0.005.
4.2 Keypoint Detection
To test the keypoint detector, we run different versions of the keypoint-detection network,
and compute precision/recall for each. The networks differ in the number of feature dimen-
sions (referred to as d in Table 1) and in the hard-negative mining procedure.
Our first network “KP-1” has d = 256 and uses hard-negative mining from the first iter-
ation. The second network “KP-2” has d = 256 and uses hard-negative mining starting from
mid-training with the whole batch comprised of hard-negatives. The last network “KP-3”
has d = 128 and follows the same hard-negative mining procedure as “KP-1”. The preci-
sion/recall curves are shown in Figure 7.
The results indicate that using hard-negative mining from early training allows the model
to find a better solution than introducing hard negatives only at mid-training.
One explanation could be that, by mid-training, the model may have already arrived at a good
local minimum for identifying keypoints.
Figure 7: Precision/Recall curves for different variants of our keypoint detector (area under the curve: KP-1 89.3, KP-2 80.4, KP-3 88.9).
Figure 8: Sample keypoint detections on a full-sized image.
The results also show added benefits from additional model parameters. Overall, the model
performs well, with an area under the precision/recall curve of 89.3 on keypoints of varying scales.
4.3 Patch Matching
To evaluate our triplet-based patch matching network, we compare against DeepCompare
[30] and MatchNet [12], which both leverage a siamese-based architecture. The evaluation
is based on a retrieval framework. For a randomly sampled pair of matching patches in the
test set, we use one of the patches as the probe and the other as the target. The target is mixed with
a set of non-matching patches, for a total set of 100 patches. Then, given the probe, the task
is to find the matching patch within that set.
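This protocol amounts to the following evaluation sketch (Euclidean descriptor distances are assumed; for similarity-based methods the sort order is simply reversed):

```python
import numpy as np

def retrieval_accuracy(probe_desc, target_desc, distractor_desc):
    """Sketch: top-1 / top-5 retrieval over candidate sets of 100 patches.

    probe_desc, target_desc -- (N, d) matching descriptor pairs
    distractor_desc         -- (N, 99, d) non-matching descriptors per probe
    """
    top1 = top5 = 0
    for p, t, neg in zip(probe_desc, target_desc, distractor_desc):
        cands = np.vstack([t[None, :], neg])             # candidate 0 is the target
        dists = np.linalg.norm(cands - p, axis=1)        # distance to the probe
        rank = np.argsort(dists).tolist().index(0)       # position of the target
        top1 += rank == 0
        top5 += rank < 5
    n = len(probe_desc)
    return top1 / n, top5 / n
```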
In our evaluation we report the retrieval at rank 1 (top-1%) and within ranks [1-5] (top-
5%). The test includes two variants of our network. The first variant “Triplet-1” is based on
a small number of convolution and pooling layers, as shown in Figure 6. The second
variant “Triplet-2” is based on the VGG-16 network [8]. We run each network and rank
the matches according to distance or similarity (MatchNet and DeepCompare both follow a
similarity metric). Our test set contains 5K matching pairs, randomly sampled from the held-out
set. We perform two runs independently and report the average in Table 2.
Method                              Top-1%   Top-5%
Triplet-1                            73.8     93.4
Triplet-2                            76.6     95.5
MatchNet [12] - Liberty              57.3     82.3
MatchNet - Yosemite                  44.0     73.1
MatchNet - Notredame                 52.5     78.6
DeepCompare [30] - 2ch - Liberty     71.1     88.7
DeepCompare - 2ch - Yosemite         70.9     88.6
DeepCompare - 2ch - Notredame        71.9     88.0
DeepCompare - siam - Liberty         67.6     90.0
DeepCompare - siam - Yosemite        70.0     88.6
DeepCompare - siam - Notredame       70.7     91.0
Table 2: Retrieval at rank 1 (top-1%) and within ranks [1-5] (top-5%) on our test-set.
Figure 9: Qualitative evaluation of feature transferability: keypoint detection and matching
results from a network trained on aerial imagery and tested on the "Wall" image sequence from the Oxford
dataset [19]. For the first two images the network successfully retrieves the correct homog-
raphy. The result on the third is partly correct. The last two images demonstrate failure cases.
For DeepCompare [30], we report results only for the two best variants (out of five) for brevity. The results
show that the proposed method outperforms the compared approaches. We believe this is due to the
structure of the embedding learned by the triplet loss function, which is more suitable for
ranking purposes.
4.4 Extending to Other Datasets
To evaluate the generalization of the learned keypoint detector and descriptor, we present
qualitative results for our learned models on a dataset with different image statistics. In
particular, we applied our models to the "Wall" sequence from the Oxford dataset [19].
In Figure 9, we show the five-image sequence, comparing the first image with each of the
remaining images. Our network shows good results on the first two images,
retrieving the correct homography. The result on the third image is partly correct, and
the last two images demonstrate failure cases. The image statistics differ considerably between
our training dataset and the test images. Nevertheless, the approach shows promising results,
indicating good capability of extending to other datasets.
5 Conclusion and Future Work
Feature extraction and description is a central problem in computer vision. We presented a
novel deep learning architecture capable of multiscale keypoint detection and description.
Our approach serves as a step to bring classical approaches closer together with the recent
progress in deep learning. We plan to further investigate the model's performance on other
benchmarks and explore other avenues for multiscale detection and description.
Acknowledgments
We would like to thank Michael Wilber and Tsung-Yi Lin for their valuable input. This work
was supported by the KACST Graduate Studies Scholarship.
References
[1] Hani Altwaijry, Eduard Trulls, James Hays, Pascal Fua, and Serge Belongie. Learning
to Match Aerial Images with Deep Attentive Architectures. In CVPR, 2016.
[2] Jimmy Ba, Volodymyr Mnih, and Koray Kavukcuoglu. Multiple object recognition
with visual attention. In ICLR, 2015.
[3] Boris Babenko, Piotr Dollár, and Serge Belongie. Task specific local region matching.
In ICCV, 2007.
[4] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. SURF: Speeded up robust features.
In ECCV, 2006.
[5] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah. Signature verification using
a "siamese" time delay neural network. In NIPS, 1994.
[6] Matthew Brown, Gang Hua, and Simon Winder. Discriminative learning of local image
descriptors. PAMI, 2011.
[7] Michael Calonder, Vincent Lepetit, Mustafa Ozuysal, Tomasz Trzcinski, Christoph
Strecha, and Pascal Fua. BRIEF: Computing a local binary descriptor very fast. PAMI,
2012.
[8] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the
details: Delving deep into convolutional nets. In British Machine Vision Conference,
2014.
[9] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection.
In CVPR, 2005.
[10] Martin A. Fischler and Robert C. Bolles. Random sample consensus: A paradigm for
model fitting with applications to image analysis and automated cartography. Communi-
cations of the ACM, 1981.
[11] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierar-
chies for accurate object detection and semantic segmentation. In CVPR, 2014.
[12] Xufeng Han, Thomas Leung, Yangqing Jia, Rahul Sukthankar, and Alexander C Berg.
MatchNet: Unifying feature and metric learning for patch-based matching. In CVPR,
2015.
[13] W. Hartmann, M. Havlena, and K. Schindler. Predicting matchability. In CVPR, 2014.
[14] Max Jaderberg, Karen Simonyan, Andrew Zisserman, and Koray Kavukcuoglu. Spatial
transformer networks. In NIPS, 2015.
[15] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with
deep convolutional neural networks. In NIPS. 2012.
[16] Stefan Leutenegger, Margarita Chli, and Roland Y Siegwart. BRISK: Binary robust
invariant scalable keypoints. In ICCV, 2011.
[17] Jonathan L Long, Ning Zhang, and Trevor Darrell. Do convnets learn correspondence?
In NIPS, 2014.
[18] David G. Lowe. Object recognition from local scale-invariant features. In ICCV, 1999.
[19] Krystian Mikolajczyk, Tinne Tuytelaars, Cordelia Schmid, Andrew Zisserman, Jiri
Matas, Frederik Schaffalitzky, Timor Kadir, and Luc Van Gool. A comparison of affine
region detectors. IJCV, 2005.
[20] Volodymyr Mnih, Nicolas Heess, Alex Graves, et al. Recurrent models of visual atten-
tion. In NIPS, 2014.
[21] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards
real-time object detection with region proposal networks. In NIPS, 2015.
[22] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embed-
ding for face recognition and clustering. In CVPR, 2015.
[23] Edgar Simo-Serra, Eduard Trulls, Luis Ferraz, Iasonas Kokkinos, Pascal Fua, and
Francesc Moreno-Noguer. Discriminative learning of deep convolutional feature point
descriptors. In ICCV, 2015.
[24] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Learning local feature de-
scriptors using convex optimisation. PAMI, 2014.
[25] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir
Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper
with convolutions. In CVPR, 2015.
[26] Yaniv Taigman, Ming Yang, Marc’Aurelio Ranzato, and Lior Wolf. Deepface: Closing
the gap to human-level performance in face verification. In CVPR, 2014.
[27] Yannick Verdie, Kwang Moo Yi, Pascal Fua, and Vincent Lepetit. TILDE: A tempo-
rally invariant learned DEtector. In CVPR, 2015.
[28] Changchang Wu. Towards linear-time incremental structure from motion. In 3DV,
2013.
[29] Changchang Wu, Sameer Agarwal, Brian Curless, and Steven M Seitz. Multicore bun-
dle adjustment. In CVPR, 2011.
[30] Sergey Zagoruyko and Nikos Komodakis. Learning to compare image patches via
convolutional neural networks. In CVPR, 2015.