Keypoint Recognition Using Randomized Trees

Vincent Lepetit and Pascal Fua
{Vincent.Lepetit, Pascal.Fua}@epfl.ch
http://cvlab.epfl.ch
Abstract
In many 3D object-detection and pose-estimation problems, run-time performance is of critical
importance. However, there usually is time to train the system, which we will show to be very useful.
Assuming that several registered images of the target object are available, we developed a keypoint-based
approach that is effective in this context by formulating wide-baseline matching of keypoints extracted
from the input images to those found in the model images as a classification problem. This shifts much
of the computational burden to a training phase, without sacrificing recognition performance.
The resulting algorithm is robust, accurate, and fast enough for frame-rate performance.
This reduction in run-time computational complexity is our first contribution. Our second contribution is to show that, in this context, a simple and fast keypoint detector suffices to support detection and
tracking even under large perspective and scale variations. While earlier methods require a detector that
produces very repeatable results over arbitrary images, which is usually very time-consuming,
we simply identify the keypoints that are most repeatable on the specific target object during the training
phase.
We have incorporated these ideas into a real-time system that detects planar, non-planar, and
deformable objects. It then estimates the pose of the rigid ones and the deformations of the others.
Index Terms
Image Processing and Computer Vision, Object recognition, Tracking, Statistical, Classifier design
and evaluation, Edge and feature detection.
This work was supported in part by the Swiss National Science Foundation.
I. INTRODUCTION
In many 3D object-detection and pose estimation problems ranging from Augmented Reality
to Visual Servoing, run-time performance is of critical importance. However, there usually is
time to train the system before actually using it. Furthermore, 3D models, or multiple images
from which such models can be built, tend to be available. As shown in Figures 1 and 2, we
describe here a technique designed to operate effectively in this context by shifting much of
the computational burden to the training phase so that run-time detection becomes both fast and
reliable.
Our approach, like many others, relies on matching interest points extracted from training
images to those extracted from input images acquired at run-time under potentially large
perspective and scale variations. What is new is to formulate this wide-baseline matching problem
as a classification problem. More specifically, we consider the set of all possible appearances of
each individual object keypoint as a class, which we refer to as the view-set. During training,
given at least one image of the target object, we extract interest points and generate numerous
synthetic views of their possible appearance under perspective distortion, which are then used
to train a classifier. It is used at run-time to recognize the keypoints under perspective and scale
variations by deciding to which view-set, if any, their appearance belongs. We advocate the use
of randomized trees [2] as the classification technique, because they naturally handle multi-class
problems and are robust and fast, while remaining reasonably easy to train.
Not only is this approach more principled than many earlier ones, but it yields a reduction in
run-time computational complexity, which is our first contribution. Our method does away with
having to compute ad hoc descriptors, rectify patches, and search for nearest-neighbors, which
are time-consuming steps common in earlier methods [29], [30], [3], [20], [18], [28], [16], [22].
Another contribution is to show that, in this context, a simple and fast keypoint detector suffices
for operation even under large perspective and scale variations, which results in a further increase
in performance. While earlier methods require detectors that can be depended upon to produce
very repeatable results, which can be very time-consuming, we simply find the most repeatable
object keypoints for the specific target object during the training phase.
In the remainder of the paper, we first discuss related work and formulate wide-baseline
matching as a classification problem. We then discuss the building of our view-sets and our use of randomized trees, before presenting our results.
Fig. 1.
Detection of a book in a video sequence: As shown by the white outline, the book is detected independently and
successfully in all frames at 25Hz in 640x480 images on a standard PC, in spite of partial occlusion, cluttered background,
motion blur, large illumination and pose changes. In the last two frames, we add the inevitable virtual teapot to show we also
recover 3D pose.
but are not designed for robustness to occlusions, cluttered backgrounds, or poses different from
those in the training set.
By contrast, local approaches use simple 2D features such as corners or edges, which makes
them resistant to partial occlusions and cluttered backgrounds: Even if some features are missing,
the object can still be detected as long as enough are found and matched. Spurious matches can be
removed by enforcing geometric constraints, such as epipolar constraints between different views
or full 3D constraints if an object model is available. The use of weaker geometric constraints
is discussed in [1], [8] and is shown to achieve robustness to occlusion and clutter. However, to
be effective in our context, the feature extraction and characterization should also be insensitive
to viewpoint and illumination changes.
As proposed in [14], scale-invariant extraction can be achieved by taking the local extrema
of a Laplacian-of-Gaussian pyramid in scale-space as feature points. To increase computational
efficiency, the Laplacian can be approximated by a Difference-of-Gaussian [15]. More recent
work has focused on achieving affine invariant region detection to handle large perspective
distortions as well. [3], [28], [20] used an affine invariant point detector based on the Harris
detector, where the affine transformation that equalizes the two eigenvalues of the auto-correlation
matrix is evaluated to rectify the patch appearance. [30] achieves invariance by fitting an ellipse
to the local texture. [18] proposes a fast algorithm to extract Maximally Stable Extremal Regions.
[22] gives an extensive study of these affine invariant region detectors.
Given the extracted feature points, various local descriptors have been proposed [29], [23].
Among these, the SIFT descriptor [16] has been shown to be one of the most effective [21].
It relies on local orientation histograms and tolerates significant local deformations. In [22],
it is applied to rectified affine invariant regions to achieve perspective invariance. In [26],
a similar result is obtained by training the system using multiple views of a target object,
storing all the SIFT features from these views, and matching against all of them. [11] performs
Principal Component Analysis on such local orientation histograms, which appears to improve
the reliability of this representation. However, such descriptors can be costly, and whatever
the chosen descriptor is, matching is performed by nearest-neighbor search, which tends to be
computationally expensive, even when using an efficient data structure [4].
By contrast, in earlier work [13], we argued that formulating wide-baseline matching as a
classification problem let us shift much of the computational burden to a training phase. This
Fig. 2. The method is just as effective for 3D objects. In this experiment, we detected the stuffed tiger in the three images to
the right using a 3D model reconstructed from several views such as the first two images on the left. Note that it reprojects at
the right place in all three images.
reduces the cost of online matching while increasing its robustness. In this early work, we
used PCA as the classification technique. Here, we advocate further reducing the run-time
computational complexity by using Randomized Trees instead. They have been extensively
used for character recognition [2] and more recently for image classification [17]. They require
additional training in exchange for increased run-time efficiency because they let us replace
projection in the eigenspace and nearest-neighbor search by simple binary tests on image gray levels.
Classification as a technique for wide baseline matching has also been explored in parallel to
our own work [19]. In this approach, the training set is iteratively built from incoming frames.
While this is well-adapted for applications that do not allow for a training stage, this approach
was not designed to allow recognition from views that are very different from those already seen.
Furthermore, it uses kernel PCA, which is even more computationally expensive than PCA-based
classification.
III. WIDE BASELINE POINT MATCHING AS A CLASSIFICATION PROBLEM
Our approach relies on matching keypoints found in an input image against those on a target
object O. Once potential correspondences have been established, we apply standard techniques
to estimate 3D pose. Therefore, the critical step in achieving results such as those depicted by
Figures 1 and 2 is fast and robust wide-baseline matching in the presence of large perspective
and scale changes, which we formulate below in terms of a classification problem.
During training, we construct a set K = {k_1 . . . k_N} of N prominent keypoints lying on the
object model. At run-time, given an input patch p(k_input) centered at a keypoint k_input extracted
from the input image, we want to decide whether or not its appearance matches that of one of
Fig. 3. Sampling the view-set. (a) For a keypoint on the cover of the book of Figure 1, we synthesize new views using random
affine transformations and adding white noise. (b) Same thing for another keypoint located on the border of the book. (c, d): In
the case of the stuffed tiger, the sampling is obtained using a textured 3D model seen from different viewpoints.
the N keypoints k_i. In other words, we want to find for p(k_input) its class label Y(p) ∈ C =
{-1, 1, 2, . . . , N}, where the -1 label denotes all the points that do not belong to K. Since Y
cannot be directly observed, we build a classifier Ŷ such that P(Ŷ(p) ≠ Y(p)) is small.
In other tasks such as face detection or character recognition, large training sets of labeled
data are usually available. However, for automated pose estimation, it would be impractical to
require a very large number of sample images. Instead, to achieve robustness with respect to
pose and complex illumination changes, we use a small number of images and synthesize many
new views of the object using simple rendering techniques. For each keypoint, this gives us a
sampling of its view set, that is the set of all its possible appearances under different viewing
conditions. The first two rows of Figure 3 depict such a sampling for two keypoints detected
on the book of Figure 1, the two last rows correspond to two keypoints detected on the stuffed
tiger of Figure 2.
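For a planar object such as the book cover, this sampling can be generated by warping the region around each keypoint with random affine transformations and adding white noise. The following Python sketch illustrates the idea; NumPy and OpenCV are used for convenience, and the parameter ranges (scales, noise level, number of views) are illustrative assumptions rather than the values used in our experiments.

    import numpy as np
    import cv2

    def random_affine(center, rng):
        # Random affine transformation decomposed as R(theta) R(-phi) diag(sx, sy) R(phi),
        # chosen so that 'center' stays fixed.
        theta, phi = rng.uniform(-np.pi, np.pi, size=2)
        sx, sy = rng.uniform(0.6, 1.5, size=2)
        rot = lambda a: np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])
        A = rot(theta) @ rot(-phi) @ np.diag([sx, sy]) @ rot(phi)
        t = center - A @ center
        return np.hstack([A, t[:, None]]).astype(np.float32)   # 2x3 matrix for warpAffine

    def sample_view_set(image, keypoint, patch_size=32, n_views=100, noise_sigma=5.0, seed=0):
        # Collect warped, noisy patches around 'keypoint' = (x, y) to sample its view-set.
        rng = np.random.default_rng(seed)
        h, w = image.shape[:2]
        x, y = int(round(keypoint[0])), int(round(keypoint[1]))
        half = patch_size // 2
        views = []
        for _ in range(n_views):
            M = random_affine(np.asarray(keypoint, dtype=np.float64), rng)
            warped = cv2.warpAffine(image, M, (w, h), flags=cv2.INTER_LINEAR)
            noisy = warped + rng.normal(0.0, noise_sigma, warped.shape)    # white noise
            patch = noisy[y - half:y + half, x - half:x + half]
            if patch.shape[:2] == (patch_size, patch_size):
                views.append(np.clip(patch, 0, 255).astype(np.uint8))
        return views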
These samplings are virtually infinite training sets that serve as input to the tree-building
algorithms. The resulting trees form a very fast run-time classifier Ŷ, as introduced above. Its
output is a set of matches that we use to estimate the pose using a standard robust estimation
technique.
Fig. 4. Building the view-sets. (a) A synthetic view for the book cover, with extracted keypoints, (b) The most stable keypoints
selected by our method. (c, d) Same thing for the stuffed tiger.
within a single octave, while larger scale changes are handled by interest point detection at
several scales, as discussed below.
If the object is more complex, we can take advantage of a 3D model that has been built
either automatically or by hand using several images of the object. We use it in conjunction with
standard texture-mapping techniques to generate new views under perspective transformations.
In the case of the stuffed tiger of Figure 2, we used the ImageModeler software package 2 to
quickly build a very coarse 3D model, which has proved to be sufficient. Random profile and
frontal views of the tiger were generated by moving the camera around it while maintaining
the distance between 30 and 50 cm and the camera tilt angle between -30 and 30 degrees. As
a result, we can reliably detect the object from images acquired with cameras moving within
those bounds. Furthermore, thanks to the multi-scale matching scheme, we can actually place
the camera either further or closer.
We experimented with objects that exhibit relatively few specularities and therefore synthesized
our training images without accounting for lighting effects. We can nevertheless handle changes
in lighting because the tests we perform to classify patches, which we discuss in more detail in
Section V, rely on binary comparisons between image intensities. As a result, they are invariant
to the mostly monotonic intensity changes that lighting differences tend to produce in any given
neighborhood and lighting effects only perturb a very small fraction of the matches. Those
erroneous matches, being few in number, can easily be discarded by the robust estimator we
use for pose estimation. This is why the specular reflections and image saturation effects on
the book cover and the plastic part of the stuffed tiger do not degrade the performance of our
system, even though it has not been trained using images that exhibit such lighting effects. For
more complex cases, it should be possible to capture the material properties and generate very
realistic images [7]. However, it would be cumbersome and we have not found it to be necessary.
B. Fast Multiscale Keypoint Extraction
Our approach does not rely on a specific method for extracting keypoints, and we could
have used any of the many methods for multiscale detection that have recently been proposed.
However since we focus on real-time applications we developed our own method, which is
2. ImageModeler is a commercial product from Realviz(tm) that allows semi-automated 3D reconstruction from several views.
sufficiently fast and yet stable for our purposes. As was done in [21], [16], we look for extrema
of the Laplacian of Gaussian that are sufficiently cornerlike. However, instead of computing
some score map such as the Laplacian or the Harris matrix at each location and then searching
for the keypoints in this map, we proceed as follows.
Following a philosophy similar to the one proposed in [27], for each pixel m, we consider the
intensities along a discretized circle centered at m. We then eliminate all pixels whose gray level
is close to those of any two diametrically opposed pixels on this circle and that are therefore
unlikely to be stable keypoints. Among the remaining ones, we estimate the Laplacian using gray
level differences between pixels on the circle and the central one and retain only the locations
where this estimate is largest. This very simple approach often requires only a few intensity
comparisons to eliminate points in uniform areas and along edges and a few more to select
the maxima of the Laplacian. As a result, it is very fast and has proved sufficiently robust for
our purposes. For more details on this procedure, we refer the interested reader to a technical
report [12].
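The sketch below is a deliberately naive rendition of this detector, kept close to the description above: pixels whose gray level is close to those of a pair of diametrically opposed pixels on a surrounding circle are rejected, a Laplacian-like score is built from differences between the circle pixels and the center, and local maxima of this score are retained. The radius, threshold, and non-maximum suppression are placeholder choices; the actual, optimized procedure is the one described in [12].

    import numpy as np

    def detect_keypoints(img, radius=3, reject_thresh=10.0, n_circle=16, max_points=200):
        # Simplified circle-based keypoint detector (illustrative only).
        img = img.astype(np.float32)
        h, w = img.shape
        angles = 2.0 * np.pi * np.arange(n_circle) / n_circle
        dx = np.round(radius * np.cos(angles)).astype(int)
        dy = np.round(radius * np.sin(angles)).astype(int)

        score = np.zeros_like(img)
        for y in range(radius, h - radius):
            for x in range(radius, w - radius):
                circle = img[y + dy, x + dx]
                center = img[y, x]
                # Reject uniform areas and edges: the center is close to both pixels
                # of some diametrically opposed pair on the circle.
                close = np.abs(circle - center) < reject_thresh
                if np.any(close[:n_circle // 2] & close[n_circle // 2:]):
                    continue
                # Laplacian estimate from gray-level differences with the center.
                score[y, x] = abs(np.sum(circle - center))

        # Keep the strongest responses after a crude 3x3 non-maximum suppression.
        keypoints = []
        for y, x in np.argwhere(score > 0):
            if score[y, x] == score[y - 1:y + 2, x - 1:x + 2].max():
                keypoints.append((x, y, score[y, x]))
        keypoints.sort(key=lambda k: -k[2])
        return [(x, y) for x, y, _ in keypoints[:max_points]]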
We run this algorithm on the first few octaves of the image and simultaneously use the interest
points detected at each octave to train the classifier, as described below. Figure 5 illustrates how
this simple procedure and the keypoint classifier work in tandem to recognize the keypoints
under large variation of both scale and appearance. This contrasts with earlier techniques [22]
that depend on much more sophisticated and expensive methods to accurately extract the affine
deformation of a feature.
C. Keypoint Selection
Ideally, all the keypoints k_i in K should have a high probability P(k_i) of being found if they
are visible, perspective distortion and image noise notwithstanding.
Let T be the geometric transformation used to synthesize a new view as described in Section IV-A, and k̃ an interest point extracted from this view using the procedure of Section IV-B.
T is either an affine transformation or a projection and, by applying T^{-1} to k̃, we can recover
a corresponding keypoint k in the reference system. Thus, given several synthetic views, P(k)
can be estimated by simply counting how often it is found. The set K is then constructed by
only retaining keypoints with a high P (k). In our experiments, we retain the 200 keypoints with
highest P (k). Figure 4 shows the keypoints selected on the book cover and the stuffed tiger. For
Fig. 5. Robustness to scale and perspective changes. First row: The image on the left shows a keypoint selected on the original
image, the three images on the right show the same keypoint retrieved at different scales under perspective distortion, and some
saturation and blur effects. Second row: Same for another keypoint.
each keypoint k_i ∈ K, we build the corresponding view-set by collecting the neighborhoods p of the corresponding k̃ in the generated images, as shown in Figure 4.
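Selecting the most stable keypoints therefore amounts to counting, over many synthetic views, how often each reference location is re-detected once the detections are mapped back through the inverse of the known warp. The sketch below does this for the affine case; it reuses the hypothetical random_affine and detect_keypoints helpers introduced earlier and quantizes the back-projected locations on a coarse grid to absorb small localization shifts.

    import numpy as np
    import cv2

    def select_stable_keypoints(image, n_views=500, n_keep=200, grid=2, seed=0):
        # Rank reference locations by how often they are re-detected in synthetic views.
        rng = np.random.default_rng(seed)
        h, w = image.shape[:2]
        center = np.array([w / 2.0, h / 2.0])
        counts = {}
        for _ in range(n_views):
            M = random_affine(center, rng)                  # warp T used to render the view
            view = cv2.warpAffine(image, M, (w, h))
            A, t = M[:, :2].astype(np.float64), M[:, 2].astype(np.float64)
            A_inv = np.linalg.inv(A)
            for (x, y) in detect_keypoints(view):
                xr, yr = A_inv @ (np.array([x, y], dtype=np.float64) - t)   # apply T^-1
                if 0 <= xr < w and 0 <= yr < h:
                    key = (int(round(xr / grid)), int(round(yr / grid)))
                    counts[key] = counts.get(key, 0) + 1
        ranked = sorted(counts.items(), key=lambda kv: -kv[1])
        return [(gx * grid, gy * grid) for (gx, gy), _ in ranked[:n_keep]]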
The above procedure accounts for the sensitivity of the detector to perspective distortions. How-
ever, this may not be sufficient because, when a keypoint is detected in two different images,
its precise location may shift a bit due to image noise and viewpoint change. In practice, such a
positional shift results in large errors of direct cross-correlation measures. One solution would be
to iteratively refine the keypoint localization [20], which could be costly. Instead, we solve this
problem by adding white noise to the synthetic views before keypoint extraction. The resulting
view sets thus capture the positional shift, and force the classifier to learn invariance to this shift.
Each view is rendered against a different, complex random background. The classifier is
therefore trained to recognize the keypoints against cluttered backgrounds, including points that are
close to the object's visual boundary. This contrasts with other patch-based methods that fail to
match a keypoint when the corresponding patch overlaps the background.
V. KEYPOINT CLASSIFICATION AND RECOGNITION
Several classification algorithms, such as K-Nearest Neighbor, Support Vector Machines or
neural networks could have been chosen to implement the classifier Ŷ introduced in Section III.
Fig. 6. Generic tree used for keypoint recognition. When using C2 tests, the nodes contain tests comparing two pixels in the keypoint neighborhood; the leaves contain the posterior probabilities P_{(l,p)}(Y(p) = c).
Among those, we have found randomized trees [2] to be eminently suitable because they naturally
handle multi-class problems and are robust and fast, while remaining reasonably easy to train.
They are simple but powerful tools for classification, introduced and applied to recognition of
handwritten digits in [2]. They are closely related to the regression trees in the CART method [5].
Several trees are grown with some form of randomization as in [6] for example, but the queries
can be more complex than those of regression trees. In this section we first describe them briefly
in the context of our problem for the benefit of the unfamiliar reader. We then study their
properties and justify our implementation choices.
A. Randomized Trees
Figure 6 depicts a generic tree. Each internal node contains a simple test that splits the space
of data to be classified, in our case the space of image patches. Each leaf contains an estimate
based on training data of the posterior distribution over the classes. A new patch is classified
by dropping it down the tree and performing an elementary test at each node that sends it to
one side or the other. When it reaches a leaf, it is assigned probabilities of belonging to a
Fig. 7. What makes the Randomized Trees tick. (a) and (b) are two partitions, which result in a finer one (c) when they are superimposed. In a similar manner, each tree performs a different partition of the patch space, and combining their responses results in a finer partition.
class depending on the distribution stored in the leaf. Since the numbers of classes, training
examples and possible tests are large in our case, building the optimal tree quickly becomes
intractable. Instead, multiple trees are grown so that each tree yields a different partition of the
space of image patches. We discuss two different ways to build the trees in Section V-C. Once
the trees T1 , . . . , TL are built, their responses are combined during classification to achieve a
better recognition rate than a single tree could. More formally, the tree leaves store posterior
probabilities P_{(l,p)}(Y(p) = c), where c is a label in C and (l, p) denotes the leaf of tree T_l reached
by the patch p. Such probabilities are evaluated during training as the ratio of the number of
patches of class c in the training set that reach this leaf to the total number of patches that reach it.
Following [2], p is classified by considering the average of the probabilities P_{(l,p)}(Y(p) = c):

    Ŷ(p) = argmax_c p_c(p) = argmax_c (1/L) Σ_{l=1...L} P_{(l,p)}(Y(p) = c) .        (1)
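In code, Equation (1) amounts to dropping the patch down every tree and averaging the class posteriors stored in the leaves it reaches, as in the minimal sketch below. The Node representation, the test closures, and the scalar rejection threshold are simplifying assumptions; the tests stored at the nodes are, in our case, the binary gray-level comparisons described below, and in practice we use a per-class threshold T_c rather than a single scalar.

    import numpy as np

    class Node:
        # Internal node: a binary test and two children; leaf: a stored posterior vector.
        def __init__(self, test=None, left=None, right=None, posterior=None):
            self.test, self.left, self.right, self.posterior = test, left, right, posterior

    def leaf_posterior(tree, patch):
        # Drop 'patch' down one tree and return the posterior stored in the leaf it reaches.
        node = tree
        while node.posterior is None:
            node = node.left if node.test(patch) else node.right
        return node.posterior

    def classify(trees, patch, threshold=0.1):
        # Average the leaf posteriors over all trees (Eq. 1) and return (label, confidence).
        p = np.mean([leaf_posterior(t, patch) for t in trees], axis=0)
        c = int(np.argmax(p))
        # Reject dubious matches; a per-class threshold T_c is used in practice.
        return (c, float(p[c])) if p[c] >= threshold else (-1, float(p[c]))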
Figure 7 gives an intuitive idea of what makes this approach a good one. Each tree performs
a different partition depending on the specific tests it contains. Combining the output of all the
trees results in a finer partition that yields a better classification. As the depth and number of
trees are increased, this partition becomes finer and finer, and the estimated distribution becomes
closer and closer to the actual one, at the cost of increasing both computational expense and
memory requirements. This trade-off will be studied in Section V-C.
p_c(p) is the average of the posterior probabilities of class c and constitutes a good measure
of the match confidence. We estimate during training a threshold T_c to decide if the match is correct.
B. Node Tests

The test performed at each node of a tree is a simple binary comparison between the gray levels of two pixel locations m1 and m2 in the patch p being classified:

    C2(m1, m2): go to left child if I(p, m1) ≤ I(p, m2), go to right child otherwise,
where I (p, m) is the intensity of patch p at pixel location m, after Gaussian smoothing to
reduce influence of noise. Such a test can be seen as a test on the polarity between the two
locations m1 and m2. In all our experiments, the patches are of size 32 x 32, so that the total
number of possible C2 tests is 2^19. Fortunately, since real-world images exhibit spatial coherence,
only a very small subset is required to yield good recognition rates.
As shown below, a few hundred of these simple tests are usually enough to classify a patch.
This involves only a few hundred intensity comparisons and additions per patch, and is therefore
very fast. Furthermore, because they only depend on the order of the pixel intensities between
neighbors, they tend to be fairly insensitive to illumination changes other than those caused by a
moving shadow. In other words, to achieve the robustness to illumination effects demonstrated in
Figure 1, our technique, unlike many others, does not require us to normalize the pixel intensities,
for example by setting the L2 norm of the intensities to one.
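Concretely, a C2 test is a single comparison between two smoothed gray levels, and the Gaussian smoothing needs to be applied only once per patch before all the tests are run, as in the sketch below; the kernel size and sigma are illustrative assumptions. For instance, make_c2_test((3, 7), (20, 14)) builds one of the roughly 2^19 possible tests on a 32 x 32 patch.

    import cv2

    def smooth(patch, ksize=5, sigma=1.0):
        # Smooth the 32x32 patch once; all C2 tests are then run on the smoothed patch.
        return cv2.GaussianBlur(patch, (ksize, ksize), sigma)

    def make_c2_test(m1, m2):
        # Return a closure implementing one C2 test for pixel locations m1 = (x1, y1), m2 = (x2, y2).
        (x1, y1), (x2, y2) = m1, m2
        return lambda patch: patch[y1, x1] <= patch[y2, x2]   # True -> go to left child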
Note, however, that one advantage of Randomized Trees is that they impose no restriction on
the kind of tests performed at the nodes. A tree can even contain different kinds of tests. An
infinite number of tests other than the ones introduced above could have been designed, most of which
would be computationally more demanding. To show that the ones we chose represent a good
compromise between recognition ability and run-time speed, we will compare in Section V-D
their performance with those of more complex tests, including tests based on histograms of local
orientations and inspired by the SIFT detector [16].
C. Growing the Trees
To improve the recognition rate, we use multiple trees that should partition the patch space
in different manners, as depicted by Figure 7. We experimented with two different methods for
building such trees.
The first method is the one used in [2]: The trees are constructed in the classical, top-down
manner, where the tests are chosen by a greedy algorithm to best separate the given examples.
The expected gain in information is used to evaluate the separation efficiency. The gain caused
by partitioning a set S of examples into several subsets Si according to a given test is measured
as
    ΔE = − Σ_i (|S_i| / |S|) E(S_i) ,        (2)

where E(s) is the Shannon entropy − Σ_{j=1}^{N} p_j log2(p_j) computed from the proportions p_j of the examples in s
belonging to class j, and |.| denotes the size of a set. The process of selecting a test is repeated
for each non-terminal node, using only the training examples falling in that node. The recursion
is stopped when the node receives too few examples, or when it reaches a given depth. In our
implementation, the depth varies between 10 and 15, and the minimum number of examples is fixed to
10.
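The criterion of Equation (2) can be computed directly from the class labels of the examples reaching a node, as sketched below; a candidate test is scored by the size-weighted entropy of the subsets it produces (negated, so that larger is better), and the best-scoring candidate is kept. Class labels are assumed here to be non-negative integers.

    import numpy as np

    def entropy(labels, n_classes):
        # Shannon entropy of the class distribution of a set of examples.
        if len(labels) == 0:
            return 0.0
        p = np.bincount(labels, minlength=n_classes) / len(labels)
        p = p[p > 0]
        return float(-np.sum(p * np.log2(p)))

    def expected_gain(labels, goes_left, n_classes):
        # Delta E of Eq. (2): negated, size-weighted entropy of the subsets produced by a test.
        labels = np.asarray(labels)
        goes_left = np.asarray(goes_left, dtype=bool)
        subsets = [labels[goes_left], labels[~goes_left]]
        return -sum(len(s) * entropy(s, n_classes) for s in subsets) / len(labels)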
The second method is much faster and simpler: Instead of picking questions according to a
criterion, we simply pick a random set. This can be seen as an extreme simplification of the first
method. For example, in the case of the C2 test, the two locations m1 and m2 for each node
are picked at random within the patch, independently of the training samples that fall into the
node and of the tests performed further up in the tree.
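Growing one tree with purely random tests then reduces to the short recursion sketched below: two locations are drawn at random at each node, the training patches are routed by the resulting C2 test, and the recursion stops at the maximum depth or when too few examples remain; the leaves are later filled with posteriors estimated from fresh synthetic views. The Node and make_c2_test helpers are the hypothetical ones from the earlier sketches, and class labels are again assumed to be integers in [0, n_classes).

    import numpy as np

    def grow_random_tree(patches, labels, depth=0, max_depth=12, min_examples=10,
                         patch_size=32, n_classes=200, rng=None):
        # Recursively grow one tree whose node tests are chosen completely at random.
        rng = np.random.default_rng() if rng is None else rng
        labels = np.asarray(labels)
        if depth == max_depth or len(labels) < min_examples:
            posterior = np.bincount(labels, minlength=n_classes) / max(len(labels), 1)
            return Node(posterior=posterior)
        # Pick the two C2 locations at random, independently of the data and of earlier tests.
        m1 = tuple(rng.integers(0, patch_size, size=2))
        m2 = tuple(rng.integers(0, patch_size, size=2))
        test = make_c2_test(m1, m2)
        goes_left = np.array([bool(test(p)) for p in patches])
        left_patches = [p for p, g in zip(patches, goes_left) if g]
        right_patches = [p for p, g in zip(patches, goes_left) if not g]
        return Node(test=test,
                    left=grow_random_tree(left_patches, labels[goes_left], depth + 1,
                                          max_depth, min_examples, patch_size, n_classes, rng),
                    right=grow_random_tree(right_patches, labels[~goes_left], depth + 1,
                                           max_depth, min_examples, patch_size, n_classes, rng))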
To compare the two tree-building methods we have introduced, we used them both on the set
of 200 keypoints depicted by Figure 4. This resulted in two sets of trees whose nodes contained
C2 binary tests and whose depth was limited to the same value.
When using the simpler approach, we grew the trees by randomly picking locations for each
node, as discussed above. When using the entropy minimizing approach, we first synthesized
100 new views, different for each tree to grow. We then recursively built the trees by trying n
different tests at each node and keeping the best one according to the criterion of Equation (2).
For the root node, we chose n = 10, a very small number, to reduce the correlation between
the resulting trees. For all other nodes, we used n = 100d, where d is the depth of the node.
Note that this heuristic involves randomizing on both tests and training data. As noted in [2],
randomizing on tests is far more powerful than randomizing on data. We do the latter not
to improve classification performance but to keep our greedy algorithm tractable: since intra-class variations are large, too many examples would otherwise be needed to sample each class in
a representative way.
In the case of the completely random approach to building trees, m1 and m2 were simply
chosen at random. For the two sets, the tree depth is limited to a given maximal depth, and the
posterior probabilities are estimated from 1000 new random views per keypoint.
We present here results for trees with a depth limited to 12, which was found to be a good
trade-off between the memory requirements and recognition rate. After having grown the trees,
the posterior probabilities in the terminal nodes were estimated using 5000 new training images.
We then measured the recognition rate R of the two sets of trees by generating new images
under random poses, as the ratio of the number of correctly recognized patches to the total
number of generated patches. The evolution of R for the two sets of trees with respect to the
number of trees is depicted in Figure 8(a). Taking the tests at random usually results in a small
loss of reliability, at least when the number of trees is not large enough, but considerably reduces
the learning time. The time dedicated to growing the trees drops from tens of minutes to a few
seconds on a 2.8 GHz machine.
We also experimented with normalizing the patches p of Section III both during training and at
run-time to achieve higher recognition rates for a given number of trees. As in [16] we attribute
a 2D orientation to the keypoints that is estimated from the histogram of gradient directions in
a patch centered at the keypoint. Note that by contrast with [16], we do not require a particularly
Fig. 8. Comparing the classification rates obtained using trees grown in two different manners, as a function of the number of trees. (a) Without and (b) with patch orientation normalization. The thick lines depict results obtained by selecting tests that maximize the information gain. The thin lines depict results obtained by randomly chosen tests, which result in a small loss of reliability but considerably reduce the training time. Note that in all cases the normalization lets us achieve better results with fewer trees. However, when enough trees are used, it does not improve the rates anymore.
repeatable method. We just want it to be reliable enough to reduce variation within classes. Once
the orientation of an extracted keypoint is estimated, its neighborhood is rectified. Figure 8(b)
compares the recognition rates with this normalization step for the two different methods of
selecting the tests. Taking the tests at random results in a slightly larger but still small loss of
reliability. More importantly, the normalization gives us significantly improved rates when using
only a small number of trees. However, when using a large number of trees, the recognition
rates are similar with and without the normalization.
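The orientation normalization we use can be sketched as follows: a magnitude-weighted histogram of gradient directions is built over the patch, its dominant bin gives the 2D orientation, and the patch is rotated accordingly before being dropped down the trees. The number of bins and the OpenCV routines used here (Sobel, getRotationMatrix2D, warpAffine) are stand-ins for our actual implementation.

    import numpy as np
    import cv2

    def dominant_orientation(patch, n_bins=36):
        # Estimate a 2D orientation from a magnitude-weighted histogram of gradient directions.
        gx = cv2.Sobel(patch, cv2.CV_32F, 1, 0, ksize=3)
        gy = cv2.Sobel(patch, cv2.CV_32F, 0, 1, ksize=3)
        mag = np.sqrt(gx ** 2 + gy ** 2)
        ang = np.arctan2(gy, gx)                             # in (-pi, pi]
        hist, edges = np.histogram(ang, bins=n_bins, range=(-np.pi, np.pi), weights=mag)
        k = int(np.argmax(hist))
        return 0.5 * (edges[k] + edges[k + 1])               # center of the dominant bin

    def normalize_orientation(patch):
        # Rotate the patch so that its dominant gradient orientation becomes a canonical one.
        h, w = patch.shape[:2]
        angle_deg = np.degrees(dominant_orientation(patch))
        M = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), angle_deg, 1.0)
        return cv2.warpAffine(patch, M, (w, h), flags=cv2.INTER_LINEAR)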
We draw two practical conclusions from these experiments. First, using random tests is sufficient and keeps the learning time reasonable for practical applications. Second, the orientation
normalization step is not required, but lets us reduce the number of trees. Therefore, using such a normalization becomes a trade-off between the time required to normalize the patches and the time required to classify them, the latter being proportional to the number of trees. The amount of memory
used by the trees should also be taken into account. In practice, we have found it more effective
to normalize and use fewer trees, in part because the normalization we use is much simpler than
a full affine rectification. Therefore, in the remainder of the paper, we normalize the patches and
use randomly chosen tests.
D. Node Test Comparison

To evaluate our choice of tests, we also consider the C4 family, whose members compare intensity differences between two pairs of pixel locations:

    C4(m1, m2, m3, m4): go to left child if I(p, m1) − I(p, m2) ≤ I(p, m3) − I(p, m4), go to right child otherwise.
These tests compare both the strength and polarity of the edge between m 1 and m2 , and those
of the edge between m3 and m4 .
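In the same style as the earlier C2 sketch, a C4 test can be written as a single comparison between two signed intensity differences; the exact inequality below is an assumption consistent with the description above.

    def make_c4_test(m1, m2, m3, m4):
        # C4 test: compare the signed difference over (m1, m2) with that over (m3, m4).
        (x1, y1), (x2, y2), (x3, y3), (x4, y4) = m1, m2, m3, m4
        return lambda p: (int(p[y1, x1]) - int(p[y2, x2])) <= (int(p[y3, x3]) - int(p[y4, x4]))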
We also define the Ch family, made of more sophisticated tests based on local orientation
histograms computed as follows. As in the SIFT characterization, each patch is divided into
4 x 4 subregions, and an array of histograms with 8 orientation bins is computed for each subregion.
The tests in the Ch family compare the values of two of these bins.
The responses of the C2 , C4 and Ch families of tests were compared using two kinds of feature
points. In one case, we used a set we denote as Title. It includes 100 keypoints detected on
the title of the book cover of Figure 1, which represents strong and structured edges. The second
set of keypoints denoted Eyes is made of 100 keypoints detected on the picture of the same
book cover, which presents mostly textured areas. For both sets, we grew several sets of trees
by varying the tests and tree depths. We tested them for depths ranging from 10 to 15, as larger
values quickly become intractable due to the increasing memory requirements. Figures 9 and 10
summarize the results as a function of the number of trees and of their depth.
Somewhat surprisingly, the C4 test results are slightly worse than the C2 ones. The Ch
tests slightly outperform the C2 tests on the Title set, which is not very textured, and yield
approximately the same recognition rate on the more textured Eyes set.
However, and this is remarkable, the differences are not really significant. To illustrate that,
in Figure 11, we superpose the results obtained using the three kinds of tests with the same tree depth
of 15. This is a very interesting property of the classification approach: Complex tests are
Fig. 9. Comparison of recognition rates as a function of the number of trees, for different tree depths and different tests, applied to the Title (left) and the Eyes (right) keypoint sets. First row: using the C2 family tests; second row: using the C4 family tests; third row: using the Ch family tests.
not necessary, simple ones suffice to reach similar performances. This can be explained by the
fact that the partition obtained when combining the trees becomes sufficiently fine when enough
trees are used, provided that each individual tree partitions the patch space in a different way.
Since the Ch tests are computationally much more costly, in practice we use the C2 tests, as
the slight loss in performance does not appear to have any ill effect on our RANSAC-based pose estimation.
Fig. 10. Classification rates for the C2, C4 and Ch test families using 50 trees. Tests from the Ch family slightly outperform the others on the Title set and perform comparably on the Eyes set.

                          C2       C4       Ch
    Title set  depth 10   60.7%    57.7%    66.6%
               depth 12   69.2%    65.1%    75.0%
               depth 15   77.0%    73.7%    82.4%
    Eyes set   depth 10   72.7%    70.0%    74.5%
               depth 12   78.6%    76.1%    84.2%
               depth 15   84.7%    81.4%    84.2%
Fig. 11. Classification rates for the C2, C4 and Ch test families as a function of the number of trees. (a) Title set, (b) Eyes set.
Fig. 12.
Fig. 13.
in and out of the field of view, the camera motion is very jerky and the illumination changes all
the time: the sail is detected in all frames where a sufficient portion of the sail is seen.
Our implementation has been tested on numerous objects under different lighting conditions
using a simple webcam, and an executable can be downloaded from our website. It runs at 25
frames per second on a 2.8 GHz PC.
Figures 2 and 18 show detection results for the stuffed tiger for which a quickly built 3D
Fig. 14.
Detecting a sail. Four frames from a 4000 frame video acquired using a home camcorder during a regatta. Even
though the camera jerks, the zoom changes and the lighting conditions are poor, the sail is detected in all frames, such as those
shown above, where a sufficient portion of the sail is seen.
Fig. 15.
Detection of the book: The white lines represent the inlier matches established in real-time under several poses.
model. We use a P3P algorithm and RANSAC to robustly estimate its pose. The object is
successfully detected from different sides, from different distances, and both from below and above.
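Given the 2D-3D correspondences produced by the keypoint classifier, any robust PnP solver can recover the pose; the sketch below uses OpenCV's solvePnPRansac with the P3P flag as a stand-in for the P3P-plus-RANSAC procedure mentioned above. The camera matrix, distortion coefficients and reprojection threshold are placeholders.

    import numpy as np
    import cv2

    def estimate_pose(points_3d, points_2d, camera_matrix, dist_coeffs=None, reproj_err=4.0):
        # Robustly estimate the object pose from matched 3D model points and 2D detections.
        points_3d = np.asarray(points_3d, dtype=np.float32).reshape(-1, 1, 3)
        points_2d = np.asarray(points_2d, dtype=np.float32).reshape(-1, 1, 2)
        dist_coeffs = np.zeros(4) if dist_coeffs is None else dist_coeffs
        ok, rvec, tvec, inliers = cv2.solvePnPRansac(
            points_3d, points_2d, camera_matrix, dist_coeffs,
            reprojectionError=reproj_err, flags=cv2.SOLVEPNP_P3P)
        if not ok:
            return None
        R, _ = cv2.Rodrigues(rvec)          # rotation matrix from the axis-angle vector
        return R, tvec, inliers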
We compared our results with those obtained using the executable that implements the SIFT
method [16] kindly provided by David Lowe. As shown in Figures 16 and 17, when there is little
distortion, the SIFT approach yields a few more matches. However, when too much perspective
distorts the object image, it produces far fewer matches, while our approach is not perturbed.
To be fair, we note again that this robustness comes at the cost of a training stage that
SIFT does not require. If training is possible, as suggested in [16] and demonstrated in [26],
SIFT can be made more robust to distortions by using multiple views of the target object,
storing all the SIFT features from these views, and matching against all of them. This does not
significantly increase the matching time when approximate nearest-neighbor k-D tree search is
used. It remains, however, less computationally effective than ours, which depends only on a
Fig. 16. Comparison with SIFT (left image) and our approach (right image). When there is little distortion, the SIFT approach
gives a few more matches, but when too much perspective distorts the object image, it gives only a few matches, while our
approach is not perturbed.
Fig. 17.
Comparison with SIFT (left image) and our approach (right image) on the mouse pad sequence.
is severely deformed. When combined with an appropriately defined robust estimator for the
keypoint distances and optimization schedule, this approach to minimization allows real-time
detection in under 100 milliseconds while being robust to large deformations, severe occlusions,
and changes in lighting.
Fig. 18. Our method can take advantage of a 3D model when available, to detect the target and estimate its pose from different
viewpoints.
Fig. 19.
Detecting a deforming sheet of paper at 10Hz on a 2.8 GHz PC. (a) The left image is the input image in which the
sheet of paper is detected and its deformation estimated. The right one is the training image. The lines represent the matches
recovered using our method. Black ones correspond to those estimated to be outliers and the white ones to inliers. (b) Same
thing for a different input image.
Fig. 20. Detecting a deforming T-Shirt. (a) Model image used for training purposes. (b) To illustrate the mapping our algorithm
computes, we find the contours of the model using a simple gradient operator and we use them as a validation texture (c) which
is overlaid on the input image using the recovered transformation (d). Additional results are obtained in different conditions (e
to j). Note that in all cases, including the one where the T-shirt is replaced by a cup (j), the white outlines project almost exactly
at the right place, thus indicating a correct registration and shape estimation.
phase so that, at run-time, we can achieve both speed and reliability. Like many others, our
approach relies on matching keypoints extracted from one or more training images of the target
object against those extracted from input images. What is new is to have reformulated the wide-baseline matching problem, which this involves, as a classification problem that can be effectively
solved using randomized trees that are both fast and easily implemented.
We have shown that very good results can be obtained using a very simple approach to building
the trees. More sophisticated building methods yield slight performance increases, but at the cost
of much larger requirements both in terms of computation and memory. In practice, given the
fact that we use robust techniques to estimate 3D poses from the matches, we have not found
it worthwhile to impose that extra burden as the simpler techniques are sufficient to achieve
robust frame-rate detection and pose estimation. However, we could easily replace individual
components of our approach, such as the keypoint detector, by improved or more generic ones
as they become available without changing its overall philosophy.
Our approach is well adapted to cases where one or more images of the target object are
available for off-line training purposes. If the object is either planar or nearly so, the images
can be directly used without any preprocessing. If the object is fully 3D, the images can be
used to build the rough 3D model required to train the system. Our next step is to speed up
the training procedure itself so that it can also become an in-line process, thereby removing
the major limitation of our approach when compared to state-of-the-art ones that do not require
any a priori training. One obvious way to achieve this is to first track the target object under
favorable conditions using standard techniques, which will allow us to follow keypoints across
frames and to incrementally build the view-sets and the corresponding trees. This will constitute
a starting point for our future investigations.
ACKNOWLEDGEMENTS
The authors would like to thank François Fleuret for suggesting the use of randomized trees
and for his insightful comments, and Tom Drummond for discussions on keypoint detection.
REFERENCES
[1] Y. Amit. 2D Object Detection and Recognition: Models, Algorithms, and Networks. The MIT Press, 2002.
[2] Y. Amit and D. Geman. Shape Quantization and Recognition with Randomized Trees. Neural Computation, 9(7):1545-1588, 1997.
[3] A. Baumberg. Reliable Feature Matching across Widely Separated Views. In Conference on Computer Vision and Pattern Recognition, pages 774-781, 2000.
[4] J. Beis and D.G. Lowe. Shape Indexing using Approximate Nearest-Neighbour Search in High-Dimensional Spaces. In Conference on Computer Vision and Pattern Recognition, pages 1000-1006, Puerto Rico, 1997.
[5] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Chapman & Hall, New York, 1984.
[6] L. Breiman. Bagging predictors. Machine Learning, 24(2):123-140, 1996.
[7] P. Debevec. Rendering synthetic objects into real scenes: Bridging traditional and image-based graphics with global
illumination and high dynamic range photography. In ACM SIGGRAPH, July 1998.
[8] R. Fergus, P. Perona, and A. Zisserman. Object Class Recognition by Unsupervised Scale-Invariant Learning. In Conference on Computer Vision and Pattern Recognition, 2003.
[13] V. Lepetit, J. Pilet, and P. Fua. Point Matching as a Classification Problem for Fast and Robust Object Pose Estimation.
In Conference on Computer Vision and Pattern Recognition, Washington, DC, June 2004.
[14] T. Lindeberg. Scale-space theory: A basic tool for analysing structures at different scales. Journal of Applied Statistics, 21(2):224-270, 1994.
[15] D.G. Lowe. Object recognition from local scale-invariant features. In International Conference on Computer Vision, pages 1150-1157, 1999.
[16] D.G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision, 60(2):91-110, 2004.
[17] R. Marée, P. Geurts, J. Piater, and L. Wehenkel. Random subwindows for robust image classification. In Conference on Computer Vision and Pattern Recognition, 2005.
[18] J. Matas, O. Chum, U. Martin, and T. Pajdla. Robust Wide Baseline Stereo from Maximally Stable Extremal Regions. In British Machine Vision Conference, pages 384-393, London, UK, September 2002.
[19] J. Meltzer, M.-H. Yang, R. Gupta, and S. Soatto. Multiple View Feature Descriptors from Image Sequences via Kernel Principal Component Analysis. In European Conference on Computer Vision, pages 215-227, May 2004.
[20] K. Mikolajczyk and C. Schmid. An Affine Invariant Interest Point Detector. In European Conference on Computer Vision, pages 128-142, Copenhagen, 2002.
[21] K. Mikolajczyk and C. Schmid. A Performance Evaluation of Local Descriptors. In Conference on Computer Vision and Pattern Recognition, pages 257-263, June 2003.
[22] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. Van Gool. A
comparison of affine region detectors. Accepted to International Journal of Computer Vision, 2005.
[23] F. Mindru, T. Moons, and L. Van Gool. Recognizing color patterns irrespective of viewpoint and illumination. In Conference on Computer Vision and Pattern Recognition, pages 368-373, 1999.
[24] S. K. Nayar, S. A. Nene, and H. Murase. Real-Time 100 Object Recognition System. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(12):1186-1198, 1996.
[25] J. Pilet, V. Lepetit, and P. Fua. Real-Time Non-Rigid Surface Detection. In Conference on Computer Vision and Pattern
Recognition, San Diego, CA, June 2005.
[26] D. Pritchard and W. Heidrich. Cloth motion capture. In Eurographics, volume 22(3), pages 263-271, September 2003.
[27] E. Rosten and T. Drummond. Fusing points and lines for high performance tracking. In International Conference on
Computer Vision, Beijing, China, October 2005.
[28] F. Schaffalitzky and A. Zisserman. Multi-View Matching for Unordered Image Sets, or "How Do I Organize My Holiday Snaps?". In European Conference on Computer Vision, pages 414-431, 2002.
[29] C. Schmid and R. Mohr. Local Grayvalue Invariants for Image Retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(5):530-534, May 1997.
[30] T. Tuytelaars and L. Van Gool. Wide Baseline Stereo Matching based on Local, Affinely Invariant Regions. In British Machine Vision Conference, pages 412-422, 2000.
[31] P. Viola and M. Jones. Rapid Object Detection using a Boosted Cascade of Simple Features. In Conference on Computer Vision and Pattern Recognition, pages 511-518, 2001.
Vincent Lepetit received the engineering and master degrees in Computer Science from the ESIAL in
1996. He received the PhD degree in Computer Vision in 2001 from the University of Nancy, France,
after working in the ISA INRIA team. He then joined the Virtual Reality Lab at EPFL (Swiss Federal
Institute of Technology) as a post-doctoral fellow and became a founding member of the Computer Vision
Laboratory. He has received several awards in Computer Vision including the best paper award at CVPR
2005. His research interests include vision-based Augmented Reality, 3D camera tracking, and object
recognition.
Pascal Fua received an engineering degree from Ecole Polytechnique, Paris, in 1984 and the Ph.D. degree
in Computer Science from the University of Orsay in 1989. He joined EPFL (Swiss Federal Institute of
Technology) in 1996 where he is now a Professor in the School of Computer and Communication Science.
Before that, he worked at SRI International and at INRIA Sophia-Antipolis as a computer scientist. His
research interests include human body modeling from images, optimization-based techniques for image
analysis and synthesis, and the use of information theory in the area of model-based vision. He has
(co)authored over 100 publications in refereed journals and conferences. He is an associate editor of the IEEE Transactions
on Pattern Analysis and Machine Intelligence and has been a program committee member of several major vision conferences.
Vincent Lepetit (Corresponding author)
Tel: +41-21-693-6716
E-mail: vincent.lepetit@epfl.ch
Postal address: EPFL / IC / ISIM / CVLab, Station 14, CH-1015 Lausanne, SWITZERLAND

Pascal Fua
Tel: +41-21-693-7519
E-mail: pascal.fua@epfl.ch
Postal address: EPFL / IC / ISIM / CVLab, Station 14, CH-1015 Lausanne, SWITZERLAND