
Affordable Infrared-Optical Pose-Tracking
for Virtual and Augmented Reality

Thomas Pintaric (e-mail: pintaric@ims.tuwien.ac.at)
Hannes Kaufmann (e-mail: kaufmann@ims.tuwien.ac.at)

Interactive Media Systems Group
Institute of Software Technology and Interactive Systems
Vienna University of Technology
ABSTRACT

In this paper, we describe the hardware and software of a new low-cost infrared-optical pose-tracking system for room-sized virtual environments. The system consists of 4-8 shutter-synchronized 1394-cameras with an optical bandpass filter and infrared illuminator. All image-processing is done in software on an attached workstation. Preliminary results indicate low latency (20-40 ms), minimal jitter (RMS less than 0.05 mm / 0.02°), submillimeter location resolution and an absolute accuracy of ±0.5 cm. Currently, ten independent 6-DOF targets can be tracked in real-time with up to 60 Hz.

Keywords: Infrared-optical 6-DOF pose-tracking, low-cost.

Index Terms: I.4.0 [Computing Methodologies]: Image Processing and Computer Vision—General; I.3.8 [Computing Methodologies]: Computer Graphics—Applications
Figure 1: An early prototype of our pose-tracker in a mobile, tripod-mountable configuration for room-sized virtual environments.

1 MOTIVATION

Time and again, a lack of affordability has delayed or averted the adoption of an otherwise viable technology. In the context of Virtual and Augmented Reality (VR/AR), the single largest expense for building a room-sized virtual environment is almost always incurred by the acquisition of a wide-area 6-DOF pose-tracker.

Only a few vendors offer commercial systems that extend beyond 20 square meters in coverage and are capable of tracking at least six independent targets with a precision, latency and measurement frequency that is considered adequate for a range of VR/AR applications (see Section 2). Systems meeting those criteria include 3rd Tech's HiBall [1] (inside-out, infrared-optical), the Intersense IS-1200 [2] (inside-out, hybrid optical/inertial) and the A.R.T. ARTrack (outside-in, infrared-optical; http://www.ar-tracking.de). A brief survey of available optical tracking systems is given in [3]. In a configuration tailored to a room-sized multi-user environment, all of the aforementioned systems have price tags in the range of tens of thousands of Euros. While corporate entities and well-funded research laboratories will not be deterred by such amounts, it is the authors' first-hand experience that many smaller educational institutions, especially secondary schools, operate on tightly constrained budgets that leave little, if any, room for an expense of this magnitude, even if third-party subsidies are available.

It was an attempt to build an affordable hardware platform for Construct3D [4], a collaborative Augmented Reality application designed to supplement the mathematics and geometry education curriculum of secondary schools, that prompted us to build our own wide-area tracker. Born out of necessity in early 2005, our tracking system has since undergone three design iterations and is currently deployed at locations in three different countries. At the time of this writing (January 2007), we are beginning to make the system commercially available to interested parties for a price of six to eight thousand Euros (depending on the configuration). Our goal is to provide affordable, high-quality tracking to a broad range of smaller VR/AR application developers.

The remainder of this paper summarizes our requirement analysis, outlines hardware and software design choices, and gives preliminary performance results. It is intended as a high-level overview, not an in-depth discussion of algorithmic or constructional details.

2 REQUIREMENT ANALYSIS

Choosing commodity hardware over custom-built components has always been a reliable cost-minimization strategy. The ready availability of inexpensive electronic imaging devices with a standardized computer interface (e.g. IEEE-1394) prompted us to choose optical tracking over competing technologies, mainly because it would allow for all computation to be done on a regular PC workstation. Within the domain of optical tracking, outside-in trackers, usually composed of wall- or tripod-mounted cameras directed towards the center of the interaction volume, exhibit better scalability (in a monetary, not a computational sense) to multiple targets than inside-out self-trackers. Additionally, the installation and configuration of outside-in trackers is usually faster and easier than that of existing inside-out systems.

The basic principle of optical outside-in tracking lies in estimating the pose of known rigid 3D feature-point constellations, commonly referred to as "targets", from their projections in different views. To simplify the task of identifying relevant features in a camera image, targets are usually built from clusters of light-emitting diodes ("active targets") or retro-reflective spheres ("passive targets"). In the latter case, which we opted for, all cameras must be equipped with a strobelight. The added expense is an initial disadvantage, but is made up for by the fact that passive targets are cheaper to build and maintain, because they do not require a power source. As long as the system is operating under controlled lighting conditions, which is usually the case in an indoor environment, the choice of near-IR strobelights (typically in the 750-950 nm range) combined with optical band-pass filters helps to avoid user irritation and increases the speed and accuracy of the image segmentation by reducing unwanted contributions from ambient light sources.

Accuracy and Precision: While of lesser importance for fully synthetic virtual environments, closely registered AR overlays impose rather stringent requirements on tracking precision and accuracy. We consider a location resolution of less than 1.0 mm, with a minimum accuracy of ±5.0 mm, and an angular precision of better than 0.05 degrees as adequate for a wide range of AR applications.

Latency: It is commonly accepted that HMD rendering latencies beyond 40-60 ms will start to negatively affect user performance, possibly inducing symptoms of "simulator sickness" (i.e. fatigue, loss of concentration). According to recent experiments [5], trained users are able to discriminate variations of end-to-end latencies as small as 15 ms. To be suitable for use with an HMD, a tracker's intrinsic latency should under no circumstances exceed 30-40 ms.

Update rate: We attempted to match our tracker's measurement frequency to the update rates of the visual output devices it would eventually be combined with. Suitable cameras within our price range were limited to a maximum of 30-60 frames per second. Incidentally, the current generation of lower-end HMDs (e.g. eMagin's Z800, http://www.3dvisor.com) offers refresh rates of no more than 60 Hz. Hence, no benefit would be gained from providing higher update rates.

3 HARDWARE DESIGN

Figure 2: Hardware design. (a) Development prototypes from early 2005 (left), mid-2006 (middle) and final camera design (right). (b) Trigger box.

3.1 Cameras

Our most recent camera design (see Figure 2a) is based on the FireFly MV monochrome OEM board camera from Point Grey Research (http://www.ptgrey.com/products/fireflymv), which is built around the Micron FFMV-03MTM 1/3" global-shutter CMOS-sensor and capable of delivering a video resolution of 640x480 at 60 frames per second. It has external trigger and strobe functionality, and costs under USD 200. After attaching a 3.6 mm f/2.0 microlens with an infrared band-pass filter (RG-780), the camera module is mounted inside a 71x65x40 mm aluminum enclosure, next to a 60-chip LED-array with a peak wavelength of 850 nm, radiant intensity of 0.4 W/sr (watts per steradian) and viewing half-angle of ±60 degrees. The design further includes an opto-isolated trigger input, constant-current LED driver and the necessary circuitry for interfacing with the camera board's TTL I/O pins. The device is powered by an external 12 VDC (1 A) power supply and connects to a PC workstation via a 6-pin FireWire (1394a) cable, which can reach a length of up to 10 meters.

Multiple cameras are shutter-synchronized from a 20 mA square-wave current loop signal generated by a synchronization unit (see Figure 2b) with a built-in programmable oscillator. The synchronization unit supports arbitrary phase-shifting between cameras, and can be locked onto an external stereo synchronization signal.

Figure 3: L-shaped room calibration target, four-marker target, single-marker extrinsic calibration target and building materials.

3.2 Targets

We build our passive retro-reflective targets from self-adhesive sheets of 3M ScotchLite Silver Transfer Film which are cut into shape and wrapped around solid nylon (PA 66) spheres of varying diameters (between 12 and 20 mm, depending on the intended use of the target). The reflective spheres are subsequently attached to a cone-shaped nylon base via thin carbon-fiber rods. We chose nylon over other plastics for its good machinability and superior adhesive properties, and carbon-fiber for its structural strength and resistance against breaking and permanent bending. Figure 3 shows the materials used and the resulting final targets.

4 SOFTWARE ARCHITECTURE

The diagram in Figure 4 depicts the organization of our system's online image-processing pipeline, with the exception of camera calibration, which is treated as a separate offline step. Given a calibrated camera rig, the runtime behavior is as follows:

After a new frame is received from an attached camera (Video Input), an image segmentation algorithm identifies the 2D coordinates of every blob feature (i.e. screen-space projection of a 3D marker) in the video image (Feature Segmentation). For every set of shutter-synchronized frames, the system optimally solves the feature correspondence problem across all views (Multiple-View Correlation).
A 3D position is computed for each correlated 2D feature-tuple via projective triangulation (Projective Reconstruction). The resulting 3D point cloud is examined to find sub-structures that are similar to a pre-calibrated target (Model-Fitting). If a target is discovered, its position and orientation with respect to the tracker's world coordinate system is estimated (Pose Estimation). In order to reduce pose jitter and to compensate for the tracker's intrinsic latency, the resulting pose estimate is fed into a recursive filter (Predictive Filtering). Finally, the filtered pose estimate is handed to a middleware component responsible for streaming the tracking data to clients over a network (Data-Delivery).

Figure 4: Runtime image-processing pipeline (Video Input / File Input / Network Source → Feature Segmentation → Multiple-View Correlation → Projective Reconstruction → Model-Fitting → Pose Estimation → Predictive Filtering → Data Delivery; pose predictions feed back into hypothesis ranking during model-fitting).

5 ALGORITHMS AND IMPLEMENTATION

With the sole exception of intrinsic camera calibration, for which we currently use a MATLAB toolbox, our software framework was written entirely in C++. In addition to modularity and platform-independence, special emphasis has been placed on shared-memory parallelizations of the core algorithms in order to take maximal advantage of modern multi-core CPU architectures. In some instances, we made use of existing public-domain software libraries (such as VRPN [6]), but for the most part, we re-implemented standard textbook techniques. Since their in-depth discussion would lack particular merit, we refer the reader to the pertinent literature [7, 8] on projective geometry for a detailed and exhaustive treatment of the underlying mathematical theory. Noteworthy exceptions are multiple-view feature correlation and model-fitting, both of which do not lend themselves to straight-forward parallel implementations. Our approach was to re-state those problems as combinatorial optimizations on graphs and use a parallel maximum-clique enumeration algorithm to reduce the solution space. Unfortunately, limited space precludes us from giving anything more than a brief overview of those techniques.

5.1 Camera Calibration

Although several multiple-camera calibration methods exist that simultaneously solve for intrinsic and extrinsic parameters (e.g. [9, 10] and others), we chose to treat the two problems separately – primarily because a camera's intrinsic parameters will have to be determined far less frequently (ideally only once) than its exterior position and orientation. Additionally, combined approaches tend to be computationally more expensive.

5.1.1 Intrinsic Parameters

We currently use a variant of Jean-Yves Bouguet's well-known and widely-used Camera Calibration Toolbox for MATLAB (http://www.vision.caltech.edu/bouguetj/calib_doc) with an extension (http://research.graphicon.ru/computer-vision) for the automated corner extraction of non-symmetric checkerboard calibration patterns to calibrate our cameras' intrinsic parameters. The toolbox uses a pinhole camera model with nonlinear radial and tangential distortion compensation which was inspired by Heikkilä [10].

Figure 6: Extrinsic calibration (4 cameras). (a) Projections of the calibration target in multiple views. (b) Reconstructed 3D-path of the calibration target; coloring indicates the average reprojection error (in pixels) at different points inside the volume. (c) Exterior pose of the cameras (not drawn to scale).


Figure 5: Stages of the data-processing pipeline: from raw video images to 6-DOF pose estimates. (a) Four shutter-synchronized video images. (b) Recovered centers of overlapping blobs (via circular Hough transform). (c) 2D features and corresponding epipolar lines. (d) Reconstructed 3D marker positions. (e) Estimated 6-DOF pose of multiple rigid targets.

5.1.2 Extrinsic Parameters

In order to accurately estimate a camera rig's exterior position and orientation, we implemented parts of the automated single-point self-calibration technique published by Svoboda et al. [11] in C++. Instead of using a laser-pointer, we move a single retro-reflective marker ("extrinsic calibration target"), as depicted in Figure 3, through the interaction volume, apply our segmentation algorithm (see Section 5.2) and record its resulting screen-space coordinates. After discarding outliers using pair-wise RANSAC analysis, the point cloud's projective structure is computed by rank-factorization and refined through bundle adjustment. The final step is a Euclidean stratification that converts the projective reconstruction into Euclidean structures. The minimal average reprojection error (all cameras, all points) achieved by this method lies between 0.15 and 0.35 pixels. The running time for six cameras with pre-calibrated intrinsic parameters, and an average of 2500 marker reprojections per camera, is below five minutes. Figure 6 illustrates the calibration procedure.

In a final step, the affine transformation between the (arbitrary) coordinate frame returned by Svoboda's algorithm and a user-defined room coordinate system with real-world distance units (mm) is computed. We place an L-shaped constellation of reflective markers ("room calibration target"), as shown in Figure 3, at the intended coordinate origin inside the working volume. The target's marker positions were previously measured to submillimeter accuracy with an industrial theodolite system (Leica TPS700). Using a scale-invariant version of our model-fitting algorithm (described in Section 5.4), we obtain the set of optimal correspondences between calibration target and observed point cloud. The scaling factor is calculated from the ratio of the sums of entries in the two distance matrices (over multiple frames), and a 3D rigid transformation between the two sets is computed as outlined in Section 5.5.
5.2 Feature Segmentation

A fast, robust and accurate feature segmentation algorithm is essential to the stability of the subsequent projective reconstruction. We designed an adaptive algorithm that combines three standard segmentation techniques (thresholding, luminance-weighted centroid computation, and circular Hough transform) to efficiently locate the centers of all circular shapes in a monochrome image without sacrificing accuracy. Fortunately, the bright circular pixel-blobs formed by the light reflected from our spherical markers are extremely easy to locate in an otherwise dark camera image by simple means of binary thresholding. The application of a non-destructive block-wise threshold (via logical AND-operations) yields image regions containing at least one bright (> 50% luminance) pixel. Only those regions are further examined by subjecting them to a weighted grayscale centroid computation, which recovers the center location of an 8-connected blob to subpixel precision. We use a heuristic (based on the bounding rectangle and sum of pixel luminances) to select certain blobs for further analysis, if they are deemed likely to be composed of overlapping marker reprojections. In this case, a circular Hough transform [12] is applied to recover the center of every circular shape within a confined region-of-interest around the blob. Figure 5b shows a typical case of two markers whose projections overlap, and their recovered center locations.
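As an illustration of the centroid stage, the following sketch computes a luminance-weighted centroid over one thresholded region; the image-buffer layout and all names are our assumptions, not the paper's actual code:

```cpp
// Sketch: luminance-weighted centroid of a bright blob inside a
// region-of-interest of an 8-bit grayscale image (row-major, `stride`
// bytes per row). Pixels below `threshold` are ignored; the rest
// contribute proportionally to their luminance, which yields a
// subpixel-accurate center estimate.
#include <cstdint>

struct Centroid { double x, y; bool valid; };

Centroid weightedCentroid(const std::uint8_t* image, int stride,
                          int x0, int y0, int width, int height,
                          std::uint8_t threshold = 128) {
    double sumW = 0.0, sumX = 0.0, sumY = 0.0;
    for (int y = y0; y < y0 + height; ++y) {
        for (int x = x0; x < x0 + width; ++x) {
            const std::uint8_t lum = image[y * stride + x];
            if (lum < threshold) continue;   // background pixel
            sumW += lum;
            sumX += static_cast<double>(lum) * x;
            sumY += static_cast<double>(lum) * y;
        }
    }
    if (sumW == 0.0) return {0.0, 0.0, false};  // no bright pixel found
    return {sumX / sumW, sumY / sumW, true};
}
```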
Figure 7: Quadratic-time complexity reduction (top): pair-wise marker distances (with uncertainties) define a correlation-score look-up table; the Euclidean distance matrix of the reconstructed 3D point cloud is mapped through this table into a correlation score matrix, whose row sums are accumulated and thresholded. Formulation of the model-fitting problem as maximum-clique search (bottom): entries (3D points) with above-threshold values are selected as candidate vertices, high-scoring candidate vertex-pairs are connected with an undirected edge, and a maximal vertex-clique is found.

5.3 Projective Reconstruction

Minimizing the square-sum of 2D image reprojection errors of a 3D point is widely accepted as the optimal formulation of the projective reconstruction problem [13] (also known as "L2-optimal triangulation"). When the correspondence between projected points in different views is initially unknown, as is the case with our system, it becomes necessary to formulate a feature-correlation strategy.
5.3.1 Multiple-View Feature Correlation
In order to reduce the combinatorial size of the correlation problem, we take advantage of an observation from epipolar geometry: a (very) large algebraic distance between a 2D point in one image and the epipolar line corresponding to a point in another image (also known as "epipolar distance", see Figure 5c) will also yield (very) large reprojection errors after L2-optimal triangulation. This allows us to safely discard every n-tuple of 2D features (from different views) in which the epipolar distance between any two points exceeds a certain large threshold (usually in the range of several pixels). Since the (squared) algebraic epipolar distance can be computed very efficiently, this results in a significant reduction of processing time. We modeled this optimization as a search on a graph whose vertices correspond to the 2D features in all cameras. If we connect every vertex-pair whose epipolar distance does not exceed said threshold with an undirected edge, we can apply a maximum-clique algorithm [14] to obtain a drastically reduced set of correspondence candidates. Optimal correspondences are then chosen from the remaining candidates by ranking them according to their reprojection error (computed via triangulation).
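The pruning test itself is a point-to-line distance check. A sketch, assuming the fundamental matrix for each camera pair has been precomputed from the extrinsic calibration (plain arrays; names are ours):

```cpp
// Sketch: epipolar pruning test for a candidate 2D-feature pair.
// F is the fundamental matrix mapping points in image 1 to epipolar
// lines in image 2; (x1,y1) and (x2,y2) are pixel coordinates. Pairs
// whose (squared) epipolar distance exceeds the threshold can be
// discarded before any triangulation is attempted.
#include <array>

using Mat3 = std::array<std::array<double, 3>, 3>;

bool passesEpipolarTest(const Mat3& F,
                        double x1, double y1,   // feature in image 1
                        double x2, double y2,   // feature in image 2
                        double maxDistancePx) {
    // Epipolar line l = F * (x1, y1, 1)^T in image 2.
    const double l0 = F[0][0] * x1 + F[0][1] * y1 + F[0][2];
    const double l1 = F[1][0] * x1 + F[1][1] * y1 + F[1][2];
    const double l2 = F[2][0] * x1 + F[2][1] * y1 + F[2][2];
    // Distance of (x2, y2) from the line, squared to avoid a sqrt.
    const double algebraic = l0 * x2 + l1 * y2 + l2;
    const double d2 = (algebraic * algebraic) / (l0 * l0 + l1 * l1);
    return d2 <= maxDistancePx * maxDistancePx;
}
```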
5.3.2 Triangulation

Recovering a 3D point from multiple 2D correspondences in calibrated camera images is a well-studied problem with multiple straight-forward solutions (see [13] for a comprehensive survey). In our system, we chose the following approach: after obtaining an initial estimate via the standard "linear" SVD method, we perform bundle adjustment [15] with a Levenberg-Marquardt nonlinear least squares algorithm.
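A sketch of the linear initialization step for the two-view case, written against the Eigen library (our choice for illustration; the paper does not name the linear-algebra code it uses):

```cpp
// Sketch: "linear" SVD triangulation (DLT) of one 3D point from two
// calibrated views. P1 and P2 are 3x4 projection matrices; (x1,y1)
// and (x2,y2) are the matched, undistorted pixel coordinates. The
// result seeds the Levenberg-Marquardt refinement mentioned above.
#include <Eigen/Dense>

Eigen::Vector3d triangulateDLT(const Eigen::Matrix<double, 3, 4>& P1,
                               const Eigen::Matrix<double, 3, 4>& P2,
                               double x1, double y1,
                               double x2, double y2) {
    // Each view contributes two rows of the homogeneous system A*X = 0.
    Eigen::Matrix4d A;
    A.row(0) = x1 * P1.row(2) - P1.row(0);
    A.row(1) = y1 * P1.row(2) - P1.row(1);
    A.row(2) = x2 * P2.row(2) - P2.row(0);
    A.row(3) = y2 * P2.row(2) - P2.row(1);
    // The solution is the right singular vector belonging to the
    // smallest singular value; de-homogenize for Euclidean coordinates.
    Eigen::JacobiSVD<Eigen::Matrix4d> svd(A, Eigen::ComputeFullV);
    const Eigen::Vector4d X = svd.matrixV().col(3);
    return X.head<3>() / X(3);
}
```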
5.4 Model-Fitting

The basic idea behind our model-fitting algorithm is to formulate the underlying combinatorial problem in a way that allows for the application of a maximum-clique search. Prior to running the maximum-clique search (which is NP-hard), we apply a polynomial-time complexity-reduction heuristic that takes advantage of an important constraint: since we are in control over how our targets are designed, we can construct them in a way that optimizes the effectiveness of the complexity-reduction step (see Section 5.4.3 for a detailed discussion).
5.4.1 Complexity Reduction

Our complexity-reduction technique is based on a form of geometric hashing. At startup, we compute the Euclidean marker-distance matrix for every trackable target (i.e. rigid constellation of 3D markers) and store its entries in a one-dimensional look-up table (see Figure 7, first row). Measurement uncertainties are assumed to follow a Gaussian distribution whose standard deviation is chosen to coincide with the tracker's absolute accuracy (see Section 7.3). We refer to the entries stored in this look-up table as correlation scores, because they represent the probability with which any given marker-pair (or more precisely, their separating distance) is part of the same target.

At runtime, we calculate the Euclidean distance between every pair of observed 3D markers for a given frame. For every entry in the Euclidean distance matrix (EDM), we look up its corresponding correlation score and store it in a separate correlation score matrix (CSM). From the CSM, we compute a vector containing the accumulated correlation scores for every observed marker through row- or column-wise summation.

All of these operations can be carried out in a single parallel pass without any inter-thread coordination overhead. A visualization of the procedure is shown in Figure 7 (second row). As can be seen in the image, the CSM will contain mostly near-zero entries, indicating that the majority of marker-pairs do not belong to the same solution-tuple. In other words: the optimal solution to the model-fitting problem is highly unlikely to contain both markers of such a pair.

Depending on the number (n) of model-markers, we also expect every member of the solution-set to have an accumulated correlation score of at least Tn = t · (n − 1), where t ∈ [0.5, 1]. The easiest way to enforce this constraint is to apply the threshold Tn to the accumulation-vector. Afterwards, every marker whose thresholded entry is zero can be eliminated as a candidate.

According to our observations, this elimination procedure will reliably reduce the number of candidate markers by up to 90% (in a real-life scenario, not degenerate test-cases), without being prone to false negatives.
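A compact single-threaded sketch of this pass (the paper parallelizes it); the quantization of distances into look-up-table bins is our assumption, since the binning is left unspecified:

```cpp
// Sketch: one frame of the complexity-reduction pass. `lut` maps a
// quantized pair-wise distance to a correlation score in [0, 1]
// (built offline from the targets' distance matrices with Gaussian
// uncertainty); `points` is the reconstructed 3D point cloud.
// Returns the indices of markers surviving the threshold T_n = t*(n-1).
#include <array>
#include <cmath>
#include <cstddef>
#include <vector>

std::vector<int> candidateMarkers(
        const std::vector<std::array<double, 3>>& points,
        const std::vector<double>& lut,
        double binSizeMM, int n, double t) {
    const std::size_t m = points.size();
    std::vector<double> accumulated(m, 0.0);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = i + 1; j < m; ++j) {
            const double dx = points[i][0] - points[j][0];
            const double dy = points[i][1] - points[j][1];
            const double dz = points[i][2] - points[j][2];
            const double dist = std::sqrt(dx * dx + dy * dy + dz * dz);
            // EDM entry -> correlation score via the look-up table.
            const std::size_t bin = static_cast<std::size_t>(dist / binSizeMM);
            const double score = bin < lut.size() ? lut[bin] : 0.0;
            accumulated[i] += score;   // row-wise summation of the CSM
            accumulated[j] += score;
        }
    }
    const double Tn = t * (n - 1);     // minimum plausible accumulation
    std::vector<int> candidates;
    for (std::size_t i = 0; i < m; ++i)
        if (accumulated[i] >= Tn) candidates.push_back(static_cast<int>(i));
    return candidates;
}
```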
5.4.2 Graph Search

After complexity-reduction, the remaining candidate markers are modeled as vertices in an initially unconnected graph. Every vertex-pair whose correlation score lies above threshold t is connected with an undirected edge. Figure 7 (bottom row) shows the resulting graph structure. As can be seen, the optimal solution to the model-fitting problem for a target of n markers is a vertex-clique of size n, which can be obtained through a maximum-clique search (a sketch follows after this list). We have to account for three possible outcomes of the search:

• Single solution: The graph has exactly one vertex-clique of size n. This is the ideal (and most common) case. No further processing is necessary.

• Multiple solutions: The graph has multiple vertex-cliques of size n. We choose the solution with the minimal square-sum of point-wise model-alignment errors as "optimal fit". If an estimate of the expected target pose is available (from a previous frame or predictive filter), we can also use linear and angular distances from the expected pose as ranking criteria.

• No solutions: The graph does not have any vertex-cliques of size n, most likely due to occlusions or the absence of the target from the working volume. If n = 3, the search is stopped, because we need at least three point-matches to estimate the target's orientation. Otherwise (if n > 3), we start a new search for model-substructures of size (n − 1), which involves adding previously discarded markers with an accumulation score above Tn−1 = t · (n − 2) to the graph.

The complexity-reduction procedure described in Section 5.4.1 needs to run only once for each frame, whereas the graph search is repeated at least once per trackable target. We generally search for targets in descending order of their size (i.e. number of model-markers), but prioritize targets that were discovered in a recent frame over those for which no recent observation exists. Once the solution to the model-fitting problem has been found for a specific target, its corresponding markers are removed from the 3D point cloud before the search for the next target continues.
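For illustration, a plain recursive branch-and-bound clique search over the candidate graph; the paper's implementation is a parallel algorithm after Östergård [14], so this is only a minimal stand-in:

```cpp
// Sketch: unoptimized maximum-clique search on an adjacency matrix.
// `best` ends up holding one largest clique; a target of n markers is
// considered found iff best.size() == n.
#include <cstddef>
#include <vector>

static void extendClique(const std::vector<std::vector<bool>>& adj,
                         std::vector<int>& current,
                         const std::vector<int>& allowed,
                         std::vector<int>& best) {
    if (current.size() > best.size()) best = current;
    for (std::size_t k = 0; k < allowed.size(); ++k) {
        // Bound: even taking every remaining vertex cannot beat `best`.
        if (current.size() + (allowed.size() - k) <= best.size()) return;
        const int v = allowed[k];
        std::vector<int> next;            // later vertices adjacent to v
        for (std::size_t l = k + 1; l < allowed.size(); ++l)
            if (adj[v][allowed[l]]) next.push_back(allowed[l]);
        current.push_back(v);
        extendClique(adj, current, next, best);
        current.pop_back();
    }
}

std::vector<int> maximumClique(const std::vector<std::vector<bool>>& adj) {
    std::vector<int> current, best;
    std::vector<int> allowed(adj.size());
    for (std::size_t i = 0; i < adj.size(); ++i)
        allowed[i] = static_cast<int>(i);
    extendClique(adj, current, allowed, best);
    return best;
}
```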
(“target templates”), which leaves the variation of rod lengths as
• Multiple solutions: The graph has multiple vertex-cliques of the only means of marker-placement. In Figure 8, those lengths are
size n. We choose the solution with the minimal square-sum denoted L1..4 .
of point-wise model-alignment errors as “optimal fit”. If an The problem of determining the optimal rod lengths for a set of
estimate of the expected target pose is available (from a pre- targets amounts to a global maximization of the objective function
vious frame or predictive filter), we can also use linear and f (L1 , L2 , ..., Ln ) = minimum pair-wise distance (di , d j ). Although
angular distances from the expected pose as ranking criteria. there are are better global optimization techniques for this kind of
problem, we achieved very good results with a simple genetic algo-
• No solutions: The graph does not have any vertex-cliques rithm. In our experiment, the minimum difference of distances for
of size n, most likely due to occlusions or the absence of a set of 6 targets with a total of 28 markers came out to 3 mm (with
the target from the working volume. If n = 3, the search is a 0.05-quantile of 5 mm).
5.5 3D-3D Pose Estimation

Several closed-form solutions exist for computing the 3D rigid transformation between two sets of point-correspondences. We chose to implement the approach developed by Arun et al. [16], which is based on computing the singular-value decomposition of a derived matrix and was found equivalent [17] to other closed-form pose estimators (e.g. Horn's quaternion-method [18]).
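A sketch of Arun et al.'s method, again written against Eigen (our library choice):

```cpp
// Sketch: SVD-based rigid transform between corresponding point sets,
// after Arun et al. [16]. Returns R and T such that
// observed[i] ~= R * model[i] + T (in the least-squares sense).
#include <Eigen/Dense>
#include <cstddef>
#include <vector>

void estimateRigidTransform(const std::vector<Eigen::Vector3d>& model,
                            const std::vector<Eigen::Vector3d>& observed,
                            Eigen::Matrix3d& R, Eigen::Vector3d& T) {
    const std::size_t n = model.size();
    Eigen::Vector3d cm = Eigen::Vector3d::Zero(), co = Eigen::Vector3d::Zero();
    for (std::size_t i = 0; i < n; ++i) { cm += model[i]; co += observed[i]; }
    cm /= double(n);  co /= double(n);           // centroids of both sets

    Eigen::Matrix3d H = Eigen::Matrix3d::Zero(); // cross-covariance matrix
    for (std::size_t i = 0; i < n; ++i)
        H += (model[i] - cm) * (observed[i] - co).transpose();

    Eigen::JacobiSVD<Eigen::Matrix3d> svd(
        H, Eigen::ComputeFullU | Eigen::ComputeFullV);
    R = svd.matrixV() * svd.matrixU().transpose();
    if (R.determinant() < 0.0) {                 // guard against reflection
        Eigen::Matrix3d V = svd.matrixV();
        V.col(2) *= -1.0;
        R = V * svd.matrixU().transpose();
    }
    T = co - R * cm;                             // translation from centroids
}
```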
5.6 Predictive Filtering

In order to reduce pose jitter and to compensate for the tracker's intrinsic latency, we currently employ predictive double-exponential smoothing [19], which is a viable alternative to (non-extended) Kalman filtering. It is much faster, while performing only marginally worse.
For easy integration with existing VR/AR frameworks, we chose to
base our system’s server module the on the public-domain tracking
0.2 0.267
middleware VRPN [6]. An extended VRPN protocol format lets

average reprojection error (pixels)


client applications access additional information, such as the aver-
0.15
age marker reprojection errors or 2D blob feature coordinates. 0.264

0.1
95% confidence bound
0.261

Z (mm)
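On the client side, standard pose data can be received with stock VRPN calls; a minimal sketch (the server name "Tracker0" is an assumption, and handlers for the extended protocol fields are omitted):

```cpp
// Sketch: minimal VRPN client receiving 6-DOF pose updates from the
// tracking server. Only the standard vrpn_Tracker channel is shown.
#include <vrpn_Tracker.h>
#include <cstdio>

void VRPN_CALLBACK handlePose(void* /*userData*/, const vrpn_TRACKERCB t) {
    std::printf("sensor %ld: pos=(%.4f %.4f %.4f) quat=(%.4f %.4f %.4f %.4f)\n",
                static_cast<long>(t.sensor), t.pos[0], t.pos[1], t.pos[2],
                t.quat[0], t.quat[1], t.quat[2], t.quat[3]);
}

int main() {
    vrpn_Tracker_Remote tracker("Tracker0@localhost");  // assumed server name
    tracker.register_change_handler(nullptr, handlePose);
    for (;;) tracker.mainloop();   // pump the VRPN connection
}
```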
6 DISTRIBUTED PROCESSING

When we wrote our tracking server software, we made two provisions for a future scenario in which several networked tracking systems would form an arrangement of multiple overlapping cells covering larger indoor working volumes:

• Every server instance can synchronously serialize its entire internal state (to a compressed buffer) at any point in the image-processing pipeline. The runtime penalty for such an operation is approximately 1 ms.

• Every non-temporary variable can be (synchronously) read or (asynchronously) written remotely via an XML-RPC communication protocol.
Next, we placed a single marker near the center of the interaction
7 P RELIMINARY R ESULTS volume and recorded its reconstructed 3D position for 36 seconds,
resulting in 1800 measurements. The 3D scatter-plot in Figure 10
This section provides an initial assessment of our systems’s ac-
illustrates the spatial distribution of the reconstructed marker posi-
curacy, precision and latency for a room-sized four-camera setup.
tions, whose RMS distance from their mean-normalized center (see
Currently, our tracking server application runs on GNU/Linux (us-
Figure 11 for a histogram) equals 0.04 mm (σ = 0.02 mm). The
ing libdc1394) and Microsoft Windows (using either DirectShow or
95% confidence radius of the distribution lies at 0.07 mm.
Point Grey’s FlyCapture SDK for video acquisition). Unless other-
wise indicated, all experiments were conducted on a system with
two AMD Opteron 280 processors clocked at 2.4 GHz, running un- 600
number of measurements

der SUSE Linux 10.1. The extrinsic camera calibration at the time 500
of the experiments is the same as shown in Figure 6, yielding an ap- 400
proximate working volume of 2.75 x 2.75 x 2.25 m (about 17m3 ). 300

200
7.1 Latency

To determine our system's intrinsic latency, we built a customized USB device (based on the FT2232D chip) to trigger the cameras from within our tracking software and used the kernel's internal high-resolution clock to time the average interval between the start of the shutter integration and the end of the 6-DOF pose estimation for the current frame. For this experiment, the shutter integration time was set to 1.5 ms. Figure 9 shows the latency measurements we obtained by moving a varying number of pre-calibrated 6-DOF targets through the tracker's working volume. The constellation depicted in Figure 5 represents one such test case.

Figure 9: Intrinsic latency under varying computational load (1-5 targets vs. 6-10 targets, plotted against the total number of markers across all cameras).

7.2 Static Jitter

Next, we placed a single marker near the center of the interaction volume and recorded its reconstructed 3D position for 36 seconds, resulting in 1800 measurements. The 3D scatter-plot in Figure 10 illustrates the spatial distribution of the reconstructed marker positions, whose RMS distance from their mean-normalized center (see Figure 11 for a histogram) equals 0.04 mm (σ = 0.02 mm). The 95% confidence radius of the distribution lies at 0.07 mm.

Figure 10: Distribution of position estimates for a static marker.

Figure 11: Static positional jitter of a single 3D marker (histogram of distances from the mean-normalized center).
Analogous to the previous experiment, we placed a pre-calibrated 5-marker target inside the tracking area and recorded its estimated 6-DOF pose for 2000 frames without any prediction or filtering. As depicted in Figure 12, the RMS angular distance from the mean-normalized pose was computed at 0.014° (σ = 0.007°).

Figure 12: Static angular jitter of a 5-marker target (histogram of shortest angular distances from the mean-normalized pose).

7.3 Accuracy

To assess the tracker's absolute accuracy, we rigidly attached two markers to the ends of a square-cut aluminum rod of 0.4 m length and surveyed their exact distance to submillimeter accuracy (using a theodolite). We then moved the rod through our tracking volume and recorded the variations in distance measurements between the two markers for 9700 subsequent frames. The resulting RMS distance error was measured at 5.3 mm (σ = 5.2 mm). The histogram in Figure 13 shows the distribution of distance errors.

Figure 13: Accuracy of distance measurements (difference between observed and expected marker distance).
8 FUTURE WORK

Ideas for the improvement of our system are manifold. Due to the significant cost-reduction, a range of new VR/AR applications becomes affordable and viable. Medium-term plans include the use of a larger number of cameras (> 20) in an arrangement of multiple overlapping cells to cover larger indoor volumes. We envision the use of multiple networked PCs with dynamic load-balancing, because there are architectural limits to the number of cameras that can be attached to a single PC. However, on a recently acquired workstation equipped with two Intel Xeon DP X5355 processors (a total of 8 cores), the mainboard theoretically allows attaching 36 cameras (with specifications as described in Section 3.1) using 2 on-board and 4 quad-channel add-on 1394-controllers.

Additionally, we plan to investigate whether general-purpose computing on modern GPUs would allow us to increase the performance of core algorithms.

9 CONCLUSION

This paper laid out the hardware and software design for a low-cost infrared-optical pose-tracking system. Our goal was to minimize the cost of the system without sacrificing speed or accuracy. It is our expressed intention to make our system available not only to other members of the academic community, but also to artists, game designers and small commercial application developers with an interest in immersive Virtual and Augmented Reality.

ACKNOWLEDGEMENTS

This work was funded in part by the Austrian Science Fund FWF project P19265. The authors wish to extend a special note of appreciation to Christian Breiteneder for his support throughout the project.

REFERENCES

[1] G. Welch, G. Bishop, L. Vicci, S. Brumback, K. Keller, and D. Colucci. The HiBall tracker: high-performance wide-area tracking for virtual and augmented environments. In VRST '99: Proceedings of the ACM Symposium on Virtual Reality Software and Technology, pages 1–10, New York, NY, USA, 1999. ACM Press.
[2] E. Foxlin and L. Naimark. VIS-Tracker: a wearable vision-inertial self-tracker. In Proceedings of the IEEE Conference on VR, 2003.
[3] M. Ribo. State of the art report on optical tracking. Technical Report TR-VRVis-2001-025, 2001.
[4] H. Kaufmann and D. Schmalstieg. Mathematics and geometry education with collaborative augmented reality. Computers & Graphics, 27(3):339–345, Jun 2003.
[5] K. Mania, B. D. Adelstein, S. R. Ellis, and M. I. Hill. Perceptual sensitivity to head tracking latency in virtual environments with varying degrees of scene complexity. In APGV '04: Proceedings of the 1st Symposium on Applied Perception in Graphics and Visualization, pages 39–47, New York, NY, USA, 2004.
[6] R. M. Taylor, T. C. Hudson, A. Seeger, H. Weber, J. Juliano, and A. T. Helser. VRPN: a device-independent, network-transparent VR peripheral system. In Proceedings of the ACM Symposium on Virtual Reality Software and Technology, pages 55–61, 2001.
[7] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, second edition, 2004.
[8] O. Faugeras, Q. T. Luong, and T. Papadopoulo. The Geometry of Multiple Images: The Laws That Govern the Formation of Images of a Scene and Some of Their Applications. MIT Press, Cambridge, MA, USA, 2001.
[9] X. Chen, J. Davis, and P. Slusallek. Wide area camera calibration using virtual calibration objects. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, pages 520–527, 2000.
[10] J. Heikkilä and O. Silvén. A four-step camera calibration procedure with implicit image correction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1106–1112, 1997.
[11] T. Svoboda, D. Martinec, and T. Pajdla. A convenient multi-camera self-calibration for virtual environments. PRESENCE: Teleoperators and Virtual Environments, 14(4):407–422, August 2005.
[12] C. Kimme, D. Ballard, and J. Sklansky. Finding circles by an array of accumulators. Communications of the ACM, 18(2):120–122, 1975.
[13] R. Hartley and P. Sturm. Triangulation. Journal of Computer Vision and Image Understanding, 68(2):146–157, 1997.
[14] P. R. J. Östergård. A fast algorithm for the maximum clique problem. Journal of Discrete Applied Mathematics, 120(1-3):197–207, 2002.
[15] B. Triggs, P. McLauchlan, R. Hartley, and A. Fitzgibbon. Bundle adjustment – a modern synthesis. In B. Triggs, A. Zisserman, and R. Szeliski, editors, Vision Algorithms: Theory and Practice, LNCS, pages 298–375. Springer Verlag, 2000.
[16] K. S. Arun, T. S. Huang, and S. D. Blostein. Least-squares fitting of two 3-D point sets. IEEE Transactions on Pattern Analysis and Machine Intelligence, 9:699–700, 1987.
[17] D. W. Eggert, A. Lorusso, and R. B. Fisher. Estimating 3-D rigid body transformations: a comparison of four major algorithms. Journal of Machine Vision and Applications, 9(5-6):272–290, 1997.
[18] B. K. P. Horn. Closed-form solution of absolute orientation using unit quaternions. Journal of the Optical Society of America, 4(4):629–642, April 1987.
[19] J. LaViola. Double exponential smoothing: an alternative to Kalman filter-based predictive tracking. In Proceedings of the Workshop on Virtual Environments (EGVE '03), pages 199–206, New York, NY, USA, 2003.
