
Tracking by Instance Detection: A Meta-Learning Approach

Guangting Wang¹, Chong Luo², Xiaoyan Sun², Zhiwei Xiong¹, Wenjun Zeng²
¹University of Science and Technology of China, ²Microsoft Research Asia
wgting96@gmail.com, {cluo, xysun, wezeng}@microsoft.com, zwxiong@ustc.edu.cn

This work was done while Guangting was an intern with MSRA.

Abstract

We consider the tracking problem as a special type of object detection problem, which we call instance detection. With proper initialization, a detector can be quickly converted into a tracker by learning the new instance from a single image. We find that model-agnostic meta-learning (MAML) offers a strategy to initialize the detector that satisfies our needs. We propose a principled three-step approach to build a high-performance tracker. First, pick any modern object detector trained with gradient descent. Second, conduct offline training (or initialization) with MAML. Third, perform domain adaptation using the initial frame. We follow this procedure to build two trackers, named Retina-MAML and FCOS-MAML, based on two modern detectors, RetinaNet and FCOS. Evaluations on four benchmarks show that both trackers are competitive against state-of-the-art trackers. On OTB-100, Retina-MAML achieves the highest ever AUC of 0.712. On TrackingNet, FCOS-MAML ranks first on the leaderboard with an AUC of 0.757 and a normalized precision of 0.822. Both trackers run in real time at 40 FPS.

[Figure 1 image. Labels: Training image, Test image, Before adaptation, Domain Adaptation, After adaptation.]

Figure 1: MAML provides an effective way to initialize an instance detector. With a single training image, the detector can be quickly adapted to the new domain (instance). It is capable of locating the target object in subsequent frames even when the object has significant appearance changes.

1. Introduction

Given a bounding box defining the target object in the initial frame, the goal of visual object tracking is to automatically determine the location and extent of the object in every frame that follows. The tracking problem is closely related to the detection problem, and it can even be treated as a special type of object detection, which we call instance detection. The major difference is that object detection locates objects of some predefined classes and its output does not differentiate between intra-class instances. Object tracking, in contrast, only looks for a particular instance, which may belong to any known or unknown object class, that is specified in the initial frame.

Given the similarity between the two tasks, some object detection techniques are used extensively in object tracking. For example, the region proposal network (RPN), which was proposed in the Faster R-CNN detector [30], has been adopted in the SiamRPN tracker and its variants [21, 20, 43]. The introduction of multi-aspect-ratio anchors solves the box estimation problem that has been plaguing previous trackers. It has greatly improved the performance of siamese-network-based trackers. More recently, the IoU network [14], which is again an innovation in object detection, is applied to object tracking by ATOM and DiMP [6, 3] and demonstrates powerful capabilities.

In addition to these approaches that borrow advanced components from object detection to assemble a tracker, we believe that another option is to directly convert a modern object detector into a high-performance tracker. This will allow the tracker to retain not only the advanced components but also the overall design of the base detector. The main challenge is how to obtain a good initialization of the detector so that once a new instance is given, it can efficiently infuse the instance information into the network without overfitting. Fig. 1 illustrates the idea. The detector may behave like a general object detector before adaptation. But after domain adaptation with a single training image, it is able to "memorize" the target and correctly locate the target in subsequent frames. A recent work by Huang et al.

[13] shares a similar vision with us but they still treat tracking as a two-step task, namely class-level object detection and instance-level classification. In the first sub-task, a template image is involved and a separate branch is employed to process the template.

In this work, we are looking for a neat solution to realize our idea. The constructed tracker will just look like a normal detector, without additional branches or any other modifications to the network architecture. We find that model-agnostic meta-learning (MAML) [10] offers a learning strategy with which a detector can be initialized as we desire. Based on MAML, we propose a three-step procedure to convert any modern detector into a high-performance tracker. First, pick any detector which is trained with gradient descent. Second, use MAML to train the detector on a large number of tracking sequences. Third, when the initial frame of a test sequence is given, fine-tune the detector with a few steps of gradient descent. A decent tracker can be obtained after this domain adaptation step. During tracking, when new appearances of the target are collected, the detector can be trained with more samples to achieve an even better adaptation capability.

Following the proposed procedure, we build two instance detectors, named Retina-MAML and FCOS-MAML, based on the advanced object detectors RetinaNet [24] and FCOS [34]. During offline training, we further introduce a kernel-wise learnable learning rate in MAML to improve the expressive ability of gradient-based updating. Evaluations of the trackers are carried out on four major benchmarks, including OTB, VOT, TrackingNet and LaSOT. System comparisons show that both trackers achieve competitive performance against state-of-the-art (SOTA) trackers. On OTB-100, Retina-MAML and FCOS-MAML appear to be the best-performing trackers, with AUCs of 0.712 and 0.704, respectively. Retina-MAML achieves an EAO of 0.452 on VOT-2018. FCOS-MAML achieves an AUC of 0.757 on TrackingNet, ranking number one on the leaderboard. Additionally, both trackers run in real time at 40 FPS.

2. Related Work

2.1. CNN-based visual object tracking

With the great success of deep learning and convolutional neural networks (CNN) in various computer vision tasks, an increasing number of CNN-based trackers have emerged. We divide CNN-based trackers into two categories, depending on whether an explicit template is used. Most siamese-network-based trackers [2, 21, 20, 36] fall into the first category, which we call template-based methods. The target appearance information is stored in an explicit template. In SiamFC [2], features are extracted from the template and the search region using the same offline-trained CNN. A cross-correlation operation is then adopted to compute the matching scores. A main drawback of SiamFC is that it only evaluates the candidates with the same shape as the initial box. SiamRPN [21] solves this problem by borrowing the RPN idea from object detectors. Later, SPM-Tracker [36] borrows the architecture from two-stage detectors and achieves an improved performance. Currently, the best-performing trackers in this category are ATOM [6] and DiMP [3], which leverage the most advanced IoUNet [14] for precise object localization.

Template-based methods usually run very fast, because the CNN used to extract features does not need to be updated online. However, as the tracking proceeds, new target appearances should be integrated into the template for a better performance. But most methods lack an effective model for online template updating. This limitation has created a performance ceiling for template-based trackers.

The other category is template-free methods [27, 28, 15], which intend to store the target appearance information within the neural network, in the form of fine-tuned parameters. The challenge in designing template-free trackers is how to quickly infuse the instance information into the network without overfitting. MDNet [28] divides the CNN into shared layers and domain-specific layers. The shared layers provide a reasonable initialization and the domain-specific layers are trained online with the new instance. Due to the limitation of the conventional training strategy, MDNet takes many iterations to converge, and fewer iterations cause serious performance degradation. As a result, MDNet is too slow to be used in real-time scenarios.

We find that template-free trackers are neat solutions. They do not need to maintain an external template and the network architecture looks just like a detector. Domain adaptation and online update can be achieved by a unified online training procedure. However, it is still quite challenging to achieve a good performance-speed tradeoff for this type of tracker.

2.2. Meta learning and its application to tracking

The goal of meta-learning is to train a model on a variety of learning tasks, such that it can solve new learning tasks using only a small number of training samples [10]. When we view object tracking as an instance detection task, the tracker is trained on a variety of instance detection tasks so that it can quickly learn how to detect a new instance using only one or a few training samples from the initial or previous frames. We find that the tracking task is a perfect example to apply meta-learning.

Model-agnostic meta-learning (MAML) [10] is an important algorithm for meta-learning. It helps the network to learn a set of good initialization parameters that are suitable for fine-tuning. During training, the parameters of the model are explicitly trained such that a small number of

[Figure 2 image. Labels: Params; Update (repeated); Training loss 1, 2, ..., N; Test loss 0, 1, ..., N-1, N. Legend: forward computation, inner-level gradients, outer-level gradients.]

Figure 2: Illustration of our training pipeline. The first row is the inner training loop. A few steps of SGD optimization are performed on the support images. The updated parameters after each step are used for calculating the meta-gradient based on testing images. Best viewed in color.

gradient steps with a small amount of training data from a new task will produce good generalization performance on that task. The most striking merit of MAML is that it is compatible with any model trained with gradient descent and applicable to a variety of different learning problems. Because of this, MAML is a perfect candidate to realize our idea, which is to convert any advanced object detector (trained with gradient descent) into a tracker. Later, MAML++ [1] introduces a set of tricks to stabilize the training of MAML. MetaSGD [23] proposes to train learnable learning rates for every parameter. In the area of object tracking, Meta-Tracker [29] is the first to use MAML for the domain adaptation step of MDNet. MetaRTT [16] further applies MAML for the online updating step. Basically, their main purpose is to accelerate the online training of existing trackers, including MDNet [28], CREST [31] and RT-MDNet [15]. We argue that, since meta-learning provides a mechanism to quickly adapt a deep network to model a particular object and avoid overfitting, why not directly convert a modern object detector into a tracker, instead of making a slow tracker faster? Huang et al. [13] have the same idea. They propose to learn a meta layer in the detection head by MAML. However, they still introduce a template in the first part of the tracker, called class-level object detection. The complex design results in a slow speed.

3. Learning an Instance Detector with MAML

The key to converting a detector into an instance detector (a tracker) is to provide a good initialization of the detector, so that it can quickly adapt to a new instance when only the initial frame is available. In this section, we present the approach to learn an instance detector with MAML. The complete steps to construct a tracker will be detailed in the next section. The training data in this learning step are videos with ground-truth labeling of the target object on each frame.

Formally, given a video $V_i$, we collect a set of training samples, denoted by $D_i^s$. It is also called the support set in meta-learning. A detector model is defined as $h(x; \theta_0)$, where $x$ is the input image and $\theta_0$ denotes the parameters of the detector. We update the detector on the support set by a $k$-step gradient descent (GD) algorithm:

$$\theta_k \equiv \mathrm{GD}_k(\theta_0, D_i^s), \quad \text{and} \quad \theta_k = \theta_{k-1} - \alpha \frac{1}{|D_i^s|} \sum_{(x,y) \in D_i^s} \nabla_{\theta_{k-1}} L(h(x; \theta_{k-1}), y), \tag{1}$$

where $L$ is the loss function and $(x, y)$ is a data-label pair in the support set. The procedure in Eqn. (1) is called inner-level optimization. To evaluate the generalization ability of the trained detector, we collect another set of samples $D_i^t$ from the same video $V_i$, called the target set. We calculate the loss on the target set by applying the trained detector, which can be written as:

$$F(\theta_0, D_i) = \frac{1}{|D_i^t|} \sum_{(x,y) \in D_i^t} L(h(x; \theta_k), y) \tag{2}$$
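Eqns. (1) and (2) can be sketched numerically as follows. This is only an illustration under stated assumptions: a toy linear model with a squared loss stands in for the detector $h$, and the names `loss_and_grad`, `gd_k` and `target_loss`, the data values, and the learning rate of 0.1 are all hypothetical and not taken from the paper.

```python
import numpy as np

def loss_and_grad(theta, xs, ys):
    """Mean squared loss of the toy model h(x; theta) = x @ theta, and its gradient."""
    residual = xs @ theta - ys
    loss = float(np.mean(residual ** 2))
    grad = 2.0 * xs.T @ residual / len(ys)
    return loss, grad

def gd_k(theta0, support, alpha, k):
    """Inner-level optimization as in Eqn. (1): k plain GD steps on the support set."""
    xs, ys = support
    thetas = [theta0]
    for _ in range(k):
        _, grad = loss_and_grad(thetas[-1], xs, ys)
        thetas.append(thetas[-1] - alpha * grad)
    return thetas  # [theta_0, theta_1, ..., theta_k]

def target_loss(theta_k, target):
    """Loss of the adapted parameters on the target set, as in Eqn. (2)."""
    xs, ys = target
    return loss_and_grad(theta_k, xs, ys)[0]

# Hypothetical toy "video": three support samples and one target sample.
true_theta = np.array([1.5, -0.5])
support_x = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
target_x = np.array([[2.0, -1.0]])
support = (support_x, support_x @ true_theta)
target = (target_x, target_x @ true_theta)

theta0 = np.zeros(2)
thetas = gd_k(theta0, support, alpha=0.1, k=4)
loss_before = target_loss(thetas[0], target)
loss_after = target_loss(thetas[-1], target)
print(loss_before, loss_after)
```

The sketch only evaluates the target loss. Meta-training would additionally differentiate that loss with respect to `theta0` through all inner steps, which in practice is delegated to an automatic differentiation framework.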

Here, $D_i = \{D_i^s, D_i^t\}$ denotes the combined support set and target set. The overall training objective is to find a good initialization status $\theta_0$ for any tracking video. It can be formulated as:

$$\theta^* = \arg\min_{\theta_0} \frac{1}{N} \sum_{i}^{N} F(\theta_0, D_i), \tag{3}$$

where $N$ is the total number of videos. The procedure in Eqn. (3) is called the outer-level optimization, which can be solved by gradient-based methods like Adam [18]. The outer-level gradients are back-propagated through the inner-level computational graph. The only assumption about the detector $h$ is that it is differentiable. Therefore, this approach is readily applicable to most deep learning based detectors.

Fig. 2 illustrates this training pipeline. In the training phase, we only sample a pair of images from the dataset. Following the practice in DaSiamRPN [43], these two images may come from either the same sequence or different sequences. The first image will be zoomed in/out by a constant factor (1.08 in our experiments) so that a support set with three images is constructed for the inner-level optimization. The second image is viewed as the target set, with a single image for calculating the outer-level loss. We use a 4-step GD for the inner-level optimization and the Adam solver [18] for the outer-level optimization. To stabilize the training and strengthen the power of the detector, we make the following modifications to the original MAML algorithm.

Multi-step loss optimization. MAML++ [1] proposes to take the parameters after every step of inner-level GD to minimize the loss on the target set, instead of only using the parameters after the final step. Mathematically, Eqn. (2) can be re-written into:

$$F(\theta_0, D_i) = \frac{1}{|D_i^t|} \sum_{(x,y) \in D_i^t} \sum_{k=0}^{K} \gamma_k L(h(x; \theta_k), y), \tag{4}$$

where $K$ is the number of inner-level steps and $\gamma_k$ is the loss weight for each step. Note that our formulation is slightly different from that in MAML++. The initialization parameter $\theta_0$ (before updating) also contributes to the outer-level loss. In our experiments, we find this trick is crucial for stabilizing the gradients.

Kernel-wise learnable learning rate. In standard MAML, the learning rate $\alpha$ in the inner-level optimization is a predefined constant. MetaSGD [23] proposes to specify a learnable learning rate for each parameter in the model. Therefore, the GD algorithm in Eqn. (1) can be re-written into:

$$\theta_{k+1} = \theta_k - \alpha \odot \frac{1}{|D_i^s|} \sum_{(x,y) \in D_i^s} \nabla_{\theta_k} L(h(x; \theta_k), y), \tag{5}$$

where $\alpha$ is a tensor which has the same size as $\theta_k$. Notation $\odot$ denotes the element-wise product. However, setting up a learning rate for every parameter will double the model size. In contrast, we arrange the learnable learning rates in a kernel-wise manner. Specifically, for a convolution layer with $C_{out}$ output channels, we define a learning rate for each convolutional kernel, and this only introduces an additional $C_{out}$ learnable parameters, which are negligible in the model.

4. Retina-MAML and FCOS-MAML

This section provides the details of the proposed three-step procedure to build a tracker. Specifically, we will present detector choices, offline training details, and the online tracking process for two trackers named Retina-MAML and FCOS-MAML.

4.1. Detectors

As MAML is a model-agnostic learning approach, we are free to choose any modern detector trained with gradient descent as the base to build a tracker. As the first attempt in this direction, we choose two single-stage detectors, which run faster and are easier to manipulate than their two-stage counterparts. However, we do not see any obstacles in using two-stage detectors in our approach.

Single-stage detectors are usually composed of a backbone network and two heads, namely a classification head and a regression head. The backbone network generates feature maps for the input image. Based on the feature maps, the objects are scored and localized.

RetinaNet [24] is a representative single-stage object detector. Each pixel in the feature maps is associated with several predefined prior boxes, or anchors. The classification head is trained to classify whether each anchor has a sufficient overlap with an object. The regression head is trained to predict the relative differences between each anchor and the corresponding ground-truth box. A similar design can be found in many existing detectors, which are grouped into a family of anchor-based detectors.

Recently, the concept of anchor-free detection has received a lot of attention. As the name suggests, no anchor is defined. FCOS [34] is a representative detector in this category. After the backbone network generates the feature maps, the classification head is trained to classify whether each pixel in the feature maps is within the central area of an object. Meanwhile, the regression head directly estimates the four offsets from the pixel to the object boundaries. Fig. 3 depicts the core design difference between anchor-free and anchor-based detectors.

[Figure 3 image. Panels: (a) Anchor-based detector, (b) Anchor-free detector.]

Figure 3: (a) Anchor-based detectors predict the relative differences between the anchor and the ground-truth box. The dotted yellow box represents an anchor. (b) Anchor-free detectors directly estimate four offsets from the pixel to object boundaries.

Next, we make some simplifications to the chosen detectors RetinaNet and FCOS. These simplifications improve the tracker's speed but will not affect the tracking performance. We believe so because visual object tracking is performed frame-by-frame on a video sequence. Subsequent video frames have strong temporal correlations, so the location and extent of the target object in the previous frame provide a close estimate of those in the current frame. Usually, tracking is performed on a square-shaped search region, which is further scaled to a fixed size before being passed to the tracking network. From the tracker's point of view, the size distribution of the target object is very concentrated. Therefore, it is not necessary to use the FPN module, which is mainly adopted to handle large scale variations, in RetinaNet and FCOS. Additionally, the vanilla version of FCOS uses three network heads, one common regression head and two centerness/classification heads. Since tracking only needs to classify target and non-target, we only keep the centerness branch to produce classification scores.

4.2. Offline MAML training

The second step is to initialize a detector with offline MAML training. As the detailed algorithm has been introduced in the previous section, we provide implementation details here.

Network architecture. Fig. 4 depicts the detection network we use for MAML training. In both detectors, the CNN backbone used for feature extraction is ResNet-18. The parameters in the first three blocks are pre-trained with ImageNet and frozen during offline training. The last block (block-5) is discarded so that the stride to output feature maps is 8. We make two independent copies of block-4 and place them in the respective branches. This is not a necessary treatment for our approach to work; it just allows us to analyze the effect of online updating during tracking. For RetinaNet, we pre-define a single anchor box with a size of 64 × 64 pixels. In our experiments we find that this single-anchor setting performs slightly better than the multi-anchor setting in SiamRPN [21].

[Figure 4 image. Labels: Shared layers, Cls. Branch, Reg. Branch. Legend: Frozen, Offline trainable, Online trainable.]

Figure 4: We adopt ResNet-18 as the backbone. The first three blocks are frozen after ImageNet pre-training and block-5 is removed. Block-4 is independently trained in the classification branch and the regression branch during offline training. Online training only involves a subset of trainable layers.

Loss definition. For Retina-MAML, an anchor box is assigned a positive (or negative) label when its intersection-over-union (IoU) overlap with the ground-truth box is greater than 0.5 (or less than 0.3). We use focal loss and smooth L1 loss to train the classification branch and regression branch, respectively. For FCOS-MAML, we adopt L2 loss to supervise the training of centerness scores. The loss function in the regression branch is L1 loss.

Training data. Following other modern trackers [6, 3], we use four datasets for offline training, namely MS-COCO [25], GOT10k [12], TrackingNet [26] and LaSOT-train [8]. In LaSOT and TrackingNet, we only sample one frame for every three or ten frames. The training images are cropped and resized into a resolution of 263 × 263. Standard data augmentation mechanisms like random scaling and shifting are adopted.

Optimization. As noted in Section 3, we use 4-step GD for inner-level optimization during the offline training period. The kernel-wise learnable learning rate $\alpha$ is initialized to 0.001. The multi-step loss weights $\gamma_k$ are initialized as equal contribution and gradually anneal to (0.05, 0.10, 0.2, 0.30, 0.35), giving more weight and attention to later steps. For the outer-level optimization, we adopt the Adam optimizer [18] with a starting learning rate of 0.0001. In each iteration, 32 pairs of images are sampled. The detector is trained for 20 epochs, with 10,000 iterations per epoch. To accelerate the training, we use first-order approximation [1] in the first 15 epochs.

4.3. Online training and tracking

The third step is domain adaptation when a new video sequence is given. In the initial frame, the instance to be

Detector   Domain adaptation   OTB-100 (AUC)   VOT-18 (EAO)   LaSOT (AUC)   TrackingNet (AUC)
Baseline   before              0.460           0.137          0.391         0.601
Baseline   after               0.487           0.174          0.391         0.634
MAML       before              0.464           0.162          0.387         0.626
MAML       after               0.671           0.341          0.511         0.743

Table 1: MAML training allows a detector to quickly adapt to a new domain, and therefore is the key in turning a detector into a tracker.

[Figure 5(a) image, loss curve: training loss and test loss versus number of steps (0 to 25) for the MAML and baseline detectors.]

Algorithm 1 Online tracking algorithm

Input: Frame sequence $\{I_i\}_{i=1}^{N}$, detector $h(\cdot; \theta)$, initialization bounding box $B_1$, update interval $u$.
Output: Tracking results $\{B_i\}_{i=1}^{N}$

    Generate search region image. $S_1 \leftarrow \mathrm{SR}(I_1, B_1)$
    Initialize the support set. $D^s \leftarrow \{\mathrm{DataAug}(S_1)\}$
    Model update in Eqn. (1). $\theta \leftarrow \mathrm{GD}_5(\theta, D^s)$
    for $i = 2, ..., N$ do
        Detect objects represented in bounding box and score. $\{B_{det}^j, c^j\}_{j=1}^{M} \leftarrow h(\mathrm{SR}(I_i, B_{i-1}); \theta)$
        if all $c^j < 0.1$ then
            $B_i \leftarrow B_{i-1}$
            continue
        end if
        Add penalties and window priors to $\{B_{det}^j, c^j\}_{j=1}^{M}$
        Select the box with the highest score $c^*$. $B_i \leftarrow B_{det}^*$
        Linear interpolate shape. $B_i \leftarrow \mathrm{Inter}(B_i, B_{i-1})$
        Update the support set $D^s$.
        if $i \bmod u = 0$ or distractor detected then
            Model update in Eqn. (1). $\theta \leftarrow \mathrm{GD}_1(\theta, D^s)$
        end if
    end for
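The control flow of Algorithm 1 can be sketched in Python as below. This is a hedged illustration rather than the authors' implementation: `ToyDetector`, `interpolate_shape` and `track` are hypothetical names, boxes are assumed to be `(cx, cy, w, h)` tuples, the detector is a stub that reads candidates directly from each toy frame, and the interpolation weight and the score threshold for adding samples (`add_score`) are assumed values. Search-region cropping, data augmentation, penalties, window priors and distractor detection are omitted.

```python
from dataclasses import dataclass, field

@dataclass
class ToyDetector:
    """Stand-in for h(.; theta): it only records the GD step counts it receives."""
    gd_steps: list = field(default_factory=list)

    def update(self, support_set, steps):
        # Placeholder for theta <- GD_steps(theta, D^s).
        self.gd_steps.append(steps)

    def detect(self, frame, prev_box):
        # Placeholder detection: each toy frame is already a list of (box, score).
        return frame

def interpolate_shape(box, prev_box, weight=0.5):
    """Keep the new center and blend width and height with the previous box."""
    cx, cy, w, h = box
    _, _, pw, ph = prev_box
    return (cx, cy, weight * w + (1 - weight) * pw, weight * h + (1 - weight) * ph)

def track(frames, detector, init_box, update_interval,
          min_score=0.1, add_score=0.5, max_buffer=30):
    support = [(frames[0], init_box)]          # support set from the initial frame
    detector.update(support, steps=5)          # domain adaptation with 5-step GD
    boxes = [init_box]
    for i in range(2, len(frames) + 1):        # 1-based frame index, as in Algorithm 1
        frame, prev = frames[i - 1], boxes[-1]
        candidates = detector.detect(frame, prev)
        if all(score < min_score for _, score in candidates):
            boxes.append(prev)                 # no confident detection: reuse previous box
            continue
        best_box, best_score = max(candidates, key=lambda c: c[1])
        box = interpolate_shape(best_box, prev)
        boxes.append(box)
        if best_score > add_score:
            support.append((frame, box))
            if len(support) > max_buffer:
                del support[1]                 # drop the oldest sample, keep the initial one
        if i % update_interval == 0:
            detector.update(support, steps=1)  # online updating with 1-step GD
    return boxes

# Toy run: frame 3 has no candidate scoring at least 0.1.
frames = [
    [],
    [((10.0, 10.0, 4.0, 4.0), 0.9)],
    [((11.0, 10.0, 6.0, 4.0), 0.05)],
    [((12.0, 10.0, 6.0, 6.0), 0.8)],
]
detector = ToyDetector()
boxes = track(frames, detector, init_box=(9.0, 10.0, 4.0, 4.0), update_interval=2)
```

Passing the detector in as an object with `detect` and `update` methods keeps the loop independent of the underlying network, which mirrors the model-agnostic framing of the text.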

Baseline Step 5 Baseline Step 20 Baseline Step 5 Baseline Step 20

tracked is indicated by a ground-truth bounding box. We


generate a patch with resolution 263 × 263 according to the
Training imagse

Test image
given bounding box. As with the offline training, we also
MAML Step 5 MAML Step 20 MAML Step 5 MAML Step 20
adopt zoom in/out data augmentation to construct the sup-
port set. The tracker is updated by a 5-step GD as described
in Eqn. (5).
After domain adaptation, the detector is now capable of
tracking the target object in subsequent frames. For each (b) Visualization
search region patch, the detector locates hundreds of can- Figure 5: Comparison of the MAML detector and the base-
didate bounding boxes, which are then passed to a stan- line detector during domain adaptation. (a) Quantitative
dard post-processing pipeline as suggested in SiamRPN losses on the training image and a testing image. (b) Visu-
[21]. Specifically, shape penalty function and cosine win- alization of the corresponding score maps. MAML detector
dow function are applied to each candidate. Finally, the can- convergences quickly and has strong generalization ability.
didate box with the highest score is selected as the tracking
result and its shape is smoothed by a linear interpolation KLLR in OTB-100 VOT-18 LaSOT TrackingNet
with the result in the previous frame. cls. reg. (AUC) (EAO) (AUC) (AUC)
During tracking, the support set is gradually enlarged. 0.628 0.313 0.490 0.733
The tracker can be online trained at a pre-defined interval X 0.661 0.368 0.502 0.737
based on the updated support set. This process is often X 0.676 0.315 0.504 0.744
called online updating in tracking. If a tracking result has X X 0.704 0.392 0.523 0.757
a score above a predefined threshold, it will be added into
the support set. We buffer at most 30 training images in the Table 2: Ablation analysis of kernel-wise learnable learning
support set. Earlier samples, except the initial one, will be rate. Cls. and reg. denote the classification branch and the
discarded when the number of images exceeds the limit. Af- regression branch, respectively.
ter every n frames (n = 10 in our implementation) or when
a distracting peak is detected (when the peak-to-sidelobe
is greater than 0.7), we perform online updating. In this 5. Experiments
case, we only use 1-step GD to maintain a high tracking
5.1. Ablation study
speed. On average, our tracker can run at 40 FPS on a single
NVIDIA P100 GPU card. The online tracking procedure is Meta-learning is the key in turning a detector into a
summarized in Alg. 1. tracker. In a nutshell, an instance detector can be built by of-

6293
Success plots of OPE − OTB100 Precision plots of OPE − OTB100
Online OTB-100 VOT-18 TrackingNet LaSOT Speed 1 1
0.9 0.9
cls. reg. (AUC) (EAO) (AUC) (AUC) (FPS) 0.8 0.8
Retina−MAML [0.712]
0.671 0.341 0.743 0.511 85 0.7 0.7

Success rate
FCOS−MAML [0.704]

Precision
0.6 SiamRPN++ [0.696] 0.6 Retina−MAML [0.926]
X 0.690 0.394 0.747 0.523 58 0.5 ECO [0.691]
SPM [0.687]
0.5 VITAL [0.918]
SiamRPN++ [0.915]
0.4 0.4 ECO [0.910]
X X 0.704 0.392 0.757 0.496 42 0.3
DiMP [0.686]
VITAL [0.682] 0.3
MDNet [0.909]
FCOS−MAML [0.905]
0.2 MDNet [0.678] 0.2
SPM [0.899]
DiMP [0.899]
ATOM [0.667] MetaTracker [0.880]
0.1 0.1
Table 3: Ablation analysis of the online updating strat- 0
0
MetaTracker [0.658]
0.2 0.4 0.6 0.8 1
0
0 10 20
ATOM [0.879]
30 40 50
Overlap threshold Location error threshold
egy. The baseline tracker without online updating achieves
a good performance-speed tradeoff. Online updating both Figure 6: The success plot and precision plot on OTB-100.
branches is the best choice for tracking short sequences.

5.1.2 Kernel-wise learnable learning rate


fline MAML training and domain adaptation (online train-
ing of the initial frame), and online updating further boosts The model learns information about the target objects from
the performance. In this section, we use FCOS-MAML to gradients. We propose to use learnable learning rates
carry out the ablation study, which is centered on offline (KLLR) in a kernel-wise manner. These learning rates
MAML training and online updating. The experiments are guide the directions of gradients and strengthen the power
conducted on four tracking benchmarks [39, 19, 8, 26], fol- for our model. In this section, we train several FCOS-
lowing the official evaluation protocols. MAML detectors with or without KLLR. Experimental re-
sults in Table 2 show that the model can benefit from KLLR
in both classification branch and regression branch.
5.1.1 Offline MAML training
Without MAML training, one could train a general object detector with standard gradient descent. However, such a detector is not capable of domain adaptation with only a few steps of updating using the samples from the initial frame. To demonstrate the importance of MAML training, we offline train the FCOS detector with standard GD and with MAML on the same dataset. They are called the baseline detector and the MAML detector in this subsection. The performance is presented in Table 1. Without domain adaptation, both detectors perform poorly in the tracking task. This is natural because they do not remember any information about the tracking target. However, after domain adaptation with a 5-step GD, the MAML detector shows a clear advantage over the baseline detector. The AUC on OTB-100 is greatly improved from 0.464 to 0.671. In contrast, the baseline detector only slightly benefits from domain adaptation.
We can get a more intuitive impression of the two detectors from Fig. 5. Fig. 5(a) shows the loss curves of the two detectors during domain adaptation. Note that both detectors use the same GD algorithm in this process, but the MAML detector has a much better adaptation capability. For the training image, the loss of the MAML detector quickly drops to a small value after only one step of GD updating. The convergence speed of the baseline detector is much slower and the loss is still large after 20 steps of updating. The right of Fig. 5(a) shows that the loss on the testing image even rises as the training proceeds. Fig. 5(b) visualizes the response maps on the training and testing images generated by the two detectors. The MAML detector clearly locates the tracking target after 5-step GD in both training and testing images, while the baseline detector does not make any progress even after 20 steps of GD.

5.1.3 Online updating strategy
Our trackers perform two types of online training, one on the initial frame for domain adaptation and the other on the samples collected during tracking. The latter is known as online updating. While domain adaptation is a must-have training procedure for instance detectors, online updating is optional. We first evaluate the simplest baseline, which does not perform online updating at all. Surprisingly, this scheme achieves competitive performance on all four benchmarks, as shown in Table 3. This version of FCOS-MAML can run very fast, at up to 85 FPS. When online updating is adopted, FCOS-MAML achieves increased performance with slightly reduced speed. Comparing the last two rows of Table 3, we have an interesting finding that is contrary to the conventional wisdom. It was previously believed that online updating the regression branch may harm the tracker's performance due to the aggregated errors. However, our results show that, except for the LaSOT dataset, which is composed of very long sequences, online updating both branches seems to be the best choice.

5.2. Comparison with SOTA Trackers
Evaluation on OTB: We evaluate both our trackers, FCOS-MAML and Retina-MAML, on the OTB 2013/50/100 benchmarks [39]. We follow the one pass evaluation (OPE) protocol and report the AUC scores of the success plot. Table 4 compares our trackers with some recent top-performing trackers. On OTB-100, FCOS-MAML and Retina-MAML achieve striking AUC scores of 0.704 and 0.712, respectively. To the best of our knowledge, Retina-MAML is the best-performing tracker ever on OTB.
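The AUC of the success plot used in the OPE protocol can be sketched as follows. This is a minimal illustration, not the official OTB toolkit: it assumes 21 overlap thresholds evenly spaced in [0, 1], counts a frame as successful when its overlap is strictly greater than the threshold, and takes the AUC as the mean success rate over thresholds. The function names and the (x, y, w, h) box format are hypothetical choices made for this example.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def success_curve(overlaps, num_thresholds=21):
    """Success rate (fraction of frames with overlap > t) for each threshold t."""
    thresholds = [i / (num_thresholds - 1) for i in range(num_thresholds)]
    rates = [sum(o > t for o in overlaps) / len(overlaps) for t in thresholds]
    return thresholds, rates


def success_auc(overlaps, num_thresholds=21):
    """Area under the success curve, approximated by the mean success rate."""
    _, rates = success_curve(overlaps, num_thresholds)
    return sum(rates) / len(rates)


# Usage with made-up boxes: one overlap value per frame of a sequence.
predicted = [(0, 0, 10, 10), (5, 0, 10, 10), (20, 20, 10, 10)]
ground_truth = [(0, 0, 10, 10), (0, 0, 10, 10), (0, 0, 10, 10)]
overlaps = [iou(p, g) for p, g in zip(predicted, ground_truth)]
print(success_auc(overlaps))
```

The success rate at a single threshold (for example 0.7, as discussed for Fig. 6) is one point on this curve, while the AUC summarizes the whole curve in a single number.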

                    AUC score (OPE)
Tracker             OTB-2013  OTB-50  OTB-100  Speed (FPS)
CFNet [35]          0.611     0.530   0.568    75
BACF [17]           0.656     0.570   0.621    35
ECO-hc [7]          0.652     0.592   0.643    60
MCCT-hc [37]        0.664     -       0.642    45
ECO [7]             0.709     0.648   0.687    8
RTINet [42]         -         0.637   0.682    9
MCCT [37]           0.714     -       0.695    8
SiamFC [2]          0.607     0.516   0.582    86
SA-Siam [11]        0.677     0.610   0.657    50
RASNet [38]         0.670     -       0.642    83
SiamRPN [21]        0.658     0.592   0.637    200
C-RPN [9]           0.675     -       0.663    23
SPM [36]            0.693     0.653   0.687    120
SiamRPN++ [20]      0.691     0.662   0.696    35
Meta-Tracker [29]   0.684     0.627   0.658    -
MemTracker [41]     0.642     0.610   0.626    50
UnifiedDet [13]     0.656     -       0.647    3
MLT [5]             0.621     -       0.611    48
GradNet [22]        0.670     0.597   0.639    80
MDNet [28]          0.708     0.645   0.678    1
VITAL [32]          0.710     0.657   0.682    2
ATOM [6]            -         0.628   0.671    30
DiMP [3]            0.691     0.654   0.684    43
FCOS-MAML           0.714     0.665   0.704    42
Retina-MAML         0.709     0.676   0.712    40
Table 4: Comparison with SOTA trackers on OTB dataset. Trackers are grouped into CF-based methods, siamese-network-based methods, meta-learning-based methods, and miscellaneous. Numbers in red and blue are the best and the second best results, respectively.

In Table 4, Meta-Tracker and UnifiedDet are two recent trackers which also use MAML to assist online training. Compared with them, our trackers achieve over 8% relative gain in AUC and still run in real-time. For the first time, meta-learning-based methods are shown to be very competitive against the mainstream solutions. The detailed success plot and precision plot on OTB-100 are shown in Fig. 6.

Evaluation on VOT: Our trackers are tested on the VOT-2018 benchmark [19] in comparison with six SOTA trackers. We follow the official evaluation protocol and adopt Expected Average Overlap (EAO), Accuracy, and Robustness as the metrics. The results are reported in Table 5. Retina-MAML achieves the top-ranked performance on the EAO criterion and FCOS-MAML also shows a strong performance. Interestingly, FCOS-MAML has the highest accuracy score among all the trackers. We have observed a similar phenomenon in Fig. 6 for the OTB dataset: FCOS-MAML gets the highest success rates when the overlap threshold is greater than 0.7. This suggests that anchor-free detectors can predict very precise bounding boxes.

Tracker             EAO     Accuracy  Robustness
DRT [33]            0.356   0.519     0.201
SiamRPN++ [20]      0.414   0.600     0.234
UPDT [4]            0.378   0.536     0.184
LADCF [40]          0.389   0.503     0.159
ATOM [6]            0.401   0.590     0.204
DiMP-18 [3]         0.402   0.594     0.182
DiMP-50 [3]         0.440   0.597     0.153
FCOS-MAML           0.392   0.635     0.220
Retina-MAML         0.452   0.604     0.159
Table 5: Comparison with SOTA trackers on VOT-2018. The backbone used in our trackers is ResNet-18.

Evaluation on LaSOT and TrackingNet: TrackingNet [26] and LaSOT [8] are two large-scale datasets for visual tracking. The evaluation results on these two datasets are detailed in Table 6. Results show that FCOS-MAML performs favorably against SOTA trackers, although many of them use a more powerful backbone, ResNet-50. When compared with the recent DiMP-18 tracker, which uses the same backbone network as ours, FCOS-MAML shows a significant gain on TrackingNet and a slight loss on LaSOT. We suspect that our straightforward online updating strategy may not be suitable for the very long sequences which are often seen in LaSOT.

                    TrackingNet        LaSOT-test
Tracker             AUC     N-Prec.    AUC
C-RPN [9]           0.669   0.746      0.455
SiamRPN++ [20]      0.733   0.800      0.496
SPM [36]            0.712   0.779      0.471
ATOM [6]            0.703   0.771      0.515
DiMP-18 [3]         0.723   0.785      0.532
DiMP-50 [3]         0.740   0.801      0.569
FCOS-MAML           0.757   0.822      0.523
Retina-MAML         0.698   0.786      0.480
Table 6: Comparison with SOTA trackers on TrackingNet and LaSOT. We present the AUC of the success plot and the normalized precision (N-Prec.).

6. Conclusion
In this paper, we have proposed a three-step procedure to convert a general object detector into a tracker. Offline MAML training prepares the detector for quick domain adaptation as well as efficient online update. The resulting instance detector is an elegant template-free tracker which fully benefits from the advancement in object detection. While the two constructed trackers achieve competitive performance against SOTA trackers on datasets with short videos, their performance on LaSOT still has room for improvement. In the future, we plan to investigate the online updating strategy for long sequences.
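The three-step procedure (offline MAML training, domain adaptation on the initial frame, online updating) can be illustrated with a toy sketch. This is not the paper's detector or training recipe: a one-parameter regression model y = w * x stands in for the detector, each "task" is a randomly drawn slope standing in for a tracking target, and the meta-update uses a first-order approximation of MAML for brevity. All function names, learning rates, and sample values are assumptions made for this example.

```python
import random


def loss_and_grad(w, samples):
    """Mean squared error of the model y = w * x and its gradient w.r.t. w."""
    n = len(samples)
    loss = sum((w * x - y) ** 2 for x, y in samples) / n
    grad = sum(2.0 * x * (w * x - y) for x, y in samples) / n
    return loss, grad


def adapt(w, samples, lr=0.1, steps=5):
    """Few-step gradient descent starting from w (the inner loop)."""
    for _ in range(steps):
        _, grad = loss_and_grad(w, samples)
        w = w - lr * grad
    return w


def sample_task(rng, k=5):
    """Draw a toy task: a support set and a query set sharing one slope."""
    slope = rng.uniform(1.0, 3.0)

    def draw():
        xs = [rng.uniform(-1.0, 1.0) for _ in range(k)]
        return [(x, slope * x) for x in xs]

    return draw(), draw()


def meta_train(w=0.0, meta_lr=0.05, iters=200, tasks_per_iter=4, seed=0):
    """Step 1: learn an initialization that adapts well in a few GD steps."""
    rng = random.Random(seed)
    for _ in range(iters):
        meta_grad = 0.0
        for _ in range(tasks_per_iter):
            support, query = sample_task(rng)
            w_task = adapt(w, support)
            # First-order approximation: use the query gradient at the adapted
            # weight and ignore how w_task depends on w.
            _, grad = loss_and_grad(w_task, query)
            meta_grad += grad / tasks_per_iter
        w = w - meta_lr * meta_grad
    return w


# Step 1: offline meta-training.
w_init = meta_train()

# Step 2: domain adaptation on samples from the "initial frame" (slope 2.5).
first_frame = [(0.5, 1.25), (-1.0, -2.5)]
loss_before, _ = loss_and_grad(w_init, first_frame)
w_tracker = adapt(w_init, first_frame, steps=5)
loss_after, _ = loss_and_grad(w_tracker, first_frame)

# Step 3: optional online update with samples collected during "tracking".
later_frames = [(0.8, 2.0), (-0.4, -1.0)]
w_tracker = adapt(w_tracker, later_frames, steps=1)

print(loss_before, loss_after)
```

The same `adapt` routine serves both step 2 and step 3 here, reflecting that both are plain gradient descent from a prepared starting point; only the samples and the number of steps differ.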

References
[1] Antreas Antoniou, Harrison Edwards, and Amos Storkey. How to train your maml. arXiv preprint, 2018.
[2] Luca Bertinetto, Jack Valmadre, Joao F Henriques, Andrea Vedaldi, and Philip HS Torr. Fully-convolutional siamese networks for object tracking. In ECCV, pages 850–865, 2016.
[3] Goutam Bhat, Martin Danelljan, Luc Van Gool, and Radu Timofte. Learning discriminative model prediction for tracking. In ICCV, 2019.
[4] Goutam Bhat, Joakim Johnander, Martin Danelljan, Fahad Shahbaz Khan, and Michael Felsberg. Unveiling the power of deep tracking. In ECCV, pages 483–498, 2018.
[5] Janghoon Choi, Junseok Kwon, and Kyoung Mu Lee. Deep meta learning for real-time target-aware visual tracking. In ICCV, pages 911–920, 2019.
[6] Martin Danelljan, Goutam Bhat, Fahad Shahbaz Khan, and Michael Felsberg. Atom: Accurate tracking by overlap maximization. In CVPR, pages 4660–4669, 2019.
[7] Martin Danelljan, Goutam Bhat, Fahad Shahbaz Khan, and Michael Felsberg. Eco: Efficient convolution operators for tracking. In CVPR, pages 6638–6646, 2017.
[8] Heng Fan, Liting Lin, Fan Yang, Peng Chu, Ge Deng, Sijia Yu, Hexin Bai, Yong Xu, Chunyuan Liao, and Haibin Ling. Lasot: A high-quality benchmark for large-scale single object tracking. In CVPR, pages 5374–5383, 2019.
[9] Heng Fan and Haibin Ling. Siamese cascaded region proposal networks for real-time visual tracking. In CVPR, pages 7952–7961, 2019.
[10] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, pages 1126–1135, 2017.
[11] Anfeng He, Chong Luo, Xinmei Tian, and Wenjun Zeng. A twofold siamese network for real-time object tracking. In CVPR, pages 4834–4843, 2018.
[12] Lianghua Huang, Xin Zhao, and Kaiqi Huang. Got-10k: A large high-diversity benchmark for generic object tracking in the wild. arXiv preprint arXiv:1810.11981, 2018.
[13] Lianghua Huang, Xin Zhao, and Kaiqi Huang. Bridging the gap between detection and tracking: A unified approach. In ICCV, pages 3999–4009, 2019.
[14] Borui Jiang, Ruixuan Luo, Jiayuan Mao, Tete Xiao, and Yuning Jiang. Acquisition of localization confidence for accurate object detection. In ECCV, pages 784–799, 2018.
[15] Ilchae Jung, Jeany Son, Mooyeol Baek, and Bohyung Han. Real-time mdnet. In ECCV, pages 83–98, 2018.
[16] Ilchae Jung, Kihyun You, Hyeonwoo Noh, Minsu Cho, and Bohyung Han. Real-time object tracking via meta-learning: Efficient model adaptation and one-shot channel pruning. arXiv preprint arXiv:1911.11170, 2019.
[17] Hamed Kiani Galoogahi, Ashton Fagg, and Simon Lucey. Learning background-aware correlation filters for visual tracking. In ICCV, pages 1135–1143, 2017.
[18] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[19] Matej Kristan, Ales Leonardis, Jiri Matas, Michael Felsberg, Roman Pflugfelder, Luka Cehovin Zajc, Tomas Vojir, Goutam Bhat, Alan Lukezic, Abdelrahman Eldesokey, et al. The sixth visual object tracking vot2018 challenge results. In ECCV, 2018.
[20] Bo Li, Wei Wu, Qiang Wang, Fangyi Zhang, Junliang Xing, and Junjie Yan. Siamrpn++: Evolution of siamese visual tracking with very deep networks. In CVPR, pages 4282–4291, 2019.
[21] Bo Li, Junjie Yan, Wei Wu, Zheng Zhu, and Xiaolin Hu. High performance visual tracking with siamese region proposal network. In CVPR, pages 8971–8980, 2018.
[22] Peixia Li, Boyu Chen, Wanli Ouyang, Dong Wang, Xiaoyun Yang, and Huchuan Lu. Gradnet: Gradient-guided network for visual object tracking. In ICCV, pages 6162–6171, 2019.
[23] Zhenguo Li, Fengwei Zhou, Fei Chen, and Hang Li. Meta-sgd: Learning to learn quickly for few-shot learning. arXiv preprint, 2017.
[24] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In ICCV, pages 2980–2988, 2017.
[25] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, pages 740–755, 2014.
[26] Matthias Muller, Adel Bibi, Silvio Giancola, Salman Alsubaihi, and Bernard Ghanem. Trackingnet: A large-scale dataset and benchmark for object tracking in the wild. In ECCV, pages 300–317, 2018.
[27] Hyeonseob Nam, Mooyeol Baek, and Bohyung Han. Modeling and propagating cnns in a tree structure for visual tracking. arXiv preprint, 2016.
[28] Hyeonseob Nam and Bohyung Han. Learning multi-domain convolutional neural networks for visual tracking. In CVPR, pages 4293–4302, 2016.
[29] Eunbyung Park and Alexander C Berg. Meta-tracker: Fast and robust online adaptation for visual object trackers. In ECCV, pages 569–585, 2018.
[30] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS, pages 91–99, 2015.
[31] Yibing Song, Chao Ma, Lijun Gong, Jiawei Zhang, Rynson WH Lau, and Ming-Hsuan Yang. Crest: Convolutional residual learning for visual tracking. In ICCV, pages 2555–2564, 2017.
[32] Yibing Song, Chao Ma, Xiaohe Wu, Lijun Gong, Linchao Bao, Wangmeng Zuo, Chunhua Shen, Rynson WH Lau, and Ming-Hsuan Yang. Vital: Visual tracking via adversarial learning. In CVPR, pages 8990–8999, 2018.
[33] Chong Sun, Dong Wang, Huchuan Lu, and Ming-Hsuan Yang. Correlation tracking via joint discrimination and reliability learning. In CVPR, pages 489–497, 2018.
[34] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. Fcos: Fully convolutional one-stage object detection. In ICCV, 2019.

[35] Jack Valmadre, Luca Bertinetto, João Henriques, Andrea Vedaldi, and Philip HS Torr. End-to-end representation learning for correlation filter based tracking. In CVPR, pages 2805–2813, 2017.
[36] Guangting Wang, Chong Luo, Zhiwei Xiong, and Wenjun Zeng. Spm-tracker: Series-parallel matching for real-time visual object tracking. In CVPR, pages 3643–3652, 2019.
[37] Ning Wang, Wengang Zhou, Qi Tian, Richang Hong, Meng Wang, and Houqiang Li. Multi-cue correlation filters for robust visual tracking. In CVPR, pages 4844–4853, 2018.
[38] Qiang Wang, Zhu Teng, Junliang Xing, Jin Gao, Weiming Hu, and Stephen Maybank. Learning attentions: residual attentional siamese network for high performance online visual tracking. In CVPR, pages 4854–4863, 2018.
[39] Yi Wu, Jongwoo Lim, and Ming-Hsuan Yang. Object tracking benchmark. T-PAMI, 37(9):1834–1848, 2015.
[40] Tianyang Xu, Zhen-Hua Feng, Xiao-Jun Wu, and Josef Kittler. Learning adaptive discriminative correlation filters via temporal consistency preserving spatial feature selection for robust visual object tracking. TIP, 2019.
[41] Tianyu Yang and Antoni B Chan. Learning dynamic memory networks for object tracking. In ECCV, pages 152–167, 2018.
[42] Yingjie Yao, Xiaohe Wu, Lei Zhang, Shiguang Shan, and Wangmeng Zuo. Joint representation and truncated inference learning for correlation filter based tracking. In ECCV, pages 552–567, 2018.
[43] Zheng Zhu, Qiang Wang, Bo Li, Wei Wu, Junjie Yan, and Weiming Hu. Distractor-aware siamese networks for visual object tracking. In ECCV, pages 101–117, 2018.
