Tracking by Instance Detection - A Meta-Learning Approach
Guangting Wang¹, Chong Luo², Xiaoyan Sun², Zhiwei Xiong¹, Wenjun Zeng²
¹University of Science and Technology of China, ²Microsoft Research Asia
wgting96@gmail.com, {cluo, xysun, wezeng}@microsoft.com, zwxiong@ustc.edu.cn
Abstract
[13] shares a similar vision with us but they still treat tracking as a two-step task, namely class-level object detection and instance-level classification. In the first sub-task, a template image is involved and a separate branch is employed to process the template.

In this work, we are looking for a neat solution to realize our idea. The constructed tracker will just look like a normal detector, without additional branches or any other modifications to the network architecture. We find that model-agnostic meta learning (MAML) [10] offers a learning strategy with which a detector can be initialized as we desire. Based on MAML, we propose a three-step procedure to convert any modern detector into a high-performance tracker. First, pick any detector which is trained with gradient descent. Second, use MAML to train the detector on a large number of tracking sequences. Third, when the initial frame of a test sequence is given, fine-tune the detector with a few steps of gradient descent. A decent tracker can be obtained after this domain adaptation step. During tracking, when new appearances of the target are collected, the detector can be trained with more samples to achieve an even better adaptation capability.

Following the proposed procedure, we build two instance detectors, named Retina-MAML and FCOS-MAML, based on advanced object detectors RetinaNet [24] and FCOS [34]. During offline training, we further introduce a kernel-wise learnable learning rate in MAML to improve the expressive ability of gradient based updating. Evaluations of the trackers are carried out on four major benchmarks, including OTB, VOT, TrackingNet and LaSOT. System comparisons show that both trackers achieve competitive performance against state-of-the-art (SOTA) trackers. On OTB-100, Retina-MAML and FCOS-MAML appear to be the best-performing trackers, with AUCs of 0.712 and 0.704, respectively. Retina-MAML achieves an EAO of 0.452 on VOT-2018. FCOS-MAML achieves an AUC of 0.757 on TrackingNet, ranking number one on the leader board. Additionally, both trackers run in real-time at 40 FPS.

2. Related Work

2.1. CNN-based visual object tracking

With the great success of deep learning and convolutional neural networks (CNN) in various computer vision tasks, there emerge an increasing number of CNN-based trackers. We divide CNN-based trackers into two categories, depending on whether an explicit template is used.

Most siamese-network-based trackers [2, 21, 20, 36] fall into the first category, which we call template-based methods. The target appearance information is stored in an explicit template. In SiamFC [2], features are extracted from the template and the search region using the same offline-trained CNN. A cross-correlation operation is then adopted to compute the matching scores. A main drawback of SiamFC is that it only evaluates the candidates with the same shape as the initial box. SiamRPN [21] solves this problem by borrowing the RPN idea from object detectors. Later, SPM-Tracker [36] borrows the architecture from two-stage detectors and achieves an improved performance. Currently, the best-performing trackers in this category are ATOM [6] and DiMP [3], which leverage the most advanced IoUNet [14] for precise object localization.

Template-based methods usually run very fast, because the CNN used to extract features does not need to be online updated. However, as the tracking proceeds, new target appearances should be integrated into the template for a better performance. But most methods lack an effective model for template online updating. This limitation has created a performance ceiling for template-based trackers.

The other category is template-free methods [27, 28, 15], which intend to store the target appearance information within the neural network, in the form of fine-tuned parameters. The challenge in designing template-free trackers is how to quickly infuse the instance information to the network without overfitting. MDNet [28] divides the CNN into shared layers and domain-specific layers. The shared layers provide a reasonable initialization and the domain-specific layers are online trained with the new instance. Due to the limitation of the conventional training strategy, MDNet takes many iterations to converge and fewer iterations cause serious performance degradation. As a result, MDNet is too slow to be used in real-time scenarios.

We find that template-free trackers are neat solutions. They do not need to maintain an external template and the network architecture looks just like a detector. Domain adaptation and online update can be achieved by a unified online training procedure. However, it is still quite challenging to achieve a good performance-speed tradeoff for this type of trackers.

2.2. Meta learning and its application to tracking

The goal of meta-learning is to train a model on a variety of learning tasks, such that it can solve new learning tasks using only a small number of training samples [10]. When we view object tracking as an instance detection task, the tracker is trained on a variety of instance detection tasks so that it can quickly learn how to detect a new instance using only one or a few training samples from the initial or previous frames. We find that the tracking task is a perfect example to apply meta-learning.

Model-agnostic meta-learning (MAML) [10] is an important algorithm for meta learning. It helps the network to learn a set of good initialization parameters that are suitable for fine-tuning. During training, the parameters of the model are explicitly trained such that a small number of
gradient steps with a small amount of training data from a new task will produce good generalization performance on that task. The most striking merit of MAML is that it is compatible with any model trained with gradient descent and applicable to a variety of different learning problems. Because of this, MAML is a perfect candidate to realize our idea, which is to convert any advanced object detectors (trained with gradient descent) into a tracker. Later, MAML++ [1] introduces a set of tricks to stabilize the training of MAML. MetaSGD [23] proposes to train learnable learning rates for every parameter. In the area of object tracking, Meta-Tracker [29] is the first to use MAML for the domain adaptation step of MDNet. MetaRTT [16] further applies MAML for the online updating step. Basically, their main purpose is to accelerate the online training of existing trackers, including MDNet [28], CREST [31] and RT-MDNet [15]. We argue that, since meta learning provides a mechanism to quickly adapt a deep network to model a particular object and avoid overfitting, why not directly convert a modern object detector into a tracker, instead of making a slow tracker faster? Huang et al. [13] have the same idea. They propose to learn a meta layer in detection head by MAML. However, they still introduce a template in the first part of the tracker called class-level object detection. The complex design results in a slow speed.

3. Learning an Instance Detector with MAML

The key to convert a detector into an instance detector (a tracker) is to provide a good initialization of the detector, so that it can quickly adapt to a new instance when only the initial frame is available. In this section, we present the approach to learn an instance detector with MAML. The complete steps to construct a tracker will be detailed in the next section. The training data in this learning step are videos with ground-truth labeling of the target object on each frame.

Formally, given a video $V_i$, we collect a set of training samples, denoted by $D_i^s$. It is also called the support set in meta learning. A detector model is defined as $h(x; \theta_0)$, where $x$ is the input image and $\theta_0$ is the parameters of the detector. We update the detector on the support set by a k-step gradient descent (GD) algorithm:

$$\theta_k \equiv \mathrm{GD}_k(\theta_0, D_i^s), \quad \text{and} \quad \theta_k = \theta_{k-1} - \alpha \frac{1}{|D_i^s|} \sum_{(x,y) \in D_i^s} \nabla_{\theta_{k-1}} L(h(x; \theta_{k-1}), y), \tag{1}$$

where $L$ is the loss function and $(x, y)$ is a data-label pair in the support set. The procedure in Eqn. (1) is called inner-level optimization. To evaluate the generalization ability of the trained detector, we collect another set of samples $D_i^t$ from the same video $V_i$ and they are called the target set. We calculate the loss on the target set by applying the trained detector, which can be written as:

$$F(\theta_0, D_i) = \frac{1}{|D_i^t|} \sum_{(x,y) \in D_i^t} L(h(x; \theta_k), y) \tag{2}$$
where $D_i = \{D_i^s, D_i^t\}$ denotes the combined support set and target set. The overall training objective is to find a good initialization status $\theta_0$ for any tracking video. It can be formulated as:

$$\theta^* = \arg\min_{\theta_0} \frac{1}{N} \sum_{i}^{N} F(\theta_0, D_i), \tag{3}$$

where $N$ is the total number of videos. The procedure in Eqn. (3) is called the outer-level optimization, which can be solved by gradient-based methods like Adam [18]. The outer-level gradients are back-propagated through the inner-level computational graph. The only assumption about the detector $h$ is that it is differentiable. Therefore, this approach is readily applicable to most deep learning based detectors.

Fig. 2 illustrates this training pipeline. In the training phase, we only sample a pair of images from the dataset. Following the practice in DaSiamRPN [43], these two images may come from either the same sequence or different sequences. The first image will be zoomed in/out by a constant factor (1.08 in our experiments) so that a support set with three images is constructed for the inner-level optimization. The second image is viewed as the target set with a single image for calculating the outer-level loss. We use a 4-step GD for the inner-level optimization and the Adam solver [18] for the outer-level optimization. To stabilize the training and strengthen the power of the detector, we make the following modifications to the original MAML algorithm.

Multi-step loss optimization. MAML++ [1] proposes to take the parameters after every step of inner-level GD to minimize the loss on the target set, instead of only using the parameters after the final step. Mathematically, Eqn. (2) can be re-written into:

$$F(\theta_0, D_i) = \frac{1}{|D_i^t|} \sum_{(x,y) \in D_i^t} \sum_{k=0}^{K} \gamma_k L(h(x; \theta_k), y), \tag{4}$$

where $K$ is the number of inner-level steps and $\gamma_k$ is the loss weight for each step. Note that our formulation is slightly different from that in MAML++. The initialization parameter $\theta_0$ (before updating) also contributes to the outer-level loss. In our experiments, we find this trick is crucial for stabilizing the gradients.

Kernel-wise learnable learning rate. In standard MAML, the learning rate $\alpha$ in the inner-level optimization is a predefined constant. MetaSGD [23] proposes to specify a learnable learning rate for each parameter in the model. Therefore, the GD algorithm in Eqn. (1) can be re-written into:

$$\theta_{k+1} = \theta_k - \alpha \odot \frac{1}{|D_i^s|} \sum_{(x,y) \in D_i^s} \nabla_{\theta_k} L(h(x; \theta_k), y), \tag{5}$$

where $\alpha$ is a tensor which has the same size as $\theta_k$. Notation $\odot$ denotes the element-wise product. However, setting up a learning rate for every parameter will double the model size. In contrast, we arrange the learnable learning rates in a kernel-wise manner. Specifically, for a convolution layer with $C_{out}$ output channels, we define a learning rate for each convolutional kernel and this only introduces an additional number of $C_{out}$ learnable parameters, which are negligible in the model.

4. Retina-MAML and FCOS-MAML

This section provides the details of the proposed three-step procedure to build a tracker. Specifically, we will present detector choices, offline training details, and the online tracking process for two trackers named Retina-MAML and FCOS-MAML.

4.1. Detectors

As MAML is a model-agnostic learning approach, we are free to choose any modern detector trained with gradient descent as the base to build a tracker. As the first attempt in this direction, we choose two single-stage detectors, which run faster and are easier to manipulate than their two-stage counterparts. However, we do not see any obstacles in using two-stage detectors in our approach.

Single-stage detectors are usually composed of a backbone network and two heads, namely a classification head and a regression head. The backbone network generates feature maps for the input image. Based on the feature maps, the objects are scored and localized.

RetinaNet [24] is a representative single-stage object detector. Each pixel in the feature maps is associated with several predefined prior boxes, or anchors. The classification head is trained to classify whether each anchor has a sufficient overlap with an object. The regression head is trained to predict the relative differences between each anchor and the corresponding ground-truth box. Similar designs can be found in many existing detectors, which are grouped into a family of anchor-based detectors.

Recently, the concept of anchor-free detection has received a lot of attention. As the name suggests, no anchor is defined. FCOS [34] is a representative detector in this category. After the backbone network generates the feature maps, the classification head is trained to classify whether each pixel in the feature maps is within the central area of an object. Meanwhile, the regression head directly estimates the four offsets from the pixel to the object boundaries. Fig. 3 depicts the core design difference between anchor-free and anchor-based detectors.

Next, we make some simplifications to the chosen detectors RetinaNet and FCOS. These simplifications improve the tracker's speed but will not affect the tracking performance. We believe so because visual object tracking is per-
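As a rough illustration of the inner-level update in Eqn. (1) and the multi-step target loss in Eqn. (4) from Section 3, the following Python sketch replaces the detector with a toy one-parameter linear model h(x; θ) = θx and uses a squared-error loss. The model, the loss, the data, and all function names are assumptions made for illustration only. The outer-level gradient, which would require automatic differentiation through the inner steps, is not shown.

```python
def mse_loss(theta, data):
    """Mean squared error of the toy model h(x; theta) = theta * x."""
    return sum((theta * x - y) ** 2 for x, y in data) / len(data)


def mse_grad(theta, data):
    """Analytic gradient of mse_loss with respect to theta."""
    return sum(2.0 * (theta * x - y) * x for x, y in data) / len(data)


def inner_gd(theta0, support, alpha, k):
    """Eqn. (1): k gradient steps on the support set.

    Returns [theta_0, theta_1, ..., theta_k] so that every intermediate
    parameter can contribute to the multi-step loss of Eqn. (4).
    """
    thetas = [theta0]
    for _ in range(k):
        current = thetas[-1]
        thetas.append(current - alpha * mse_grad(current, support))
    return thetas


def multi_step_target_loss(thetas, target, gammas):
    """Eqn. (4): weighted sum of target-set losses over theta_0..theta_K."""
    if len(thetas) != len(gammas):
        raise ValueError("need one loss weight per parameter snapshot")
    return sum(g * mse_loss(t, target) for g, t in zip(gammas, thetas))


# Toy data (hypothetical): both sets follow y = 2x.
support_set = [(1.0, 2.0), (2.0, 4.0)]
target_set = [(3.0, 6.0)]

snapshots = inner_gd(theta0=0.0, support=support_set, alpha=0.1, k=4)
final_only = mse_loss(snapshots[-1], target_set)          # Eqn. (2)
weighted = multi_step_target_loss(snapshots, target_set, [0.2] * 5)
```

With these toy values the four inner steps move θ from 0 to 1.875, and the target loss shrinks at every step, which is the behavior the outer-level objective rewards.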
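The kernel-wise variant of Eqn. (5) can be sketched in plain Python as follows. A convolution layer is represented here as a list of C_out flattened kernels with one learning rate per kernel. The helper names `kernelwise_step` and `extra_rate_count`, and the layer sizes in the example, are hypothetical and only serve to show how few additional parameters the kernel-wise arrangement needs compared with one rate per weight.

```python
def kernelwise_step(weights, grads, alphas):
    """One update theta <- theta - alpha (.) grad with a rate per kernel.

    weights, grads: list of C_out kernels, each a flat list of floats.
    alphas: list of C_out learning rates, one per output channel.
    """
    if not (len(weights) == len(grads) == len(alphas)):
        raise ValueError("weights, grads and alphas must share C_out")
    return [
        [w - a * g for w, g in zip(kernel, kernel_grad)]
        for kernel, kernel_grad, a in zip(weights, grads, alphas)
    ]


def extra_rate_count(c_out, c_in, kernel_size):
    """Number of learnable rates: (per-parameter, kernel-wise)."""
    per_parameter = c_out * c_in * kernel_size * kernel_size
    per_kernel = c_out
    return per_parameter, per_kernel


# Two kernels with two weights each (toy sizes).
updated = kernelwise_step(
    weights=[[1.0, 2.0], [3.0, 4.0]],
    grads=[[1.0, 1.0], [1.0, 1.0]],
    alphas=[0.5, 0.25],
)
# A 3x3 convolution with 256 input and 256 output channels.
per_parameter, per_kernel = extra_rate_count(256, 256, 3)
```

For the 3x3 layer in the example, a per-parameter scheme would add 589,824 rates while the kernel-wise scheme adds 256.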
Figure 3: (a) Anchor-based detectors predict the relative differences between the anchor and the ground-truth box. The dotted yellow box represents an anchor. (b) Anchor-free detectors directly estimate four offsets from the pixel to object boundaries.

Figure 4: We adopt ResNet-18 as the backbone. The first three blocks are frozen after ImageNet pre-training and block-5 is removed. Block-4 is independently trained in the classification branch and the regression branch during offline training. Online training only involves a subset of trainable layers.
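The two regression targets contrasted in Figure 3 can be written down in a few lines of Python. The anchor-free target follows the four-offset description given for FCOS. For the anchor-based target, the text only says "relative differences", so the center-shift and log-scale encoding used below is an assumption (a common R-CNN-style parameterization), and both function names are hypothetical.

```python
import math


def anchor_free_targets(px, py, box):
    """Offsets (left, top, right, bottom) from a pixel to the box sides.

    box is given as corner coordinates (x1, y1, x2, y2).
    """
    x1, y1, x2, y2 = box
    return (px - x1, py - y1, x2 - px, y2 - py)


def anchor_based_targets(anchor, box):
    """Relative differences between an anchor and a ground-truth box.

    Both boxes are (center_x, center_y, width, height). The encoding
    (normalized center shift, log size ratio) is an assumed convention.
    """
    ax, ay, aw, ah = anchor
    gx, gy, gw, gh = box
    return (
        (gx - ax) / aw,
        (gy - ay) / ah,
        math.log(gw / aw),
        math.log(gh / ah),
    )


offsets = anchor_free_targets(50, 40, (30, 20, 90, 70))
deltas = anchor_based_targets((50, 50, 20, 40), (60, 30, 40, 40))
```

The anchor-free form needs no predefined boxes at all, which is the design difference the figure highlights.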
Algorithm 1 Online tracking algorithm
Input: Frame sequence $\{I_i\}_{i=1}^{N}$, detector $h(\cdot; \theta)$, initialization bounding box $B_1$, update interval $u$.
Output: Tracking results $\{B_i\}_{i=1}^{N}$
 1: Generate search region image. $S_1 \leftarrow \mathrm{SR}(I_1, B_1)$
 2: Initialize the support set. $D^s \leftarrow \{\mathrm{DataAug}(S_1)\}$
 3: Model update in Eqn. (1). $\theta \leftarrow \mathrm{GD}_5(\theta, D^s)$
 4: for $i = 2, ..., N$ do
 5:   Detect objects represented in bounding box and score. $\{B_{det}^j, c_j\}_{j=1}^{M} \leftarrow h(\mathrm{SR}(I_i, B_{i-1}); \theta)$
 6:   if all $c_j < 0.1$ then
 7:     $B_i \leftarrow B_{i-1}$
 8:     continue
 9:   end if
10:   Add penalties and window priors to $\{B_{det}^j, c_j\}_{j=1}^{M}$
11:   Select the box with the highest score $c^*$. $B_i \leftarrow B_{det}^*$

Detector   Domain adaptation   OTB-100 (AUC)   VOT-18 (EAO)   LaSOT (AUC)   TrackingNet (AUC)
Baseline   before              0.460           0.137          0.391         0.601
Baseline   after               0.487           0.174          0.391         0.634
MAML       before              0.464           0.162          0.387         0.626
MAML       after               0.671           0.341          0.511         0.743

Table 1: MAML training allows a detector to quickly adapt to a new domain, and therefore is the key in turning a detector into a tracker.

[Figure 5. (a) Training loss and test loss curves for MAML and Baseline. (b) Visualization on the test image at MAML step 5 and MAML step 20.]

Figure 5: Comparison of the MAML detector and the baseline detector during domain adaptation. (a) Quantitative losses on the training image and a testing image. (b) Visualization of the corresponding score maps. MAML detector converges quickly and has strong generalization ability.

given bounding box. As with the offline training, we also adopt zoom in/out data augmentation to construct the support set. The tracker is updated by a 5-step GD as described in Eqn. (5).

After domain adaptation, the detector is now capable of tracking the target object in subsequent frames. For each search region patch, the detector locates hundreds of candidate bounding boxes, which are then passed to a standard post-processing pipeline as suggested in SiamRPN [21]. Specifically, a shape penalty function and a cosine window function are applied to each candidate. Finally, the candidate box with the highest score is selected as the tracking result and its shape is smoothed by a linear interpolation with the result in the previous frame.

During tracking, the support set is gradually enlarged. The tracker can be online trained at a pre-defined interval based on the updated support set. This process is often called online updating in tracking. If a tracking result has a score above a predefined threshold, it will be added into the support set. We buffer at most 30 training images in the support set. Earlier samples, except the initial one, will be discarded when the number of images exceeds the limit. After every n frames (n = 10 in our implementation) or when a distracting peak is detected (when the peak-to-sidelobe ratio is greater than 0.7), we perform online updating. In this case, we only use 1-step GD to maintain a high tracking speed. On average, our tracker can run at 40 FPS on a single NVIDIA P100 GPU card. The online tracking procedure is summarized in Alg. 1.

KLLR in (cls. reg.)   OTB-100 (AUC)   VOT-18 (EAO)   LaSOT (AUC)   TrackingNet (AUC)
-                     0.628           0.313          0.490         0.733
X                     0.661           0.368          0.502         0.737
X                     0.676           0.315          0.504         0.744
X X                   0.704           0.392          0.523         0.757

Table 2: Ablation analysis of kernel-wise learnable learning rate. Cls. and reg. denote the classification branch and the regression branch, respectively.

5. Experiments

5.1. Ablation study

Meta-learning is the key in turning a detector into a tracker. In a nutshell, an instance detector can be built by of-
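The support-set bookkeeping described for online updating (a buffer of at most 30 images that never drops the initial sample, plus an update every n = 10 frames or on a distracting peak) might look like the following Python sketch. The class and function names are hypothetical, the score threshold is left as a parameter because the text does not give its value, and counting the initial sample toward the 30-image limit is an assumption.

```python
class SupportSet:
    """Bounded sample buffer that always keeps the initial sample."""

    def __init__(self, initial_sample, capacity=30):
        self.initial = initial_sample
        self.recent = []          # later samples, oldest first
        self.capacity = capacity  # assumed to include the initial sample

    def add(self, sample, score, threshold):
        """Store a tracking result only if its score exceeds the threshold."""
        if score <= threshold:
            return False
        self.recent.append(sample)
        overflow = 1 + len(self.recent) - self.capacity
        if overflow > 0:
            del self.recent[:overflow]  # discard the earliest later samples
        return True

    def samples(self):
        return [self.initial] + self.recent


def should_update(frame_index, interval=10, distractor_detected=False):
    """Update every `interval` frames or when a distracting peak is seen."""
    return distractor_detected or frame_index % interval == 0


# Small capacity so the eviction is visible.
buffer = SupportSet("frame-1", capacity=3)
buffer.add("frame-2", score=0.9, threshold=0.5)
buffer.add("frame-3", score=0.2, threshold=0.5)  # rejected, score too low
buffer.add("frame-4", score=0.8, threshold=0.5)
buffer.add("frame-5", score=0.7, threshold=0.5)  # evicts frame-2
```

Keeping the initial sample out of the eviction path reflects the stated rule that earlier samples are discarded "except the initial one".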
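Steps 6 to 11 of Alg. 1, together with the shape smoothing mentioned in the text, can be sketched as a single selection function. This is a minimal illustration under stated assumptions: candidates are taken to be already re-scored by the penalties and window priors, boxes are (center_x, center_y, width, height) tuples, and the interpolation weight `momentum` is a hypothetical parameter whose value is not given in the text.

```python
def select_box(candidates, prev_box, min_score=0.1, momentum=0.5):
    """Pick the tracking result for one frame.

    candidates: list of (box, score) pairs; box = (cx, cy, w, h).
    prev_box: result of the previous frame, same box format.
    """
    # Steps 6-8: if every score is below the floor, keep the previous box.
    if all(score < min_score for _, score in candidates):
        return prev_box

    # Step 11: take the highest-scoring candidate.
    (cx, cy, w, h), _ = max(candidates, key=lambda pair: pair[1])

    # Smooth only the shape by linear interpolation with the previous result.
    smooth_w = (1.0 - momentum) * prev_box[2] + momentum * w
    smooth_h = (1.0 - momentum) * prev_box[3] + momentum * h
    return (cx, cy, smooth_w, smooth_h)


previous = (0, 0, 2, 4)
kept = select_box([((10, 10, 4, 4), 0.05)], previous)
chosen = select_box([((10, 10, 4, 8), 0.9), ((5, 5, 2, 2), 0.3)], previous)
```

The position comes directly from the best candidate, while the width and height move only part of the way toward it, which damps sudden shape changes between frames.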
Online (cls. reg.)   OTB-100 (AUC)   VOT-18 (EAO)   TrackingNet (AUC)   LaSOT (AUC)   Speed (FPS)
-                    0.671           0.341          0.743               0.511         85
X                    0.690           0.394          0.747               0.523         58
X X                  0.704           0.392          0.757               0.496         42

Table 3: Ablation analysis of the online updating strategy. The baseline tracker without online updating achieves a good performance-speed tradeoff. Online updating both branches is the best choice for tracking short sequences.

[Figure 6, left panel: Success plots of OPE on OTB-100 (success rate vs. overlap threshold). Legend: Retina-MAML [0.712], FCOS-MAML [0.704], SiamRPN++ [0.696], ECO [0.691], SPM [0.687], DiMP [0.686], VITAL [0.682], MDNet [0.678], ATOM [0.667], MetaTracker [0.658].]
[Figure 6, right panel: Precision plots of OPE on OTB-100 (precision vs. location error threshold). Legend: Retina-MAML [0.926], VITAL [0.918], SiamRPN++ [0.915], ECO [0.910], MDNet [0.909], FCOS-MAML [0.905], SPM [0.899], DiMP [0.899], MetaTracker [0.880], ATOM [0.879].]

Figure 6: The success plot and precision plot on OTB-100.
                     AUC score (OPE)                  Speed
Tracker              OTB-2013   OTB-50    OTB-100    (FPS)
CFNet [35]           0.611      0.530     0.568      75
BACF [17]            0.656      0.570     0.621      35
ECO-hc [7]           0.652      0.592     0.643      60
MCCT-hc [37]         0.664      -         0.642      45
ECO [7]              0.709      0.648     0.687      8
RTINet [42]          -          0.637     0.682      9
MCCT [37]            0.714      -         0.695      8
SiamFC [2]           0.607      0.516     0.582      86
SA-Siam [11]         0.677      0.610     0.657      50
RASNet [38]          0.670      -         0.642      83
SiamRPN [21]         0.658      0.592     0.637      200
C-RPN [9]            0.675      -         0.663      23
SPM [36]             0.693      0.653     0.687      120
SiamRPN++ [20]       0.691      0.662     0.696      35
Meta-Tracker [29]    0.684      0.627     0.658      -
MemTracker [41]      0.642      0.610     0.626      50
UnifiedDet [13]      0.656      -         0.647      3
MLT [5]              0.621      -         0.611      48
GradNet [22]         0.670      0.597     0.639      80
MDNet [28]           0.708      0.645     0.678      1
VITAL [32]           0.710      0.657     0.682      2
ATOM [6]             -          0.628     0.671      30
DiMP [3]             0.691      0.654     0.684      43
FCOS-MAML            0.714      0.665     0.704      42
Retina-MAML          0.709      0.676     0.712      40

Table 4: Comparison with SOTA trackers on OTB dataset. Trackers are grouped into CF-based methods, siamese-network-based methods, meta-learning-based methods, and miscellaneous. Numbers in red and blue are the best and the second best results, respectively.

In this table, Meta-Tracker and UnifiedDet are two recent trackers which also use MAML to assist online training. Compared with them, our trackers achieve over 8% relative gain in AUC and still run in real-time. For the first time, meta-learning-based methods are shown to be very competitive against the mainstream solutions. The detailed success plot and precision plot on OTB-100 are shown in Fig. 6.

Tracker              EAO      Accuracy   Robustness
DRT [33]             0.356    0.519      0.201
SiamRPN++ [20]       0.414    0.600      0.234
UPDT [4]             0.378    0.536      0.184
LADCF [40]           0.389    0.503      0.159
ATOM [6]             0.401    0.590      0.204
DiMP-18 [3]          0.402    0.594      0.182
DiMP-50 [3]          0.440    0.597      0.153
FCOS-MAML            0.392    0.635      0.220
Retina-MAML          0.452    0.604      0.159

Table 5: Comparison with SOTA trackers on VOT-2018. The backbone used in our trackers is ResNet-18.

Evaluation on VOT: Our trackers are tested on the VOT-2018 benchmark [19] in comparison with six SOTA trackers. We follow the official evaluation protocol and adopt Expected Average Overlap (EAO), Accuracy, and Robustness as the metrics. The results are reported in Table 5. Retina-MAML achieves the top-ranked performance on the EAO criterion and FCOS-MAML also shows a strong performance. Interestingly, FCOS-MAML has the highest accuracy score among all the trackers. We have observed a similar phenomenon in Fig. 6 for the OTB dataset. FCOS-MAML gets the highest success rates when the overlap threshold is greater than 0.7. This suggests that anchor-free detectors can predict very precise bounding boxes.

                     TrackingNet          LaSOT-test
Tracker              AUC      N-Prec.     AUC
C-RPN [9]            0.669    0.746       0.455
SiamRPN++ [20]       0.733    0.800       0.496
SPM [36]             0.712    0.779       0.471
ATOM [6]             0.703    0.771       0.515
DiMP-18 [3]          0.723    0.785       0.532
DiMP-50 [3]          0.740    0.801       0.569
FCOS-MAML            0.757    0.822       0.523
Retina-MAML          0.698    0.786       0.480

Table 6: Comparison with SOTA trackers on TrackingNet and LaSOT. We present the AUC of the success plot and the normalized precision (N-prec.).

Evaluation on LaSOT and TrackingNet: TrackingNet [26] and LaSOT [8] are two large-scale datasets for visual tracking. The evaluation results on these two datasets are detailed in Table 6. Results show that FCOS-MAML performs favorably against SOTA trackers, although many of them are using a more powerful backbone, ResNet-50. When compared with the recent DiMP-18 tracker, which uses the same backbone network as ours, FCOS-MAML shows a significant gain on TrackingNet and a slight loss on LaSOT. We suspect that our straightforward online updating strategy may not be suitable for very long sequences, which are often seen in LaSOT.

6. Conclusion

In this paper, we have proposed a three-step procedure to convert a general object detector into a tracker. Offline MAML training prepares the detector for quick domain adaptation as well as efficient online update. The resulting instance detector is an elegant template-free tracker which fully benefits from the advancement in object detection. While the two constructed trackers achieve competitive performance against SOTA trackers in datasets with short videos, their performance on LaSOT still has room for improvement. In the future, we plan to investigate the online updating strategy for long sequences.
References
[1] Antreas Antoniou, Harrison Edwards, and Amos Storkey.
How to train your maml. arXiv preprint, 2018.
[2] Luca Bertinetto, Jack Valmadre, Joao F Henriques, Andrea
Vedaldi, and Philip HS Torr. Fully-convolutional siamese
networks for object tracking. In ECCV, pages 850–865,
2016.
[3] Goutam Bhat, Martin Danelljan, Luc Van Gool, and Radu
Timofte. Learning discriminative model prediction for
tracking. In ICCV, 2019.
[4] Goutam Bhat, Joakim Johnander, Martin Danelljan, Fahad
Shahbaz Khan, and Michael Felsberg. Unveiling the power
of deep tracking. In ECCV, pages 483–498, 2018.
[5] Janghoon Choi, Junseok Kwon, and Kyoung Mu Lee. Deep
meta learning for real-time target-aware visual tracking. In
ICCV, pages 911–920, 2019.
[6] Martin Danelljan, Goutam Bhat, Fahad Shahbaz Khan, and
Michael Felsberg. Atom: Accurate tracking by overlap
maximization. In CVPR, pages 4660–4669, 2019.
[7] Martin Danelljan, Goutam Bhat, Fahad Shahbaz Khan, and
Michael Felsberg. Eco: Efficient convolution operators for
tracking. In CVPR, pages 6638–6646, 2017.
[8] Heng Fan, Liting Lin, Fan Yang, Peng Chu, Ge Deng, Sijia
Yu, Hexin Bai, Yong Xu, Chunyuan Liao, and Haibin Ling.
Lasot: A high-quality benchmark for large-scale single
object tracking. In CVPR, pages 5374–5383, 2019.
[9] Heng Fan and Haibin Ling. Siamese cascaded region
proposal networks for real-time visual tracking. In CVPR,
pages 7952–7961, 2019.
[10] Chelsea Finn, Pieter Abbeel, and Sergey Levine.
Model-agnostic meta-learning for fast adaptation of deep
networks. In ICML, pages 1126–1135, 2017.
[11] Anfeng He, Chong Luo, Xinmei Tian, and Wenjun Zeng. A
twofold siamese network for real-time object tracking. In
CVPR, pages 4834–4843, 2018.
[12] Lianghua Huang, Xin Zhao, and Kaiqi Huang. Got-10k: A
large high-diversity benchmark for generic object tracking in
the wild. arXiv preprint arXiv:1810.11981, 2018.
[13] Lianghua Huang, Xin Zhao, and Kaiqi Huang. Bridging the
gap between detection and tracking: A unified approach. In
ICCV, pages 3999–4009, 2019.
[14] Borui Jiang, Ruixuan Luo, Jiayuan Mao, Tete Xiao, and
Yuning Jiang. Acquisition of localization confidence for
accurate object detection. In ECCV, pages 784–799, 2018.
[15] Ilchae Jung, Jeany Son, Mooyeol Baek, and Bohyung Han.
Real-time mdnet. In ECCV, pages 83–98, 2018.
[16] Ilchae Jung, Kihyun You, Hyeonwoo Noh, Minsu Cho, and
Bohyung Han. Real-time object tracking via meta-learning:
Efficient model adaptation and one-shot channel pruning.
arXiv preprint arXiv:1911.11170, 2019.
[17] Hamed Kiani Galoogahi, Ashton Fagg, and Simon Lucey.
Learning background-aware correlation filters for visual
tracking. In ICCV, pages 1135–1143, 2017.
[18] Diederik P Kingma and Jimmy Ba. Adam: A method for
stochastic optimization. arXiv preprint arXiv:1412.6980,
2014.
[19] Matej Kristan, Ales Leonardis, Jiri Matas, Michael
Felsberg, Roman Pflugfelder, Luka Cehovin Zajc, Tomas Vojir,
Goutam Bhat, Alan Lukezic, Abdelrahman Eldesokey, et al.
The sixth visual object tracking vot2018 challenge results.
In ECCV, pages 0–0, 2018.
[20] Bo Li, Wei Wu, Qiang Wang, Fangyi Zhang, Junliang Xing,
and Junjie Yan. Siamrpn++: Evolution of siamese visual
tracking with very deep networks. In CVPR, pages
4282–4291, 2019.
[21] Bo Li, Junjie Yan, Wei Wu, Zheng Zhu, and Xiaolin Hu.
High performance visual tracking with siamese region
proposal network. In CVPR, pages 8971–8980, 2018.
[22] Peixia Li, Boyu Chen, Wanli Ouyang, Dong Wang, Xiaoyun
Yang, and Huchuan Lu. Gradnet: Gradient-guided network
for visual object tracking. In ICCV, pages 6162–6171, 2019.
[23] Zhenguo Li, Fengwei Zhou, Fei Chen, and Hang Li.
Meta-sgd: Learning to learn quickly for few-shot learning.
arXiv preprint, 2017.
[24] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and
Piotr Dollár. Focal loss for dense object detection. In ICCV,
pages 2980–2988, 2017.
[25] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays,
Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence
Zitnick. Microsoft coco: Common objects in context. In
ECCV, pages 740–755, 2014.
[26] Matthias Muller, Adel Bibi, Silvio Giancola, Salman
Alsubaihi, and Bernard Ghanem. Trackingnet: A large-scale
dataset and benchmark for object tracking in the wild. In
ECCV, pages 300–317, 2018.
[27] Hyeonseob Nam, Mooyeol Baek, and Bohyung Han.
Modeling and propagating cnns in a tree structure for visual
tracking. arXiv preprint, 2016.
[28] Hyeonseob Nam and Bohyung Han. Learning multi-domain
convolutional neural networks for visual tracking. In CVPR,
pages 4293–4302, 2016.
[29] Eunbyung Park and Alexander C Berg. Meta-tracker: Fast
and robust online adaptation for visual object trackers. In
ECCV, pages 569–585, 2018.
[30] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun.
Faster r-cnn: Towards real-time object detection with region
proposal networks. In NIPS, pages 91–99, 2015.
[31] Yibing Song, Chao Ma, Lijun Gong, Jiawei Zhang,
Rynson WH Lau, and Ming-Hsuan Yang. Crest: Convolutional
residual learning for visual tracking. In ICCV, pages
2555–2564, 2017.
[32] Yibing Song, Chao Ma, Xiaohe Wu, Lijun Gong, Linchao
Bao, Wangmeng Zuo, Chunhua Shen, Rynson WH Lau, and
Ming-Hsuan Yang. Vital: Visual tracking via adversarial
learning. In CVPR, pages 8990–8999, 2018.
[33] Chong Sun, Dong Wang, Huchuan Lu, and Ming-Hsuan
Yang. Correlation tracking via joint discrimination and
reliability learning. In CVPR, pages 489–497, 2018.
[34] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. Fcos:
Fully convolutional one-stage object detection. In ICCV,
2019.
[35] Jack Valmadre, Luca Bertinetto, João Henriques, Andrea
Vedaldi, and Philip HS Torr. End-to-end representation
learning for correlation filter based tracking. In CVPR, pages
2805–2813, 2017.
[36] Guangting Wang, Chong Luo, Zhiwei Xiong, and Wenjun
Zeng. Spm-tracker: Series-parallel matching for real-time
visual object tracking. In CVPR, pages 3643–3652, 2019.
[37] Ning Wang, Wengang Zhou, Qi Tian, Richang Hong, Meng
Wang, and Houqiang Li. Multi-cue correlation filters for ro-
bust visual tracking. In CVPR, pages 4844–4853, 2018.
[38] Qiang Wang, Zhu Teng, Junliang Xing, Jin Gao, Weiming
Hu, and Stephen Maybank. Learning attentions: residual
attentional siamese network for high performance online vi-
sual tracking. In CVPR, pages 4854–4863, 2018.
[39] Yi Wu, Jongwoo Lim, and Ming-Hsuan Yang. Object track-
ing benchmark. T-PAMI, 37(9):1834–1848, 2015.
[40] Tianyang Xu, Zhen-Hua Feng, Xiao-Jun Wu, and Josef Kit-
tler. Learning adaptive discriminative correlation filters via
temporal consistency preserving spatial feature selection for
robust visual object tracking. TIP, 2019.
[41] Tianyu Yang and Antoni B Chan. Learning dynamic memory
networks for object tracking. In ECCV, pages 152–167,
2018.
[42] Yingjie Yao, Xiaohe Wu, Lei Zhang, Shiguang Shan, and
Wangmeng Zuo. Joint representation and truncated inference
learning for correlation filter based tracking. In ECCV, pages
552–567, 2018.
[43] Zheng Zhu, Qiang Wang, Bo Li, Wei Wu, Junjie Yan, and
Weiming Hu. Distractor-aware siamese networks for visual
object tracking. In ECCV, pages 101–117, 2018.