Multi Object Detecting

The document presents ByteTrack, a novel multi-object tracking method that associates every detection box, including low-score ones, to improve tracking performance and reduce missing detections. By leveraging similarities with tracklets, ByteTrack achieves significant improvements in IDF1 scores across various state-of-the-art trackers, demonstrating its effectiveness on the MOT17 and MOT20 datasets. The method emphasizes the importance of utilizing all detection boxes to enhance tracking accuracy, particularly in challenging scenarios like occlusion.

ByteTrack: Multi-Object Tracking by Associating Every Detection Box

Yifu Zhang1∗, Peize Sun2∗, Yi Jiang3, Dongdong Yu3, Zehuan Yuan3, Ping Luo2, Wenyu Liu1, Xinggang Wang1†
1 Huazhong University of Science and Technology, 2 The University of Hong Kong, 3 ByteDance
arXiv:2110.06864v2 [cs.CV] 14 Oct 2021

Abstract

Multi-object tracking (MOT) aims at estimating the bounding boxes and identities of objects in videos. Most methods obtain identities by associating detection boxes whose scores are higher than a threshold. Objects with low detection scores, e.g. occluded objects, are simply thrown away, which leads to non-negligible missed true objects and fragmented trajectories. To solve this problem, we present a simple, effective and generic association method that tracks by associating every detection box instead of only the high score ones. For the low score detection boxes, we utilize their similarities with tracklets to recover true objects and filter out the background detections. When applied to 9 different state-of-the-art trackers, our method achieves consistent improvement on IDF1 score ranging from 1 to 10 points. To advance the state-of-the-art performance of MOT, we design a simple and strong tracker, named ByteTrack. For the first time, we achieve 80.3 MOTA, 77.3 IDF1 and 63.1 HOTA on the test set of MOT17 with 30 FPS running speed on a single V100 GPU. The source code, pre-trained models with deploy versions and tutorials of applying to other trackers are released at https://github.com/ifzhang/ByteTrack.

∗ Equal contribution. † Corresponding author.

Figure 1. MOTA-IDF1-FPS comparisons of different trackers. The horizontal axis is FPS (running speed), the vertical axis is MOTA, and the radius of circle is IDF1. Our ByteTrack achieves 80.3 MOTA, 77.3 IDF1 on MOT17 test set with 30 FPS running speed, outperforming all previous trackers. Details are given in Table 6.

1. Introduction

Was vernünftig ist, das ist wirklich; und was wirklich ist, das ist vernünftig.
-- G. W. F. Hegel

Tracking by detection is currently the most effective paradigm for multi-object tracking (MOT). Due to the complex scenarios in video, detectors are prone to make imperfect predictions. State-of-the-art MOT methods [3, 17, 44, 2, 69, 7, 67, 12, 4, 81, 56] need to deal with the true positive / false positive trade-off in detection boxes to eliminate low confidence detection boxes [5, 39]. However, is it the right way to eliminate all low confidence detection boxes? Our answer is NO: as Hegel said “What is reasonable is real; that which is real is reasonable.” Low confidence detection boxes sometimes indicate the existence of objects, e.g. the occluded objects. Filtering out these objects causes irreversible errors for MOT and leads to non-negligible missing detections and fragmented trajectories.

Figure 2 (a) and (b) show this problem. In frame t1, we initialize three different tracklets as their scores are all higher than 0.5. However, in frame t2 and frame t3 when occlusion happens, the red tracklet’s corresponding detection score becomes lower, i.e. 0.8 to 0.4 and then 0.4 to 0.1. These detection boxes are eliminated by the thresholding mechanism and the red tracklet disappears accordingly. Nevertheless, if we take every detection box into consideration, more false positives will be introduced immediately,
e.g., the rightmost box in frame t3 of Figure 2 (a). To the best of our knowledge, very few methods [29, 60] in MOT are able to handle this detection dilemma.

[Figure 2: frames t1, t2 and t3 with per-box detection scores. Panels: (a) detection boxes; (b) tracklets by associating high score detection boxes; (c) tracklets by associating every detection box.]

Figure 2. Examples of our method which associates every detection box. (a) shows all the detection boxes with their scores. (b) shows the tracklets obtained by previous methods which associate detection boxes whose scores are higher than a threshold, i.e. 0.5. The same box color represents the same identity. (c) shows the tracklets obtained by our method. The dashed boxes represent the predicted boxes of the previous tracklets using Kalman Filter. The two low score detection boxes are correctly matched to the previous tracklets.

In this paper, we identify that the similarity with tracklets provides a strong cue to distinguish the objects and background in low score detection boxes. As shown in Figure 2 (c), two low score detection boxes are matched to the tracklets by the motion model’s predicted boxes, and thus the objects are correctly recovered. At the same time, the background box is removed since it has no matched tracklet.

To make full use of detection boxes from high scores to low ones in the matching process, we present a simple and effective association method BYTE, so named because each detection box is a basic unit of the tracklet, like a byte in a computer program, and our tracking method values every detailed detection box. We first match the high score detection boxes to the tracklets based on motion similarity. Similar to [7], we use Kalman Filter [28] to predict the location of the tracklets in the new frame. The motion similarity can be computed by the IoU of the predicted box and the detection box. Figure 2 (b) shows exactly the results after the first matching. Then, we perform the second matching between the unmatched tracklets, i.e. the tracklet in the red box, and the low score detection boxes. Figure 2 (c) shows the results after the second matching. The occluded person with low detection scores is matched correctly to the previous tracklet and the background is removed.

To evaluate the generalization ability of our proposed association method, we apply it to 9 different state-of-the-art trackers, including the Re-ID-based ones [66, 81, 32, 46], motion-based ones [85, 68, 47], chain-based one [47] and attention-based ones [56, 76]. We achieve notable improvements on almost all the metrics including MOTA, IDF1 score and ID switches. For example, we increase the MOTA of CenterTrack [85] from 66.1 to 67.4, IDF1 from 64.2 to 74.0 and decrease the IDs from 528 to 144 on the half validation set of MOT17 [85].

To push forward the state-of-the-art performance of MOT, we propose a simple and strong tracker, named ByteTrack. We adopt a recent high-performance detector YOLOX [24] to obtain the detection boxes and associate them with our proposed BYTE. On the MOT challenges, ByteTrack ranks 1st on both MOT17 [43] and MOT20 [16], achieving 80.3 MOTA, 77.3 IDF1 and 63.1 HOTA with 30 FPS running speed on V100 GPU on MOT17 and 77.8 MOTA, 75.2 IDF1 and 61.3 HOTA on the crowded MOT20.

Our proposed method is the first work that achieves highly competitive tracking performance with an extremely simple motion model, without any Re-ID module or attention mechanisms [81, 32, 46, 65, 76, 56]. It sheds light on the great potential of motion cues for handling occlusion and long-range association. We hope the efficiency and simplicity of ByteTrack could make it attractive in real applications.

2. Related Work

Object detection and data association are two key components of multi-object tracking. Detection estimates the bounding boxes and association obtains the identities.

2.1. Object Detection in MOT

Object detection is one of the most active topics in computer vision and it is the basis of multi-object tracking. The MOT17 dataset [43] provides detection results obtained by popular detectors such as DPM [21], Faster R-CNN [49] and SDP [74]. A large number of methods [71, 14, 4, 12, 88, 9, 27] focus on improving the tracking performance based on these given detection results. The association ability of these methods can be fairly compared.

Tracking by detection. With the rapid development of object detection [49, 26, 48, 34, 10, 22, 57, 55], more and more methods begin to use more powerful detectors to obtain higher tracking performance. The one-stage object detector RetinaNet [34] has begun to be used by several methods such as
[38, 47]. CenterNet [86] is the most popular detector, used by most methods [85, 81, 68, 83, 65, 60, 63] for its simplicity and efficiency. The YOLO series detectors [48, 8] are also used by a large number of methods [66, 32, 33, 15] for their excellent balance of accuracy and speed. Most of these methods directly use the detection boxes on a single image for tracking.

However, the number of missing detections and very low scoring detections begins to increase when occlusion or motion blur happens in the video sequence, as is pointed out by video object detection methods [59, 40]. Therefore, the information of the previous frames is usually leveraged to enhance the video detection performance.

Detection by tracking. Tracking can also be adopted to help obtain more accurate detection boxes. Some methods [52, 88, 14, 13, 15, 12] use single object tracking (SOT) [6] or Kalman Filter [28] to predict the location of the tracklets in the following frame and fuse the predicted boxes with the detection boxes to enhance the detection results. Other methods [82, 33] use tracked boxes in the previous frames to enhance the feature representation of the following frame. Recently, Transformer-based [61, 19, 64, 37] detectors [11, 89] are used by several methods [56, 41, 76] for their strong ability to propagate boxes between frames. Our method also utilizes the similarity with tracklets to strengthen the reliability of detection boxes.

After obtaining the detection boxes by various detectors, most MOT methods [66, 81, 46, 38, 32, 68, 56] only keep the high score detection boxes by a threshold, i.e. 0.5, and use those boxes as the input of data association. This is because the low score detection boxes contain many background boxes which harm the tracking performance. However, we observe that many occluded objects can be correctly detected but have low scores. To reduce missing detections and keep the persistence of trajectories, we keep all the detection boxes and associate across all of them.

2.2. Data Association

Data association is the core of multi-object tracking, which first computes the similarity between tracklets and detection boxes and then matches them according to the similarity.

Similarity metrics. Location, motion and appearance are useful cues for association. SORT [7] combines location and motion cues in a very simple way. It first uses Kalman Filter [28] to predict the location of the tracklets in the new frame and then computes the IoU between the detection boxes and the predicted boxes as the similarity. Some recent methods [85, 56, 68] design networks to learn object motions and achieve more robust results in cases of large camera motion or low frame rate. Location and motion similarity are accurate in short-range matching. Appearance similarity is helpful in long-range matching. An object can be re-identified using appearance similarity after being occluded for a long period of time. Appearance similarity can be measured by the cosine similarity of the Re-ID features. DeepSORT [67] adopts a stand-alone Re-ID model to extract appearance features from the detection boxes. Recently, joint detection and Re-ID models [66, 81, 32, 38, 80, 46] have become more and more popular because of their simplicity and efficiency.

Matching strategy. After similarity computation, the matching strategy assigns identities to the objects. This can be done by Hungarian Algorithm [30] or greedy assignment [85]. SORT [7] matches the detection boxes to the tracklets in a single matching step. DeepSORT [67] proposes a cascaded matching strategy which first matches the detection boxes to the most recent tracklets and then to the lost ones. MOTDT [12] first uses appearance similarity to match and then uses the IoU similarity to match the unmatched tracklets. QuasiDense [46] turns the appearance similarity into probability by a bi-directional softmax operation and uses a nearest neighbor search to accomplish matching. Attention mechanism [61] can directly propagate boxes between frames and perform association implicitly. Recent methods such as [41, 76] propose track queries to find the location of the tracked objects in the following frames. The matching is implicitly performed in the attention interaction process.

All these methods focus on how to design better association methods. However, we argue that the detection boxes determine the upper bound of data association and we focus on how to make use of detection boxes from high scores to low ones in the matching process.

3. BYTE

We propose a simple, effective and generic data association method, BYTE. Different from previous methods [66, 81, 32, 46] which only keep the high score detection boxes, we keep every detection box and separate them into high score ones and low score ones. We first associate the high score detection boxes to the tracklets. Some tracklets remain unmatched because they do not match an appropriate high score detection box, which usually happens when occlusion, motion blur or size changing occurs. We then associate the low score detection boxes with these unmatched tracklets to recover the objects in low score detection boxes and filter out background simultaneously. The pseudo-code of BYTE is shown in Algorithm 1.

The input of BYTE is a video sequence V, along with an object detector Det and the Kalman Filter KF. We also set three thresholds τ_high, τ_low and ε. τ_high and τ_low are the detection score thresholds and ε is the tracking score threshold. The output of BYTE is the tracks T of the video and each track contains the bounding box and identity of the
object in each frame.

Algorithm 1: Pseudo-code of BYTE.
Input: A video sequence V; object detector Det; Kalman Filter KF; detection score thresholds τ_high, τ_low; tracking score threshold ε
Output: Tracks T of the video
 1  Initialization: T ← ∅
 2  for frame f_k in V do
        /* Figure 2(a) */
        /* predict detection boxes & scores */
 3      D_k ← Det(f_k)
 4      D_high ← ∅
 5      D_low ← ∅
 6      for d in D_k do
 7          if d.score > τ_high then
 8              D_high ← D_high ∪ {d}
 9          end
10          else if d.score > τ_low then
11              D_low ← D_low ∪ {d}
12          end
13      end
        /* predict new locations of tracks */
14      for t in T do
15          t ← KF(t)
16      end
        /* Figure 2(b) */
        /* first association */
17      Associate T and D_high using IoU distance
18      D_remain ← remaining object boxes from D_high
19      T_remain ← remaining tracks from T
        /* Figure 2(c) */
        /* second association */
20      Associate T_remain and D_low using IoU distance
21      T_re-remain ← remaining tracks from T_remain
        /* delete unmatched tracks */
22      T ← T \ T_re-remain
        /* initialize new tracks */
23      for d in D_remain do
24          if d.score > ε then
25              T ← T ∪ {d}
26          end
27      end
28  end
29  Return: T
Track rebirth [67, 85] is not shown in the algorithm for simplicity. In green is the key of our method.

For each frame in the video, we predict the detection boxes and scores using the detector Det. We separate all the detection boxes into two parts D_high and D_low according to the detection score thresholds τ_high and τ_low. For the detection boxes whose scores are higher than τ_high, we put them into the high score detection boxes D_high. For those whose scores range from τ_low to τ_high, we put them into the low score detection boxes D_low (line 3 to 13 in Algorithm 1).

After separating the low score detection boxes and the high score detection boxes, we use Kalman Filter KF to predict the new locations of each track in T (line 14 to 16 in Algorithm 1).

The first association is performed between the high score detection boxes D_high and all the tracks T (including the lost tracks T_lost). The similarity is computed by the IoU between the detection boxes D_high and the predicted boxes of tracks T. Then, we use Hungarian Algorithm [30] to finish the matching based on the similarity. In particular, if the IoU between the detection box and the tracklet box is smaller than 0.2, we reject the matching. We keep the unmatched detections in D_remain and the unmatched tracks in T_remain (line 17 to 19 in Algorithm 1).

BYTE is highly flexible and compatible with other association methods. For example, when BYTE is combined with DeepSORT [67], Re-ID features are added into the /* first association */ step of Algorithm 1, while the other steps remain the same. In the experiments, we apply BYTE to 9 different state-of-the-art trackers and achieve notable improvements on almost all the metrics.

The second association is performed between the low score detection boxes D_low and the remaining tracks T_remain after the first association. We keep the unmatched tracks in T_re-remain and just delete all the unmatched low score detection boxes, since we view them as background (line 20 to 21 in Algorithm 1).

We find it important to use IoU as the similarity in the second association because the low score detection boxes usually contain severe occlusion or motion blur and appearance features are not reliable. Thus, when applying BYTE to other Re-ID based trackers [66, 81, 46], we do not use appearance similarity in the second association.

After the association, the unmatched tracks will be deleted from the tracklets. We do not list the procedure of track rebirth [67, 12, 85] in Algorithm 1 for simplicity. In practice, track rebirth is necessary for long-range association to preserve the identity of the tracks. For the unmatched tracks T_re-remain after the second association, we put them into T_lost. Each track in T_lost is deleted from the tracks T only when it has been lost for more than a certain number of frames, i.e. 30. Otherwise, we keep the lost tracks T_lost in T (line 22 in Algorithm 1).

Finally, we initialize new tracks from the unmatched high score detection boxes D_remain after the first association. For each detection box in D_remain, if its detection score is higher than ε and it exists for two consecutive frames, we initialize a new track (line 23 to 27 in Algorithm 1).

The output of each individual frame is the bounding boxes and identities of the tracks T in the current frame. Note that we do not output the boxes and identities of T_lost.
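The two association steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the released implementation: the names `Box`, `iou`, `greedy_match` and `byte_associate` are hypothetical, the track boxes are assumed to be already predicted by the Kalman Filter, and a greedy IoU matcher is swapped in for the Hungarian Algorithm to keep the sketch dependency-free. The thresholds 0.6, 0.1 and 0.2 follow the defaults quoted in the text.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned box (hypothetical helper type for this sketch)."""
    x1: float
    y1: float
    x2: float
    y2: float
    score: float = 1.0


def iou(a, b):
    """Intersection over union of two boxes."""
    iw = max(0.0, min(a.x2, b.x2) - max(a.x1, b.x1))
    ih = max(0.0, min(a.y2, b.y2) - max(a.y1, b.y1))
    inter = iw * ih
    union = (a.x2 - a.x1) * (a.y2 - a.y1) + (b.x2 - b.x1) * (b.y2 - b.y1) - inter
    return inter / union if union > 0 else 0.0


def greedy_match(tracks, dets, min_iou=0.2):
    """Greedy IoU matching; an assumed stand-in for the Hungarian Algorithm."""
    pairs = sorted(
        ((iou(t, d), ti, di) for ti, t in enumerate(tracks) for di, d in enumerate(dets)),
        reverse=True,
    )
    used_t, used_d, matches = set(), set(), []
    for s, ti, di in pairs:
        if s < min_iou:  # pairs below the IoU gate are rejected
            break
        if ti in used_t or di in used_d:
            continue
        used_t.add(ti)
        used_d.add(di)
        matches.append((ti, di))
    unmatched_t = [i for i in range(len(tracks)) if i not in used_t]
    unmatched_d = [i for i in range(len(dets)) if i not in used_d]
    return matches, unmatched_t, unmatched_d


def byte_associate(predicted_tracks, detections, tau_high=0.6, tau_low=0.1):
    """Return (matches, lost track indices, unmatched high score boxes)."""
    d_high = [d for d in detections if d.score > tau_high]
    d_low = [d for d in detections if tau_low < d.score <= tau_high]

    # First association: all tracks vs. high score boxes.
    m1, t_remain_idx, d_remain_idx = greedy_match(predicted_tracks, d_high)
    matches = [(ti, d_high[di]) for ti, di in m1]

    # Second association: remaining tracks vs. low score boxes (IoU only).
    t_remain = [predicted_tracks[i] for i in t_remain_idx]
    m2, t_rr_idx, _ = greedy_match(t_remain, d_low)
    matches += [(t_remain_idx[ti], d_low[di]) for ti, di in m2]

    lost = [t_remain_idx[i] for i in t_rr_idx]
    # Unmatched low score boxes are dropped as background.
    new_candidates = [d_high[i] for i in d_remain_idx]
    return matches, lost, new_candidates


# Toy frame loosely modelled on Figure 2: track 1 is occluded (score 0.4)
# and one low score background box overlaps no track.
tracks = [Box(0, 0, 10, 20), Box(20, 0, 30, 20), Box(40, 0, 50, 20)]
dets = [Box(0, 0, 10, 20, 0.9), Box(21, 0, 31, 20, 0.4),
        Box(40, 0, 50, 20, 0.8), Box(80, 0, 90, 20, 0.15)]
matches, lost, new_candidates = byte_associate(tracks, dets)
```

In this toy frame the occluded box is recovered by the second association, while setting `tau_low` equal to `tau_high` (which empties the low score set) leaves track 1 unmatched, mirroring the behaviour of a high-score-only tracker.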

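The track bookkeeping described in Section 3 (keeping lost tracks for a limited number of frames and creating new tracks from unmatched high score boxes) can be sketched as follows. This is a hedged illustration only: `Track`, `update_lifecycle`, `frames_lost` and `next_id` are hypothetical names, the values 30 and 0.7 are the defaults quoted in the paper, and the requirement that a new box exists for two consecutive frames is deliberately omitted to keep the sketch short.

```python
from dataclasses import dataclass


@dataclass
class Track:
    """Minimal track record (hypothetical; real trackers also store box and motion state)."""
    track_id: int
    frames_lost: int = 0


def update_lifecycle(tracks, matched_ids, new_det_scores, next_id,
                     epsilon=0.7, max_lost=30):
    """Apply one frame of track bookkeeping.

    tracks:         all tracks, including currently lost ones
    matched_ids:    ids of tracks matched in either association step
    new_det_scores: scores of unmatched high score boxes (D_remain)
    Returns (kept tracks, tracks to output this frame, next free id).
    """
    kept = []
    for t in tracks:
        if t.track_id in matched_ids:
            t.frames_lost = 0
            kept.append(t)
        else:
            t.frames_lost += 1
            # A lost track is removed only after more than max_lost frames.
            if t.frames_lost <= max_lost:
                kept.append(t)

    # New tracks come only from unmatched high score boxes above epsilon.
    for score in new_det_scores:
        if score > epsilon:
            kept.append(Track(next_id))
            next_id += 1

    # Lost tracks stay in memory but are not part of the frame output.
    active = [t for t in kept if t.frames_lost == 0]
    return kept, active, next_id
```

Keeping lost tracks in the pool is what allows a later detection to be matched back to the same identity; only the `active` list corresponds to the per-frame output described in the text.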
4. ByteTrack

To advance the state-of-the-art performance of MOT, we design a simple and strong tracker, named ByteTrack, by equipping the high-performance detector YOLOX [24] with our association method BYTE.

YOLOX switches the YOLO series detectors [48, 8] to an anchor-free manner and adopts other advanced detection techniques, including decoupled heads, strong data augmentations such as Mosaic [8] and Mixup [77], and the effective label assignment strategy SimOTA [23], to achieve state-of-the-art performance on object detection.

The backbone network is the same as YOLOv5 [1], which adopts an advanced CSPNet [62] backbone and an additional PAN [36] head. There are two decoupled heads after the backbone network, one for regression and the other for classification. An additional IoU-aware branch is added to the regression head to predict the IoU between the predicted box and the ground truth box. The regression head directly predicts four values at each location in the feature map, i.e., two offsets in terms of the left-top corner of the grid, and the height and width of the predicted box. The regression head is supervised by GIoU loss [50] and the classification and IoU heads are supervised by the binary cross entropy loss.

The SimOTA label assignment strategy automatically selects positive samples according to their cost to the ground truth annotations. The cost is computed by a weighted sum of the classification cost and the box location cost [87, 11, 55]. Then, it selects a number of dynamic top-k positive samples from a fixed size of areas around the object center according to their cost. The advanced label assignment strategy notably increases the detection performance.

We note that MOT17 [43] requires the bounding boxes [85] to cover the whole body, even when the object is occluded or partly out of the image. However, the default implementation of YOLOX clips the detection boxes inside the image area. To avoid wrong detection results around the image boundary, we modify YOLOX in terms of data pre-processing and label assignment. We do not clip the bounding boxes inside the image during the data pre-processing and data augmentation procedure. We only delete the boxes which are fully outside the image after data augmentation. In the SimOTA label assignment strategy, the positive samples need to be around the center of the object, while the center of the whole body boxes may lie out of the image, so we clip the center of the object inside the image.

MOT20 [16] clips the bounding box annotations inside the image, and thus we just use the original YOLOX.

5. Experiments

5.1. Setting

Datasets. We evaluate BYTE and ByteTrack on MOT17 [43] and MOT20 [16] datasets under the “private detection” protocol. Both datasets contain training sets and test sets, without validation sets. For ablation studies, we use the first half of each video in the training set of MOT17 for training and the last half for validation following [85]. We train on the combination of CrowdHuman dataset [54] and MOT17 half training set following [85, 56, 76, 68]. We add Cityperson [78] and ETHZ [20] for training following [66, 81, 32] when testing on the test set of MOT17.

Metrics. We use the CLEAR metrics [5], including MOTA, FP, FN, IDs, etc., IDF1 [51] and HOTA [39] to evaluate different aspects of the tracking performance. MOTA is computed based on FP, FN and IDs. Considering that the amounts of FP and FN are larger than IDs, MOTA focuses more on the detection performance. IDF1 evaluates the identity preservation ability and focuses more on the association performance. HOTA is a very recently proposed metric which explicitly balances the effect of performing accurate detection, association and localization.

Implementation details. For BYTE, the default high detection score threshold τ_high is 0.6, the low threshold τ_low is 0.1 and the trajectory initialization score ε is 0.7, unless otherwise specified. In the linear assignment step, if the IoU between the detection box and the tracklet box is smaller than 0.2, the matching will be rejected. For the lost tracklets, we keep them for 30 frames in case they appear again.

For ByteTrack, the detector is YOLOX [24] with YOLOX-X as the backbone and COCO-pretrained model [35] as the initialized weights. The training schedule is 80 epochs on the combination of MOT17, CrowdHuman, Cityperson and ETHZ. The input image size is 1440 × 800 and the shortest side ranges from 576 to 1024 during multi-scale training. The data augmentation includes Mosaic [8] and Mixup [77]. The model is trained on 8 NVIDIA Tesla V100 GPUs with batch size of 48. The optimizer is SGD with weight decay of 5 × 10^−4 and momentum of 0.9. The initial learning rate is 10^−3 with 1 epoch warm-up and cosine annealing schedule. The total training time is about 12 hours. Following [24], FPS is measured with FP16-precision [42] and batch size of 1 on a single GPU.

5.2. Ablation Studies on BYTE

Comparisons with other association methods. We compare BYTE with other popular association methods including SORT [7], DeepSORT [67] and MOTDT [12]. The results are shown in Table 1.

SORT can be seen as our baseline method because both
methods only use Kalman Filter to predict the object motion. We can see that BYTE improves the MOTA metric of SORT from 74.6 to 76.6, IDF1 from 76.9 to 79.3 and decreases IDs from 291 to 159. This highlights the importance of the low score detection boxes and proves the ability of BYTE to recover object boxes from low score ones.

Method        w/ Re-ID   MOTA↑   IDF1↑   IDs↓   FPS
SORT                     74.6    76.9    291    30.1
DeepSORT      ✓          75.4    77.2    239    13.5
MOTDT         ✓          75.8    77.6    273    11.1
BYTE (ours)              76.6    79.3    159    29.6

Table 1. Comparison of different data association methods on the MOT17 validation set. The best results are shown in bold.

Figure 3. Comparison of the performances of BYTE and SORT under different detection score thresholds. The results are from the validation set of MOT17.

DeepSORT uses additional Re-ID models to enhance the long-range association. We surprisingly find BYTE also has additional gains compared with DeepSORT. This suggests a simple Kalman Filter can perform long-range association and achieve better IDF1 and IDs when the detection boxes are accurate enough. We note that in severe occlusion cases, Re-ID features are vulnerable and may lead to more identity switches, whereas the motion model behaves more reliably.

MOTDT integrates motion-guided box propagation results and detection results to associate unreliable detection results with tracklets. Although it shares a similar motivation, MOTDT is behind BYTE by a large margin. Our explanation is that MOTDT uses propagated boxes as tracklet boxes, which may lead to localization drift in tracking. Instead, BYTE uses low-score detection boxes to re-associate those unmatched tracklets, and therefore the tracklet boxes are more accurate.

Robustness to detection score threshold. The detection score threshold τ_high is a sensitive hyper-parameter and needs to be carefully tuned in the task of multi-object tracking. We change it from 0.2 to 0.8 and compare the MOTA and IDF1 score of BYTE and SORT. The results are shown in Fig 3. From the results we can see that BYTE is more robust to the detection score threshold than SORT. This is because the second association in BYTE recovers the objects whose scores are lower than τ_high, and thus considers every detection box regardless of the change of τ_high.

Analysis on low score detection boxes. To prove the effectiveness of BYTE, we collect the number of TPs and FPs in the low score boxes obtained by BYTE. We use the half training set of MOT17 and CrowdHuman for training and evaluate on the half validation set of MOT17. First, we keep all the low score detection boxes whose scores range from τ_low to τ_high and classify the TPs and FPs using ground truth annotations. Then, we select the tracking results obtained by BYTE from low score detection boxes. The results of each sequence are shown in Fig 4. We can see that BYTE obtains notably more TPs than FPs from the low score detection boxes even though some sequences have many more FPs among all the detection boxes. The obtained TPs notably increase MOTA from 74.6 to 76.6, as is shown in Table 1.

Applications on other trackers. We apply BYTE on 9 different state-of-the-art trackers, including JDE [66], CSTrack [32], FairMOT [81], TraDes [68], QuasiDense [46], CenterTrack [85], Chained-Tracker [47], TransTrack [56] and MOTR [76]. Among these trackers, JDE, CSTrack, FairMOT and TraDes use a combination of motion and Re-ID similarity. QuasiDense uses Re-ID similarity alone. CenterTrack and TraDes predict the motion similarity by the learned networks. Chained-Tracker adopts the chain structure, outputs the results of two consecutive frames simultaneously and associates in the same frame by IoU. TransTrack and MOTR use the attention mechanism to propagate boxes among frames. Their results are shown in the first line of each tracker in Table 2. To evaluate the effectiveness of BYTE, we design two different modes to apply BYTE to these trackers.

• The first mode is to insert BYTE into the original association methods of different trackers, as is shown in the second line of the results of each tracker in Table 2. Take FairMOT [81] for example: after the original association is done, we select all the unmatched tracklets and associate them with the low score detection boxes following the /* second association */ step in Algorithm 1. Note that for the low score objects, the Re-ID features are not reliable so we only use the IoU between the detection boxes and the tracklet boxes after motion prediction as the similarity. We do not apply the first mode of BYTE to Chained-Tracker because we find it is difficult to implement in the chain structure.

• The second mode is to directly use the detection boxes of these trackers and associate using the whole procedure in Algorithm 1, as is shown in the third line of the results of each tracker in Table 2.
Figure 4. Comparison of the number of TPs and FPs in all low score detection boxes and the low score tracked boxes obtained by BYTE. The results are from the validation set of MOT17.

Method                 Similarity          w/ BYTE   MOTA↑         IDF1↑         FP↓    FN↓     IDs↓
JDE [66]               Motion(K) + Re-ID             60.0          63.6          2923   18158   473
                       Motion(K) + Re-ID   ✓         60.3 (+0.3)   64.1 (+0.5)   3065   17912   418
                       Motion(K)           ✓         60.6 (+0.6)   66.0 (+2.4)   3082   17771   360
CSTrack [32]           Motion(K) + Re-ID             68.0          72.3          1846   15075   325
                       Motion(K) + Re-ID   ✓         69.2 (+1.2)   73.9 (+1.6)   2160   14128   285
                       Motion(K)           ✓         69.3 (+1.3)   71.7 (-0.6)   2202   14068   279
FairMOT [81]           Motion(K) + Re-ID             69.1          72.8          1976   14443   299
                       Motion(K) + Re-ID   ✓         70.4 (+1.3)   74.2 (+1.4)   2288   13470   232
                       Motion(K)           ✓         70.3 (+1.2)   73.2 (+0.4)   2189   13625   236
TraDes [68]            Motion + Re-ID                68.2          71.7          1913   14962   285
                       Motion + Re-ID      ✓         68.6 (+0.4)   71.1 (-0.6)   2253   14419   259
                       Motion(K)           ✓         67.9 (-0.3)   72.0 (+0.3)   1822   15345   178
QuasiDense [46]        Re-ID                         67.3          67.8          2637   14605   377
                       Motion(K) + Re-ID   ✓         67.7 (+0.4)   72.0 (+4.2)   2280   14856   281
                       Motion(K)           ✓         67.9 (+0.6)   70.9 (+3.1)   2310   14746   258
CenterTrack [85]       Motion                        66.1          64.2          2442   15286   528
                       Motion              ✓         66.3 (+0.2)   64.8 (+0.6)   2376   15445   334
                       Motion(K)           ✓         67.4 (+1.3)   74.0 (+9.8)   1778   15641   144
Chained-Tracker [47]   Chain                         63.1          60.9          2955   16174   755
                       Motion(K)           ✓         65.0 (+1.9)   66.7 (+5.8)   3303   15206   346
TransTrack [56]        Attention                     67.1          68.3          1652   15817   254
                       Attention           ✓         68.6 (+1.5)   69.0 (+0.7)   2151   14515   232
                       Motion(K)           ✓         68.3 (+1.2)   72.4 (+4.1)   1692   15189   181
MOTR [76]              Attention                     64.7          67.2          5278   13452   346
                       Attention           ✓         64.3 (-0.4)   69.3 (+2.1)   5787   13220   263
                       Motion(K)           ✓         65.7 (+1.0)   68.4 (+1.2)   1607   16651   260

Table 2. Results of applying BYTE to 9 different state-of-the-art trackers on the MOT17 validation set. “K” is short for Kalman Filter. In green are the improvements of at least +1.0 point.
We can see that in both modes, BYTE brings stable improvements over almost all the metrics, including MOTA, IDF1 and IDs. For example, BYTE increases CenterTrack by 1.3 MOTA and 9.8 IDF1, Chained-Tracker by 1.9 MOTA and 5.8 IDF1, and TransTrack by 1.2 MOTA and 4.1 IDF1. The results in Table 2 indicate that BYTE has strong generalization ability and can be easily applied to existing trackers to obtain performance gains.

5.3. Ablation Studies on ByteTrack

Speed vs. accuracy. We evaluate the speed and accuracy of ByteTrack using different input image sizes during inference. All experiments use the same multi-scale training. The results are shown in Table 3. The input size during inference ranges from 512 × 928 to 800 × 1440. The running time of the detector ranges from 17.9 ms to 30.0 ms and the association time is always around 4.0 ms. ByteTrack achieves 75.0 MOTA at 45.7 FPS and 76.6 MOTA at 29.6 FPS, which is an advantage in practical applications.

Input size    MOTA↑  IDF1↑  IDs↓  Time (ms)
512 × 928     75.0   77.6   200   17.9+4.0
608 × 1088    75.6   76.4   212   21.8+4.0
736 × 1280    76.2   77.4   188   26.2+4.2
800 × 1440    76.6   79.3   159   29.6+4.2

Table 3. Comparison of different input sizes on the MOT17 validation set. The total running time is a combination of the detection time and the association time. The best results are shown in bold.

Training data. We evaluate ByteTrack on the half validation set of MOT17 using different combinations of training data. The results are shown in Table 4. When only using the half training set of MOT17, the performance reaches 75.8 MOTA, which already outperforms most methods. This is because we use strong augmentations such as Mosaic [8] and Mixup [77]. When further adding CrowdHuman, Cityperson and ETHZ for training, we achieve 76.7 MOTA and 79.7 IDF1. The large improvement in IDF1 arises because the CrowdHuman dataset helps the detector recognize occluded people, which in turn makes the Kalman Filter generate smoother predictions and enhances the association ability of the tracker.

The experiments on training data suggest that ByteTrack is not data hungry. This is a big advantage for real applications, compared with previous methods [81, 32, 63, 33] that require more than 7 data sources [43, 20, 78, 70, 84, 18, 54] to achieve high performance.

Training data     Images  MOTA↑  IDF1↑  IDs↓
MOT17             2.7K    75.8   76.5   205
MOT17 + CH        22.0K   76.6   79.3   159
MOT17 + CH + CE   26.6K   76.7   79.7   183

Table 4. Comparison of different training data on the MOT17 validation set. “MOT17” is short for the MOT17 half training set. “CH” is short for the CrowdHuman dataset. “CE” is short for the Cityperson and ETHZ datasets. The best results are shown in bold.

Visualization results. We show some visualization results of difficult cases which ByteTrack is able to handle in Figure 5. We select 6 sequences from the half validation set of MOT17 and generate the visualization results using the model with 76.6 MOTA and 79.3 IDF1. The difficult cases include occlusion (i.e. MOT17-02, MOT17-04, MOT17-05, MOT17-09, MOT17-13), motion blur (i.e. MOT17-10, MOT17-13) and small objects (i.e. MOT17-13). The pedestrian marked with a red triangle in the middle frame has a low detection score and is recovered by our association method BYTE. The low score boxes not only decrease the number of missing detections, but also play an important role in long-range association. As we can see from all these difficult cases, ByteTrack does not introduce any identity switch and preserves the identity effectively.

Tracklet interpolation. We notice that there are some fully-occluded pedestrians in MOT17, whose visible ratio is 0 in the ground truth annotations. Since it is almost impossible to detect them by visual cues, we obtain these objects by tracklet interpolation.

Suppose we have a tracklet $T$ whose tracklet box is lost due to occlusion from frame $t_1$ to $t_2$. The tracklet box of $T$ at frame $t_1$ is $B_{t_1} \in \mathbb{R}^4$, which contains the top left and bottom right coordinates of the bounding box. Let $B_{t_2}$ represent the tracklet box of $T$ at frame $t_2$. We set a hyper-parameter $\sigma$ representing the maximum interval over which we perform tracklet interpolation, which means tracklet interpolation is performed when $t_2 - t_1 \le \sigma$. The interpolated box of tracklet $T$ at frame $t$ can be computed as follows:

$$B_t = B_{t_1} + (B_{t_2} - B_{t_1}) \frac{t - t_1}{t_2 - t_1}, \tag{1}$$

where $t_1 < t < t_2$.

Interval  MOTA↑  IDF1↑  FP↓   FN↓   IDs↓
No        76.6   79.3   3358  9081  159
10        77.4   79.7   3638  8403  150
20        78.3   80.2   3941  7606  146
30        78.3   80.2   4237  7337  147

Table 5. Comparison of different interpolation intervals on the MOT17 validation set. The best results are shown in bold.

As shown in Table 5, tracklet interpolation improves MOTA from 76.6 to 78.3 and IDF1 from 79.3 to 80.2 when $\sigma$ is 20. Tracklet interpolation is an effective post-processing method to obtain the boxes of those fully-occluded objects.
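Equation 1 amounts to a per-coordinate linear interpolation between the two boxes that bracket the gap. The following is a minimal sketch in plain Python, assuming boxes are (x1, y1, x2, y2) tuples and frames are integer indices. The function name `interpolate_tracklet` and the `max_interval` argument (standing in for σ) are illustrative choices and are not taken from the authors' code.

```python
from typing import Dict, Sequence, Tuple

Box = Tuple[float, ...]


def interpolate_tracklet(
    box_t1: Sequence[float],
    box_t2: Sequence[float],
    t1: int,
    t2: int,
    max_interval: int = 20,  # plays the role of sigma; 20 is the value reported with Table 5
) -> Dict[int, Box]:
    """Return interpolated boxes for every frame t with t1 < t < t2 (Equation 1).

    An empty dict is returned when the gap is longer than max_interval,
    i.e. when t2 - t1 > sigma and no interpolation is performed.
    """
    if t2 <= t1:
        raise ValueError("t2 must be greater than t1")
    if t2 - t1 > max_interval:
        return {}

    filled: Dict[int, Box] = {}
    for t in range(t1 + 1, t2):
        # Fraction of the gap that has elapsed at frame t.
        alpha = (t - t1) / (t2 - t1)
        filled[t] = tuple(a + (b - a) * alpha for a, b in zip(box_t1, box_t2))
    return filled


# Hypothetical tracklet lost between frames 10 and 14.
boxes = interpolate_tracklet((0, 0, 10, 20), (40, 0, 50, 20), t1=10, t2=14)
for frame, box in boxes.items():
    print(frame, box)
```

Because the interpolation only needs the two boxes on either side of the gap, it can run as an offline pass over finished tracklets, which matches its description as a post-processing step.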

[Figure 5 panels: MOT17-02, MOT17-04, MOT17-05, MOT17-09, MOT17-10, MOT17-13]

Figure 5. Visualization results of ByteTrack. We select 6 sequences from the validation set of MOT17 and show the effectiveness of
ByteTrack to handle difficult cases such as occlusion and motion blur. The yellow triangle represents the high score box and the red
triangle represents the low score box. The same box color represents the same identity.

5.4. MOT Challenge Result

We compare ByteTrack with the state-of-the-art trackers on the test sets of MOT17 and MOT20 under the private detection protocol in Table 6 and Table 7, respectively. All the results are directly obtained from the official MOT Challenge evaluation server (https://motchallenge.net).

MOT17. ByteTrack ranks 1st among all the trackers on the leaderboard of MOT17. Not only does it achieve the best accuracy (i.e. 80.3 MOTA, 77.3 IDF1 and 63.1 HOTA), but it also runs at the highest speed (30 FPS). It outperforms the second-best tracker [73] by a large margin (i.e. +3.3 MOTA, +5.3 IDF1 and +3.4 HOTA). Also, we use less training data than many high performance methods such as [81, 32, 63, 53, 33] (29K images vs. 73K images). It is worth noting that we only use the simplest similarity computation method, the Kalman Filter, in the association step, whereas other methods [81, 32, 46, 65, 76, 56] additionally use Re-ID similarity or attention mechanisms. All these indicate that ByteTrack is a simple and strong tracker.

MOT20. Compared with MOT17, MOT20 has much more crowded scenarios and occlusion cases. The average number of pedestrians in an image is 170 in the test set of MOT20. ByteTrack also ranks 1st among all the trackers on the leaderboard of MOT20 and outperforms other state-of-the-art trackers by a large margin on almost all the metrics. For example, it increases MOTA from 68.6 to 77.8 and IDF1 from 71.4 to 75.2, and decreases IDs by 71% from 4209 to 1223. It is worth noting that ByteTrack achieves extremely low identity switches, which further indicates that associating every detection box is very effective under occlusion.

6. Conclusion

We present a simple yet effective data association method, BYTE, for multi-object tracking. BYTE can be easily applied to existing trackers and achieves consistent improvements. We also propose a strong tracker, ByteTrack, which achieves 80.3 MOTA, 77.3 IDF1 and 63.1 HOTA on the MOT17 test set at 30 FPS, ranking 1st among all the trackers on the leaderboard. ByteTrack is very robust to occlusion thanks to its accurate detection performance and the help of associating low score detection boxes. It also sheds light on how to make the best use of detection results to enhance multi-object tracking. We hope the high accuracy, fast speed and simplicity of ByteTrack can make it attractive in real applications.

References

[1] Yolov5. https://github.com/ultralytics/yolov5, 2020.
[2] S.-H. Bae and K.-J. Yoon. Robust online multi-object tracking based on tracklet confidence and online discriminative appearance learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1218–1225, 2014.
[3] J. Berclaz, F. Fleuret, E. Turetken, and P. Fua. Multiple object tracking using k-shortest paths optimization. IEEE transactions on pattern analysis and machine intelligence, 33(9):1806–1819, 2011.

Tracker MOTA↑ IDF1↑ HOTA↑ MT↑ ML↓ FP↓ FN↓ IDs↓ FPS↑
DAN [58] 52.4 49.5 39.3 21.4% 30.7% 25423 234592 8431 <3.9
TubeTK [45] 63.0 58.6 48.0 31.2% 19.9% 27060 177483 4137 3.0
MOTR [76] 65.1 66.4 - 33.0% 25.2% 45486 149307 2049 -
Chained-Tracker [47] 66.6 57.4 49.0 37.8% 18.5% 22284 160491 5529 6.8
CenterTrack [85] 67.8 64.7 52.2 34.6% 24.6% 18498 160332 3039 17.5
QuasiDense [46] 68.7 66.3 53.9 40.6% 21.9% 26589 146643 3378 20.3
TraDes [68] 69.1 63.9 52.7 36.4% 21.5% 20892 150060 3555 17.5
MAT [25] 69.5 63.1 53.8 43.8% 18.9% 30660 138741 2844 9.0
SOTMOT [83] 71.0 71.9 - 42.7% 15.3% 39537 118983 5184 16.0
TransCenter [72] 73.2 62.2 54.5 40.8% 18.5% 23112 123738 4614 1.0
GSDT [65] 73.2 66.5 55.2 41.7% 17.5% 26397 120666 3891 4.9
Semi-TCL [31] 73.3 73.2 59.8 41.8% 18.7% 22944 124980 2790 -
FairMOT [81] 73.7 72.3 59.3 43.2% 17.3% 27507 117477 3303 25.9
RelationTrack [75] 73.8 74.7 61.0 41.7% 23.2% 27999 118623 1374 8.5
PermaTrackPr [60] 73.8 68.9 55.5 43.8% 17.2% 28998 115104 3699 11.9
CSTrack [32] 74.9 72.6 59.3 41.5% 17.5% 23847 114303 3567 15.8
TransTrack [56] 75.2 63.5 54.1 55.3% 10.2% 50157 86442 3603 10.0
FUFET [53] 76.2 68.0 57.9 51.1% 13.6% 32796 98475 3237 6.8
SiamMOT [33] 76.3 72.3 - 44.8% 15.5% - - - 12.8
CorrTracker [63] 76.5 73.6 60.7 47.6% 12.7% 29808 99510 3369 15.6
TransMOT [15] 76.7 75.1 61.7 51.0% 16.4% 36231 93150 2346 9.6
ReMOT [73] 77.0 72.0 59.7 51.7% 13.8% 33204 93612 2853 1.8
ByteTrack (ours) 80.3 77.3 63.1 53.2% 14.5% 25491 83721 2196 29.6

Table 6. Comparison of the state-of-the-art methods under the “private detector” protocol on MOT17 test set. The best results are shown in
bold. MOT17 contains rich scenes and half of the sequences are captured with camera motion. ByteTrack ranks 1st among all the trackers
on the leaderboard of MOT17 and outperforms the second one ReMOT by a large margin on almost all the metrics. It also has the highest
running speed among all the trackers.

Tracker MOTA↑ IDF1↑ HOTA↑ MT↑ ML↓ FP↓ FN↓ IDs↓ FPS↑
MLT [79] 48.9 54.6 43.2 30.9% 22.1% 45660 216803 2187 3.7
FairMOT [81] 61.8 67.3 54.6 68.8% 7.6% 103440 88901 5243 13.2
TransCenter [72] 61.9 50.4 - 49.4% 15.5% 45895 146347 4653 1.0
TransTrack [56] 65.0 59.4 48.5 50.1% 13.4% 27197 150197 3608 7.2
CorrTracker [63] 65.2 69.1 - 66.4% 8.9% 79429 95855 5183 8.5
Semi-TCL [31] 65.2 70.1 55.3 61.3% 10.5% 61209 114709 4139 -
CSTrack [32] 66.6 68.6 54.0 50.4% 15.5% 25404 144358 3196 4.5
GSDT [65] 67.1 67.5 53.6 53.1% 13.2% 31913 135409 3131 0.9
SiamMOT [33] 67.1 69.1 - 49.0% 16.3% - - - 4.3
RelationTrack [75] 67.2 70.5 56.5 62.2% 8.9% 61134 104597 4243 2.7
SOTMOT [83] 68.6 71.4 - 64.9% 9.7% 57064 101154 4209 8.5
ByteTrack (ours) 77.8 75.2 61.3 69.2% 9.5% 26249 87594 1223 17.5

Table 7. Comparison of the state-of-the-art methods under the “private detector” protocol on MOT20 test set. The best results are shown in
bold. The scenes in MOT20 are much more crowded than those in MOT17. ByteTrack ranks 1st among all the trackers on the leaderboard
of MOT20 and outperforms the second one SOTMOT by a large margin on all the metrics. It also has the highest running speed among all
the trackers.
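The margins quoted in Section 5.4 can be re-derived from the rows of Table 6 and Table 7. The short check below is only a sketch: the numbers are copied from those two tables, while the helper names `margin` and `relative_reduction` are illustrative and not part of any MOT Challenge tooling.

```python
# Headline numbers copied from Table 6 (MOT17 test) and Table 7 (MOT20 test).
MOT17 = {
    "ByteTrack": {"MOTA": 80.3, "IDF1": 77.3, "HOTA": 63.1},
    "ReMOT": {"MOTA": 77.0, "IDF1": 72.0, "HOTA": 59.7},
}
MOT20 = {
    "ByteTrack": {"MOTA": 77.8, "IDF1": 75.2, "IDs": 1223},
    "SOTMOT": {"MOTA": 68.6, "IDF1": 71.4, "IDs": 4209},
}


def margin(ours, other, keys):
    """Absolute gain per metric, rounded to one decimal as in the tables."""
    return {k: round(ours[k] - other[k], 1) for k in keys}


def relative_reduction(before, after):
    """Fractional decrease from `before` to `after`."""
    return (before - after) / before


mot17_gain = margin(MOT17["ByteTrack"], MOT17["ReMOT"], ["MOTA", "IDF1", "HOTA"])
mot20_gain = margin(MOT20["ByteTrack"], MOT20["SOTMOT"], ["MOTA", "IDF1"])
ids_drop = relative_reduction(MOT20["SOTMOT"]["IDs"], MOT20["ByteTrack"]["IDs"])

print("MOT17 gain over ReMOT:", mot17_gain)
print("MOT20 gain over SOTMOT:", mot20_gain)
print(f"MOT20 identity switches reduced by {ids_drop:.0%}")
```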

[4] P. Bergmann, T. Meinhardt, and L. Leal-Taixe. Tracking without bells and whistles. In ICCV, pages 941–951, 2019.
[5] K. Bernardin and R. Stiefelhagen. Evaluating multiple object tracking performance: the clear mot metrics. EURASIP Journal on Image and Video Processing, 2008:1–10, 2008.
[6] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. Torr. Fully-convolutional siamese networks for object tracking. In European conference on computer vision, pages 850–865. Springer, 2016.
[7] A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft. Simple online and realtime tracking. In ICIP, pages 3464–3468. IEEE, 2016.
[8] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao. Yolov4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934, 2020.
[9] G. Brasó and L. Leal-Taixé. Learning a neural solver for multiple object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6247–6257, 2020.
[10] Z. Cai and N. Vasconcelos. Cascade r-cnn: Delving into high quality object detection. In CVPR, pages 6154–6162, 2018.
[11] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko. End-to-end object detection with transformers. In European Conference on Computer Vision, pages 213–229. Springer, 2020.
[12] L. Chen, H. Ai, Z. Zhuang, and C. Shang. Real-time multiple people tracking with deeply learned candidate selection and person re-identification. In 2018 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6. IEEE, 2018.
[13] P. Chu, H. Fan, C. C. Tan, and H. Ling. Online multi-object tracking with instance-aware tracker and dynamic model refreshment. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 161–170. IEEE, 2019.
[14] P. Chu and H. Ling. Famnet: Joint learning of feature, affinity and multi-dimensional assignment for online multiple object tracking. In ICCV, pages 6172–6181, 2019.
[15] P. Chu, J. Wang, Q. You, H. Ling, and Z. Liu. Transmot: Spatial-temporal graph transformer for multiple object tracking. arXiv preprint arXiv:2104.00194, 2021.
[16] P. Dendorfer, H. Rezatofighi, A. Milan, J. Shi, D. Cremers, I. Reid, S. Roth, K. Schindler, and L. Leal-Taixé. Mot20: A benchmark for multi object tracking in crowded scenes. arXiv preprint arXiv:2003.09003, 2020.
[17] C. Dicle, O. I. Camps, and M. Sznaier. The way they move: Tracking multiple targets with similar appearance. In Proceedings of the IEEE international conference on computer vision, pages 2304–2311, 2013.
[18] P. Dollár, C. Wojek, B. Schiele, and P. Perona. Pedestrian detection: A benchmark. In CVPR, pages 304–311. IEEE, 2009.
[19] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[20] A. Ess, B. Leibe, K. Schindler, and L. Van Gool. A mobile vision system for robust multi-person tracking. In CVPR, pages 1–8. IEEE, 2008.
[21] P. Felzenszwalb, D. McAllester, and D. Ramanan. A discriminatively trained, multiscale, deformable part model. In CVPR, pages 1–8. IEEE, 2008.
[22] J. Fu, L. Zong, Y. Li, K. Li, B. Yang, and X. Liu. Model adaption object detection system for robot. In 2020 39th Chinese Control Conference (CCC), pages 3659–3664. IEEE, 2020.
[23] Z. Ge, S. Liu, Z. Li, O. Yoshie, and J. Sun. Ota: Optimal transport assignment for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 303–312, 2021.
[24] Z. Ge, S. Liu, F. Wang, Z. Li, and J. Sun. Yolox: Exceeding yolo series in 2021. arXiv preprint arXiv:2107.08430, 2021.
[25] S. Han, P. Huang, H. Wang, E. Yu, D. Liu, X. Pan, and J. Zhao. Mat: Motion-aware multi-object tracking. arXiv preprint arXiv:2009.04794, 2020.
[26] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. In ICCV, pages 2961–2969, 2017.
[27] A. Hornakova, R. Henschel, B. Rosenhahn, and P. Swoboda. Lifted disjoint paths with application in multiple object tracking. In International Conference on Machine Learning, pages 4364–4375. PMLR, 2020.
[28] R. E. Kalman. A new approach to linear filtering and prediction problems. J. Fluids Eng., 82(1):35–45, 1960.
[29] T. Khurana, A. Dave, and D. Ramanan. Detecting invisible people. arXiv preprint arXiv:2012.08419, 2020.
[30] H. W. Kuhn. The hungarian method for the assignment problem. Naval research logistics quarterly, 2(1-2):83–97, 1955.
[31] W. Li, Y. Xiong, S. Yang, M. Xu, Y. Wang, and W. Xia. Semi-tcl: Semi-supervised track contrastive representation learning. arXiv preprint arXiv:2107.02396, 2021.
[32] C. Liang, Z. Zhang, Y. Lu, X. Zhou, B. Li, X. Ye, and J. Zou. Rethinking the competition between detection and reid in multi-object tracking. arXiv preprint arXiv:2010.12138, 2020.
[33] C. Liang, Z. Zhang, X. Zhou, B. Li, Y. Lu, and W. Hu. One more check: Making "fake background" be tracked again. arXiv preprint arXiv:2104.09441, 2021.
[34] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. In ICCV, pages 2980–2988, 2017.
[35] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In ECCV, pages 740–755. Springer, 2014.
[36] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia. Path aggregation network for instance segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8759–8768, 2018.
[37] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo. Swin transformer: Hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030, 2021.
[38] Z. Lu, V. Rathod, R. Votel, and J. Huang. Retinatrack: Online single stage joint detection and tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14668–14678, 2020.
[39] J. Luiten, A. Osep, P. Dendorfer, P. Torr, A. Geiger, L. Leal-Taixé, and B. Leibe. Hota: A higher order metric for evaluating multi-object tracking. International journal of computer vision, 129(2):548–578, 2021.
[40] H. Luo, W. Xie, X. Wang, and W. Zeng. Detect or track: Towards cost-effective video object detection/tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 8803–8810, 2019.
[41] T. Meinhardt, A. Kirillov, L. Leal-Taixe, and C. Feichtenhofer. Trackformer: Multi-object tracking with transformers. arXiv preprint arXiv:2101.02702, 2021.
[42] P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh, et al. Mixed precision training. arXiv preprint arXiv:1710.03740, 2017.
[43] A. Milan, L. Leal-Taixé, I. Reid, S. Roth, and K. Schindler. Mot16: A benchmark for multi-object tracking. arXiv preprint arXiv:1603.00831, 2016.
[44] A. Milan, S. Roth, and K. Schindler. Continuous energy minimization for multitarget tracking. IEEE transactions on pattern analysis and machine intelligence, 36(1):58–72, 2013.
[45] B. Pang, Y. Li, Y. Zhang, M. Li, and C. Lu. Tubetk: Adopting tubes to track multi-object in a one-step training model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6308–6318, 2020.
[46] J. Pang, L. Qiu, X. Li, H. Chen, Q. Li, T. Darrell, and F. Yu. Quasi-dense similarity learning for multiple object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 164–173, 2021.
[47] J. Peng, C. Wang, F. Wan, Y. Wu, Y. Wang, Y. Tai, C. Wang, J. Li, F. Huang, and Y. Fu. Chained-tracker: Chaining paired attentive regression results for end-to-end joint multiple-object detection and tracking. In European Conference on Computer Vision, pages 145–161. Springer, 2020.
[48] J. Redmon and A. Farhadi. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
[49] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
[50] H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, and S. Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 658–666, 2019.
[51] E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi. Performance measures and a data set for multi-target, multi-camera tracking. In ECCV, pages 17–35. Springer, 2016.
[52] R. Sanchez-Matilla, F. Poiesi, and A. Cavallaro. Online multi-target tracking with strong and weak detections. In ECCV, pages 84–99. Springer, 2016.
[53] C. Shan, C. Wei, B. Deng, J. Huang, X.-S. Hua, X. Cheng, and K. Liang. Tracklets predicting based adaptive graph tracking. arXiv preprint arXiv:2010.09015, 2020.
[54] S. Shao, Z. Zhao, B. Li, T. Xiao, G. Yu, X. Zhang, and J. Sun. Crowdhuman: A benchmark for detecting human in a crowd. arXiv preprint arXiv:1805.00123, 2018.
[55] P. Sun, Y. Jiang, E. Xie, W. Shao, Z. Yuan, C. Wang, and P. Luo. What makes for end-to-end object detection? In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 9934–9944. PMLR, 2021.
[56] P. Sun, Y. Jiang, R. Zhang, E. Xie, J. Cao, X. Hu, T. Kong, Z. Yuan, C. Wang, and P. Luo. Transtrack: Multiple-object tracking with transformer. arXiv preprint arXiv:2012.15460, 2020.
[57] P. Sun, R. Zhang, Y. Jiang, T. Kong, C. Xu, W. Zhan, M. Tomizuka, L. Li, Z. Yuan, C. Wang, et al. Sparse r-cnn: End-to-end object detection with learnable proposals. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14454–14463, 2021.
[58] S. Sun, N. Akhtar, H. Song, A. S. Mian, and M. Shah. Deep affinity network for multiple object tracking. IEEE transactions on pattern analysis and machine intelligence, 2019.
[59] P. Tang, C. Wang, X. Wang, W. Liu, W. Zeng, and J. Wang. Object detection in videos by high quality object linking. IEEE transactions on pattern analysis and machine intelligence, 42(5):1272–1278, 2019.
[60] P. Tokmakov, J. Li, W. Burgard, and A. Gaidon. Learning to track with object permanence. arXiv preprint arXiv:2103.14258, 2021.
[61] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
[62] C.-Y. Wang, H.-Y. M. Liao, Y.-H. Wu, P.-Y. Chen, J.-W. Hsieh, and I.-H. Yeh. Cspnet: A new backbone that can enhance learning capability of cnn. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 390–391, 2020.
[63] Q. Wang, Y. Zheng, P. Pan, and Y. Xu. Multiple object tracking with correlation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3876–3886, 2021.
[64] W. Wang, E. Xie, X. Li, D.-P. Fan, K. Song, D. Liang, T. Lu, P. Luo, and L. Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. arXiv preprint arXiv:2102.12122, 2021.
[65] Y. Wang, K. Kitani, and X. Weng. Joint object detection and multi-object tracking with graph neural networks. arXiv preprint arXiv:2006.13164, 2020.
[66] Z. Wang, L. Zheng, Y. Liu, Y. Li, and S. Wang. Towards real-time multi-object tracking. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI 16, pages 107–122. Springer, 2020.
[67] N. Wojke, A. Bewley, and D. Paulus. Simple online and realtime tracking with a deep association metric. In 2017 IEEE international conference on image processing (ICIP), pages 3645–3649. IEEE, 2017.
[68] J. Wu, J. Cao, L. Song, Y. Wang, M. Yang, and J. Yuan. Track to detect and segment: An online multi-object tracker. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12352–12361, 2021.
[69] Y. Xiang, A. Alahi, and S. Savarese. Learning to track: Online multi-object tracking by decision making. In ICCV, pages 4705–4713, 2015.
[70] T. Xiao, S. Li, B. Wang, L. Lin, and X. Wang. Joint detection and identification feature learning for person search. In CVPR, pages 3415–3424, 2017.
[71] J. Xu, Y. Cao, Z. Zhang, and H. Hu. Spatial-temporal relation networks for multi-object tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3988–3998, 2019.
[72] Y. Xu, Y. Ban, G. Delorme, C. Gan, D. Rus, and X. Alameda-Pineda. Transcenter: Transformers with dense queries for multiple-object tracking. arXiv preprint arXiv:2103.15145, 2021.
[73] F. Yang, X. Chang, S. Sakti, Y. Wu, and S. Nakamura. Remot: A model-agnostic refinement for multiple object tracking. Image and Vision Computing, 106:104091, 2021.
[74] F. Yang, W. Choi, and Y. Lin. Exploit all the layers: Fast and accurate cnn object detector with scale dependent pooling and cascaded rejection classifiers. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2129–2137, 2016.
[75] E. Yu, Z. Li, S. Han, and H. Wang. Relationtrack: Relation-aware multiple object tracking with decoupled representation. arXiv preprint arXiv:2105.04322, 2021.
[76] F. Zeng, B. Dong, T. Wang, C. Chen, X. Zhang, and Y. Wei. Motr: End-to-end multiple-object tracking with transformer. arXiv preprint arXiv:2105.03247, 2021.
[77] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.
[78] S. Zhang, R. Benenson, and B. Schiele. Citypersons: A diverse dataset for pedestrian detection. In CVPR, pages 3213–3221, 2017.
[79] Y. Zhang, H. Sheng, Y. Wu, S. Wang, W. Ke, and Z. Xiong. Multiplex labeling graph for near-online tracking in crowded scenes. IEEE Internet of Things Journal, 7(9):7892–7902, 2020.
[80] Y. Zhang, C. Wang, X. Wang, W. Liu, and W. Zeng. Voxeltrack: Multi-person 3d human pose estimation and tracking in the wild. arXiv preprint arXiv:2108.02452, 2021.
[81] Y. Zhang, C. Wang, X. Wang, W. Zeng, and W. Liu. Fairmot: On the fairness of detection and re-identification in multiple object tracking. arXiv preprint arXiv:2004.01888, 2020.
[82] Z. Zhang, D. Cheng, X. Zhu, S. Lin, and J. Dai. Integrated object detection and tracking with tracklet-conditioned detection. arXiv preprint arXiv:1811.11167, 2018.
[83] L. Zheng, M. Tang, Y. Chen, G. Zhu, J. Wang, and H. Lu. Improving multiple object tracking with single object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2453–2462, 2021.
[84] L. Zheng, H. Zhang, S. Sun, M. Chandraker, Y. Yang, and Q. Tian. Person re-identification in the wild. In CVPR, pages 1367–1376, 2017.
[85] X. Zhou, V. Koltun, and P. Krähenbühl. Tracking objects as points. In European Conference on Computer Vision, pages 474–490. Springer, 2020.
[86] X. Zhou, D. Wang, and P. Krähenbühl. Objects as points. arXiv preprint arXiv:1904.07850, 2019.
[87] B. Zhu, J. Wang, Z. Jiang, F. Zong, S. Liu, Z. Li, and J. Sun. Autoassign: Differentiable label assignment for dense object detection. arXiv preprint arXiv:2007.03496, 2020.
[88] J. Zhu, H. Yang, N. Liu, M. Kim, W. Zhang, and M.-H. Yang. Online multi-object tracking with dual matching attention networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 366–382, 2018.
[89] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai. Deformable detr: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159, 2020.
