Article

Multiple Object Tracking in Drone Aerial Videos by a Holistic Transformer and Multiple Feature Trajectory Matching Pattern

College of Electronic and Information Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China
* Author to whom correspondence should be addressed.
Drones 2024, 8(8), 349; https://doi.org/10.3390/drones8080349
Submission received: 23 June 2024 / Revised: 22 July 2024 / Accepted: 26 July 2024 / Published: 28 July 2024

Abstract: Drone aerial videos have immense potential in surveillance, rescue, agriculture, and urban planning. However, accurately tracking multiple objects in drone aerial videos faces challenges like occlusion, scale variations, and rapid motion. Current joint detection and tracking methods often compromise accuracy. We propose a drone multiple object tracking algorithm based on a holistic transformer and a multiple feature trajectory matching pattern to overcome these challenges. The holistic transformer captures local and global interaction information, providing precise detection and appearance features for tracking. The tracker includes three components: preprocessing, trajectory prediction, and matching. Preprocessing categorizes detection boxes based on scores, with each category adopting specific matching rules. Trajectory prediction employs the visual Gaussian mixture probability hypothesis density method to integrate visual detection results and forecast object motion accurately. The multiple feature pattern introduces Gaussian, Appearance, and Optimal subpattern assignment distances for different detection box types (the GAO trajectory matching pattern) in the data association process, enhancing tracking robustness. Comparative validations on the vision-meets-drone (VisDrone) and unmanned aerial vehicle benchmark: object detection and tracking (UAVDT) datasets affirm the algorithm's effectiveness: it obtains 38.8% and 61.7% MOTA, respectively. Its potential for seamless integration into practical engineering applications offers enhanced situational awareness and operational efficiency in drone-based missions.

1. Introduction

In recent years, with the rapid development of drone technology, drone aerial videos have become an effective means of acquiring high-resolution, wide-coverage areas and hold significant potential in various applications such as surveillance, rescue operations, agriculture, and urban planning [1]. Drone aerial videos capture a wide range of object categories, including human activities, vehicles, buildings and infrastructure, and natural environments, among others, providing rich data that endow drones with the capability to monitor and track various objects in different application scenarios. In this context, multiple object tracking (MOT) has become particularly important for processing drone aerial videos, allowing systems to track and monitor multiple objects, thus enabling a more comprehensive range of applications such as object tracking, behavior analysis, and environmental monitoring. However, multi-object tracking in drone aerial videos faces numerous challenges, including object occlusion, object variations at different scales, rapid object motion, complex environmental conditions, and data noise. Traditional multi-object tracking methods have limitations in addressing these issues, thus requiring more advanced techniques to enhance tracking performance [2].
The majority of multi-object tracking methods for drone aerial videos are based on detection. These methods initially identify objects in each frame using object detection algorithms, then employ data association, motion estimation, and filter updating to resolve occlusion and scale variations. Long-term tracking may involve object re-identification to handle object loss. Maintaining object trajectory information and conducting analyses improves the system’s robustness in complex environments. Additionally, to further improve efficiency, some researchers synchronize detection and tracking, integrating both technologies to address the challenges posed by the wide variety and complex appearances of objects in drone aerial videos.
Transformer models, known for their self-attention mechanism and parallel computing capabilities, have revolutionized natural language processing and computer vision [3]. Their versatility extends from vision transformers to full transformer models, enabling breakthroughs in tasks like image classification, object detection, and semantic segmentation, and even branching into action recognition, object tracking, and scene flow estimation. In drone aerial video analyses, transformers offer fresh perspectives for multi-object detection and tracking. Unlike convolutional neural networks, transformers emphasize global context interactions alongside local contexts, enhancing understanding of spatial relationships. However, the computational expense of fine-grained self-attention on high-resolution images poses challenges. Recent studies explore solutions like coarse-grained global or fine-grained local self-attention to alleviate the computational burden, albeit at the cost of the ability to model short- and long-distance visual dependencies simultaneously [4].
Given these challenges and the transformative potential of transformer models, we are motivated to explore and develop advanced multi-object tracking methods that leverage the strengths of transformers. Therefore, we propose a multi-object tracking method named GAO-Tracker, which is based on a holistic transformer and multiple feature trajectory matching pattern, to address various challenges in drone aerial videos. Our goal is to overcome the limitations of traditional approaches and enhance the performance and robustness of MOT in drone aerial videos, enabling more accurate and reliable tracking in diverse and complex environments.
The remaining sections of this paper are organized as follows: Section 2, Related Work, reviews and discusses the latest advancements in the field of multiple object tracking (MOT). We analyze current mainstream and cutting-edge technologies, including object-feature-based methods, joint detection and tracking methods, and transformer-based methods, providing a solid theoretical foundation and practical background for this research. Section 3, Methodology, details our proposed GAO-Tracker method for multi-object tracking. We delve into the core concepts, including the use of a holistic transformer and multiple feature trajectory matching pattern to address various challenges in drone aerial videos. We describe the model structure, algorithm workflow, and implementation details. Section 4, Experiments, presents extensive experiments and performance evaluations of GAO-Tracker. We test the method on several public datasets and compare it with state-of-the-art methods. The results demonstrate GAO-Tracker’s superior performance and robustness in complex scenarios. Section 5, Discussion, provides an in-depth analysis of the experimental results. We discuss GAO-Tracker’s performance in different scenarios, analyze its strengths and limitations, and suggest potential improvements and future research directions. Section 6, Conclusion, summarizes the main contributions and findings of this paper. We reiterate GAO-Tracker’s innovations in enhancing multi-object tracking performance in drone aerial videos and discuss its prospects and potential for practical applications.

2. Related Work

This section aims to comprehensively review and discuss the latest research advancements in the field of multiple object tracking. By deeply analyzing current mainstream and cutting-edge technologies, we establish a solid theoretical foundation and practical background for this study. First, we focus on the basic framework and challenges of multiple object tracking. Then, we detail several core methods: object-feature-based multi-object tracking methods, which achieve continuous tracking by extracting and utilizing the appearance, motion, and other feature information of objects; joint detection and tracking multi-object methods, which tightly integrate object detection and tracking tasks to enhance the overall performance and efficiency of the system; and finally, transformer-based multi-object tracking methods, given the transformer model’s outstanding performance in sequence data processing. We explore how these methods utilize attention mechanisms to achieve precise and robust object tracking in complex scenarios. Through this review and analysis, we not only present the latest achievements in the MOT field but also highlight the current research gaps and shortcomings, leading to the research motivation and main contributions of this paper. Our goal is to provide new insights and solutions for the development of multi-object tracking technology.

2.1. Multiple Object Tracking

Multi-object tracking is a highly regarded technology, and its wide range of applications has attracted widespread interest among scholars. In the early stages of research, researchers primarily focused on applying optimization algorithms to derive object trajectories [5]. The IOUTracker, which relies solely on the bounding box intersection over union (IOU), was the simplest early multi-object tracking method [6]. Researchers gradually introduced motion models and Kalman filters to predict the positions of objects in the next frame [7]. Although these improvements made multi-object tracking algorithms faster and significantly improved their performance, the algorithms performed poorly in complex occlusion and object loss situations. To address these challenges, researchers introduced re-identification (ReID) features as appearance models, using visual features of objects between different frames to match objects and improve the accuracy of associations between trajectories and detection results [8]. In addition to ReID, some studies have utilized image segmentation techniques to identify and track objects, thereby better handling occlusion situations [9]. Furthermore, some researchers have begun to use recurrent neural networks or attention mechanisms to model the spatiotemporal relationships between objects, thereby improving tracking accuracy and stability. However, these methods often employ a single matching approach, neglecting the different characteristics of different types of objects. Moreover, introducing these different technological approaches into tracking systems can result in suboptimal tracking results, limiting effectiveness.

2.2. Object-Feature-Based Multi-Object Tracking Methods

Benefiting from the rapid development of object detectors, object feature modeling has become widely used in multi-object tracking algorithms from the perspective of drones. It achieves multi-object tracking by capturing unique features of objects such as color, texture, and optical flow. These extracted features must be distinctive in order to discriminate different objects in the feature space effectively. Once these features are extracted, similarity criteria can be utilized to find the most similar objects in the next frame, thus enabling multi-object tracking. SCTrack adopts a three-stage data association method that combines object appearance models, spatial distances, and explicit occlusion handling units. The system relies on the motion patterns of tracked objects and considers environmental constraints, thus exhibiting good performance in handling occluded objects [10]. To address the issue of the subjective setting of fusion ratios between appearance and motion, which often merge appearance similarity and motion consistency in the latest frame, the appearance similarity between objects and surrounding objects is computed, object motion is predicted using Social LSTM networks, and weighted appearance similarity and motion predictions are used to generate associations between the current object and the object in the previous frame [11]. However, due to the significant increase in computational costs, false detections, drone aerial backgrounds, and other issues associated with handling large numbers of object detections and association computations, these methods need to overcome various challenges in maintaining accuracy while mitigating computational costs, false detections, object associations, and so on.

2.3. Joint Detection and Tracking Multi-Object Methods

To enhance the computational speed of the entire drone aerial multi-object tracking system, researchers have actively explored methods that combine object detection and feature extraction to achieve greater sharing in computation. JDE was the first attempt at this approach and innovatively integrated the feature extraction branch into the single-stage detector YOLOv3 [12]. In contrast, FairMOT balanced the handling of detection and recognition tasks by adopting the anchor-free detector CenterNet to reduce anchor ambiguities [13]. In addition to these joint detection and feature embedding methods, several other single-stage trackers have emerged. GLOA designed global–local perception blocks to extract scale variance feature information from input frames and added identity embedding branches to the prediction heads to output more discriminative identity information [14]. CenterTrack [15] and Chained Tracker [16], on the other hand, use multi-frame methods to predict bounding boxes in consecutive frames, facilitating efficient short-term associations that eventually form long-term object trajectories. However, it is essential to note that these technologies often generate many identity switches due to the difficulty of capturing long-term dependencies. Additionally, these methods cannot simultaneously consider multiple features of objects and differences in features among categories, so tracking of some small objects is easily lost.

2.4. Transformer-Based Multi-Object Tracking Methods

In recent years, transformer-based models have achieved significant success in computer vision, primarily excelling in object detection. This has given rise to several transformer-based methods making strides in drone multi-object tracking. Methods based on DETR [17] and its derivative models, such as TransTrack [18], TrackFormer [19], and MOTR [20], represent the forefront of online tracking and training progress in the field of MOT. Swin-JDE leverages transformers and comprehensively considers three factors, namely detection confidence, appearance embedding distance, and IoU distance, to match each trajectory with the detections. Furthermore, MOTR achieves end-to-end object tracking by iteratively updating tracking queries, eliminating the need for complex post-processing steps. MeMOT [21], similar to MOTR, utilizes attention mechanisms to make predictions by focusing on object states. Despite pioneering new tracking paradigms, these methods still fall short of advanced tracking algorithms. While standard self-attention can capture fine-grained short- and long-distance interactions, executing attention on high-resolution feature maps is expensive, leading to explosive growth in time and memory costs. This paper addresses this issue through a holistic self-attention module.
Therefore, we propose a multi-object tracking method named GAO-Tracker based on a holistic transformer and a multiple feature trajectory matching pattern to address the various challenges in drone aerial videos. The effectiveness of the proposed method is validated through a series of experiments and quantitative analyses, we compare it with excellent methods of the same kind, and we provide new insights and methods for multi-object tracking in drone applications. The main contributions are as follows:
(1) A framework named GAO-Tracker, which integrates object detection and tracking in a joint detection and tracking framework for drone aerial videos, is proposed. The framework employs a holistic transformer as the core model for object detection and includes a GAO trajectory matching algorithm based on object features in drone aerial videos to achieve efficient and precise multi-object tracking.
(2) The holistic transformer, which combines fine-grained local interactions and coarse-grained global interactions, is proposed. The framework includes an object detector holistic trans-detector using a joint anchor-free detection head to achieve accurate object detection in drone aerial videos.
(3) A multi-object trajectory prediction and matching module named the GAO-trajectory matching pattern is proposed; it comprehensively considers the appearance features, motion characteristics, and size features of objects and trajectories. It includes three matching modes: Gaussian-IOU, Appear-IOU, and OSPA-IOU, fully exploiting various object and trajectory information to achieve robust tracking of multiple objects in drone aerial videos.
(4) Using the prior information of the object’s position from the previous frame and combining it with object visual features, a visual Gaussian mixture probability hypothesis density (VGM-PHD) trajectory predictor tailored to the features of drone aerial videos is designed to provide accurate trajectory information for trajectory matching.

3. Methodology

The proposed multi-object tracking system for drone aerial videos consists of the holistic trans-detector module and the GAO-trajectory matching pattern trajectory association module. The holistic trans-detector model is an anchor-free object detector and feature extraction module that integrates holistic self-attention, combining fine-grained local and coarse-grained global interactions. In this new mechanism, each token finely attends to its nearest surrounding tokens and coarsely attends to its distant surrounding tokens, effectively capturing short-term and long-term visual dependencies. The GAO-trajectory matching pattern trajectory association module handles the data association process by simultaneously considering detection confidence, appearance embedding distance, and IOU distance, thereby enhancing the tracking robustness of the MOT model. The framework is illustrated in Figure 1.

3.1. Holistic Trans-Detector: Object Detection and Feature Extraction

In order to adapt to high-resolution visual tasks, high-resolution feature maps are produced in the early stages. The entire model adopts a hierarchical design consisting of four stages, each reducing the resolution of the input feature map and expanding the receptive field layer by layer, like a CNN. The framework is shown in Figure 2. At the input, a patch embedding layer splits the image into individual blocks and projects each block into an embedding vector. Each stage is composed of multiple holistic transformer layers. The specific structure of the holistic transformer layer is shown in Figure 3; it is mainly composed of LayerNorm, an MLP (multi-layer perceptron), and holistic attention.
An image with a resolution of H × W × 3 is first divided into blocks of size 4 × 4, resulting in (H/4) × (W/4) patches of dimension 4 × 4 × 3. These patches are then projected into features of dimension d using a convolutional layer whose kernel size and stride are both equal to 4. Given this spatial feature map, it is passed through four stages of concatenated holistic transformer layers, which contain 2, 2, 18, and 2 holistic transformer layers, respectively. This configuration aims to capture increasingly complex features at different levels of abstraction: the two layers in each of the initial stages capture low-level features, the 18 layers of the third stage focus on learning high-level, complex features, and the two layers of the final stage refine these features for precise tracking. After each stage, a patch embedding layer is added to reduce the spatial dimensions of the feature map by half while doubling the feature dimension. Finally, the feature maps from all four stages are sent to the detection head, which simultaneously outputs appearance feature vectors of the objects for multi-object trajectory matching.
Traditional transformer models face high computational and memory costs with large-scale input data due to the global self-attention mechanism, which considers all tokens in the input sequence. A holistic transformer addresses this by partitioning the input feature map into sub-windows and conducting attention operations on each sub-window, reducing computation and memory usage.
For a feature map x ∈ R^{M×N×d}, we first divide it into partitions of size 4 × 4, with each partition serving as a feature perception core so that attention is perceived within a localized context. We then locate the surrounding context for each window rather than for individual tokens. Sub-window pooling is a core component of the holistic transformer: it divides the input feature map into smaller sub-windows, thereby reducing the number of tokens each attention operation needs to attend to. This segmentation and pooling transforms global attention operations into local operations, making the model more scalable and efficient. The process is illustrated in Figure 4.
Suppose the input feature map is denoted as x ∈ R^{M×N×d}, where M × N represents the spatial dimensions and d the feature dimension. Sub-window pooling is performed in parallel on the feature map at three levels l ∈ {1, 2, 4}: the input feature map x is divided into grids of size l × l for spatial sub-window pooling, followed by a simple linear layer f_p^l, as shown in Equation (1).

x^l = f_p^l(\hat{x}) \in \mathbb{R}^{\frac{M}{l} \times \frac{N}{l} \times d}  (1)

where \hat{x} = \mathrm{Restructure}(x) \in \mathbb{R}^{\frac{M}{l} \times \frac{N}{l} \times d \times (l \times l)}. The pooled feature maps at the different levels l provide rich fine-grained and coarse-grained information.
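As a concrete illustration of Equation (1), the following PyTorch sketch pools an input feature map over l × l grids at the three levels and applies a per-level linear projection. It is a minimal sketch under our own assumptions (mean pooling inside each grid, the module name SubWindowPooling, and the sizes in the usage line are illustrative), not the authors' implementation.

```python
import torch
import torch.nn as nn

class SubWindowPooling(nn.Module):
    """Pool an M x N x d feature map over l x l grids at several levels (sketch)."""
    def __init__(self, dim, levels=(1, 2, 4)):
        super().__init__()
        self.levels = levels
        # one linear layer f_p^l per pooling level, as in Equation (1)
        self.proj = nn.ModuleList([nn.Linear(dim, dim) for _ in levels])

    def forward(self, x):
        # x: (B, M, N, d); M and N are assumed divisible by every level l
        B, M, N, d = x.shape
        pooled = []
        for l, proj in zip(self.levels, self.proj):
            # restructure into (M/l) x (N/l) grids of l*l tokens, mean-pool each grid
            xl = x.view(B, M // l, l, N // l, l, d).mean(dim=(2, 4))
            pooled.append(proj(xl))   # (B, M/l, N/l, d)
        return pooled                  # fine-grained (l=1) to coarse-grained (l=4) maps

# usage: three pooled maps from a 56 x 56 x 96 feature map
maps = SubWindowPooling(dim=96)(torch.randn(1, 56, 56, 96))
```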

3.1.1. Attention Computation

After obtaining the pooled feature maps at all levels, three linear projection layers f_q, f_k, and f_v are used to compute the queries for the first level and the keys and values for all levels, as shown in Equations (2)–(4).

Q = f_q(x^1)  (2)

K^l = f_k(x^l)  (3)

V^l = f_v(x^l)  (4)
To perform holistic self-attention, the surrounding tokens for each query token in the feature map must be extracted. For the queries within the i-th window, Q_i ∈ R^{s_p × s_p × d}, the keys K_i ∈ R^{s × d} and values V_i ∈ R^{s × d} are extracted from the K^l and V^l surrounding the window, where l denotes the level of the keys and values, and s is the sum of the holistic regions from all levels, i.e., s = 8 × 8 + 6 × 6 + 5 × 5. Finally, the holistic self-attention for Q_i is computed as shown in Equation (5).

\mathrm{Attention}(Q_i, K_i, V_i) = \mathrm{Softmax}\left(\frac{Q_i K_i^{T}}{\sqrt{d}} + B\right) V_i  (5)

where B = \{B^l\} is a learnable relative position bias. For the first level, it is parameterized as B^1 ∈ R^{7×7}, while for the other holistic levels, considering their different granularities with respect to the queries, all queries within the window are treated equally; B^l ∈ R^{s_r^l × s_r^l} then represents the relative position bias between the query window and each pooled s_r^l × s_r^l region.
The relative position deviation takes into account the positional relationships between different sub-windows. This allows the model to understand the dependencies between different positions better, thus enabling more accurate attention computation. The introduction of relative position deviation enhances the flexibility and expressive power of the model, enabling it to adapt better to different types of input data.
Since the attention operations for each sub-window are independent, modern hardware and parallel computing frameworks can be leveraged to accelerate the model’s training and inference processes.
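To make Equation (5) concrete, the following sketch computes scaled dot-product attention for a single query window over keys and values gathered from all pooling levels. The window size, feature dimension, and optional bias handling are illustrative assumptions, not the paper's exact implementation.

```python
import math
import torch

def holistic_window_attention(q, k, v, bias=None):
    """Attention for one query window as in Equation (5) (sketch).

    q: (sp*sp, d) query tokens of the window
    k, v: (s, d) keys/values gathered from the surrounding regions of all levels
    bias: optional (sp*sp, s) relative position bias B
    """
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)   # (sp*sp, s)
    if bias is not None:
        scores = scores + bias
    return torch.softmax(scores, dim=-1) @ v          # (sp*sp, d)

# usage with the region sizes quoted above: s = 8*8 + 6*6 + 5*5 = 125
q = torch.randn(4 * 4, 96)      # assuming a 4 x 4 query window and d = 96
k, v = torch.randn(125, 96), torch.randn(125, 96)
out = holistic_window_attention(q, k, v)
```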

3.1.2. Detection Head

We designed an anchor-free prediction head based on the CenterNet architecture and divided it into detection and appearance branches. Through holistic transformer feature extraction, the output feature map is provided to both branches for object detection and appearance embedding. The detection branch consists of three heads, which are used to predict the heatmap, the offset of the object’s center point, and the object’s size, respectively.
The heatmap head is utilized to predict the center position of the object, with an output dimension of h × w × Cls, where h and w represent the height and width of the input feature map, and Cls is the number of detection classes. Each class has its own heatmap output, with each Gaussian peak in the heatmap representing the center position of a detected object. Assuming there are N objects in the current training sample, let (c_x^i, c_y^i) represent the center position of the i-th object, i ∈ [1, N]. Then, the heatmap corresponding to the current training sample is calculated as shown in Equation (6).

M_{xy} = \sum_{i=1}^{N} \exp\left(-\frac{\left(x - \left\lfloor \frac{c_x^i}{4} \right\rfloor\right)^2 + \left(y - \left\lfloor \frac{c_y^i}{4} \right\rfloor\right)^2}{2\sigma_c^2}\right)  (6)

Here, the operator \lfloor a \rfloor returns the largest integer not exceeding a, and σ_c is the standard deviation parameter. M ∈ R^{h×w×Cls} represents the output of the heatmap head, and M_{xy} is the value of M at position (x, y).
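As an illustration of how a training target like Equation (6) can be rendered, the sketch below sums one Gaussian per object center on a single-class heatmap. The fixed σ_c, the stride of 4, and the function name are assumptions for illustration; size-adaptive standard deviations are also common in CenterNet-style heads.

```python
import numpy as np

def render_heatmap(centers, h, w, sigma_c=2.0):
    """Render a single-class heatmap target in the spirit of Equation (6) (sketch).

    centers: iterable of (cx, cy) object centers in input-image coordinates
    h, w: heatmap height and width (input resolution divided by the stride of 4)
    """
    ys, xs = np.mgrid[0:h, 0:w]                        # pixel coordinate grids
    heatmap = np.zeros((h, w), dtype=np.float32)
    for cx, cy in centers:
        gx, gy = np.floor(cx / 4), np.floor(cy / 4)    # map centers onto the heatmap
        heatmap += np.exp(-((xs - gx) ** 2 + (ys - gy) ** 2) / (2 * sigma_c ** 2))
    return heatmap

# usage: two objects on a 128 x 128 heatmap
target = render_heatmap([(100.0, 60.0), (300.0, 220.0)], h=128, w=128)
```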
The box size and center offset heads are used to predict the BBox and the offset of the object's center point, respectively. Let BBox^i = (x_{lt}^i, y_{lt}^i, x_{rb}^i, y_{rb}^i) represent the BBox of the i-th object, where (x_{lt}^i, y_{lt}^i) and (x_{rb}^i, y_{rb}^i) are the top-left and bottom-right coordinates of the object, respectively. Simultaneously, the offset of the center point of the i-th object is defined as shown in Equation (7).

o_{xy}^i = \left(\delta_x^i, \delta_y^i\right) = \left(\frac{c_x^i}{4} - \left\lfloor \frac{c_x^i}{4} \right\rfloor,\ \frac{c_y^i}{4} - \left\lfloor \frac{c_y^i}{4} \right\rfloor\right)  (7)

This helps improve the accuracy of predicting the center position of the object. The term \hat{o} ∈ R^{h×w×2} represents the output of the center offset head, and \hat{o}_{xy}^i represents the offset prediction of the i-th object at position (x, y) on \hat{o}.
The appearance branch is responsible for generating embedding features that assist in identifying the object. Each head consists of a 3 × 3 convolutional layer with 256 channels, followed by a 1 × 1 convolutional layer to produce the final output. The embedding heads of the appearance branch calculate the appearance feature vectors of the object, which are used in the association matching operation for multi-object tracking tasks. Specifically, these appearance feature vectors can be used for association matching to calculate the similarity between the tracker and the detected object. A 128-dimensional vector at position x , y represents the appearance feature vector of the object at that location.

3.2. GAO Trajectory Matching Pattern

Our GAO trajectory matching pattern considers detection confidence, appearance embedding distance, and IoU distance to associate all tracking trajectories with all detection Bboxes. Figure 5 illustrates the architecture of the module. When receiving detection results from the detector output, we add detection Bboxes with confidence scores higher than 0.5 to the high-score detection Bbox set, and those between 0.2 and 0.5 are added to the low-score detection Bbox set.
Initially, predicted trajectories are matched with high-score detection boxes using the Appear-IOU matching method. Unmatched trajectories then undergo secondary matching with low-score detection boxes via Gau-IOU matching, with any remaining unmatched low-score boxes removed. Subsequently, high-score detection boxes that were not initially matched are re-evaluated using optimal subpattern assignment (OSPA)-IOU matching against previously unmatched trajectories from the previous frame. High-score boxes that remain unmatched after both attempts are considered new trajectories, while trajectories that have been continuously unmatched for 30 frames are removed from tracking, with flexibility to adjust this limit based on the video frame rate.
Successful matches update tracking through the update process with matched detection frames. Trajectory prediction involves modeling visual objects’ trajectories as a random finite set, utilizing the visual Gaussian mixture probability hypothesis density to generate prediction information for the tracker, which primes the model for the next frame’s association matching.
The data association process employs four distance metrics, leading to the design of three matching methods: Gau-IOU, Appear-IOU, and OSPA-IOU distance matching.
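The following sketch strings the three matching stages together with Hungarian assignment on precomputed cost matrices. The helper `match`, the rejection threshold, and the convention that indices returned by stages 2 and 3 are local to their sub-matrices are our own illustrative assumptions; the actual pipeline also maintains track states and the 30-frame removal rule described above.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match(cost, reject=0.8):
    """Hungarian matching on a cost matrix; pairs with cost >= reject are refused (sketch)."""
    if cost.size == 0:
        return [], list(range(cost.shape[0])), list(range(cost.shape[1]))
    rows, cols = linear_sum_assignment(cost)
    pairs = [(r, c) for r, c in zip(rows, cols) if cost[r, c] < reject]
    used_r = {r for r, _ in pairs}
    used_c = {c for _, c in pairs}
    return (pairs,
            [r for r in range(cost.shape[0]) if r not in used_r],
            [c for c in range(cost.shape[1]) if c not in used_c])

def gao_cascade(appear_iou_cost, gau_iou_cost, ospa_iou_cost):
    """Three-stage GAO association cascade on precomputed cost matrices (sketch).

    appear_iou_cost: active trajectories x high-score detections
    gau_iou_cost:    active trajectories x low-score detections (rows indexed like appear_iou_cost)
    ospa_iou_cost:   inactive trajectories x high-score detections (columns indexed like appear_iou_cost)
    """
    m1, tr_u, hi_u = match(appear_iou_cost)            # stage 1: Appear-IOU
    m2, _, _ = match(gau_iou_cost[tr_u])               # stage 2: Gau-IOU; leftover low scores dropped
    m3, _, hi_new = match(ospa_iou_cost[:, hi_u])      # stage 3: OSPA-IOU vs. inactive tracks
    return m1, m2, m3, hi_new                          # hi_new would seed new trajectories
```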

3.2.1. Appear-IOU Distance Matching

Appear-IOU trajectory matching considers the appearance and spatial location features of the objects and predicted trajectories, calculating the cosine distance and IOU distance between all predicted trajectories and high-score detections as metrics. The appearance vector of the object contains extensive appearance information, which is combined with the IOU distance of the BBox to enhance the matching accuracy between detection boxes and trajectories. The process is shown in Figure 6.
Let (BBox_d^i, E_d^i) represent the i-th detected object's BBox and its corresponding feature vector in the current frame, and let (BBox_t^j, E_t^j) represent the j-th trajectory-predicted object BBox and its corresponding feature vector from the previous frame. The first distance metric, D_{ij}^I, is computed based on the IOU distance:

D_{ij}^{I} = 1 - \frac{\mathrm{area}\left(BBox_d^i \cap BBox_t^j\right)}{\mathrm{area}\left(BBox_d^i \cup BBox_t^j\right)}
where area(A) represents the area of the input set A, and the symbols ∩ and ∪ denote the intersection and union of two sets. The appearance distance metric D_{ij}^A is calculated from the cosine distance between the two embedding feature vectors:

D_{ij}^{A} = 1 - \frac{E_d^i \cdot E_t^j}{\left\| E_d^i \right\| \left\| E_t^j \right\|}

where · denotes the dot product of two vectors, and ‖·‖ denotes the 2-norm of a vector.
Subsequently, the IOU distances and appearance distances between all detections and trajectories are combined in a weighted manner to obtain the Appear-IOU distance:

D^{AI} = \alpha D_{ij}^{A} + (1 - \alpha) D_{ij}^{I}

where α represents the proportion of the cosine distance, with values ranging between 0 and 1.
Finally, all Appear-IOU distances are merged into a cost matrix, and the Hungarian algorithm is employed to achieve the best match. Unmatched trajectories undergo secondary matching with low-scored detections through the Gau-IOU matching model. In contrast, unmatched high-scored detection boxes undergo secondary matching with inactive trajectories through the OSPA-IOU matching model.
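A minimal sketch of the Appear-IOU cost for one detection–trajectory pair, assuming corner-format boxes (x1, y1, x2, y2), non-zero embeddings, and α = 0.5; the function names and defaults are illustrative only.

```python
import numpy as np

def iou_distance(box_a, box_b):
    """1 - IoU for two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    ih = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = iw * ih
    union = ((box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
             + (box_b[2] - box_b[0]) * (box_b[3] - box_b[1]) - inter)
    return 1.0 - inter / union if union > 0 else 1.0

def appear_iou_distance(box_d, emb_d, box_t, emb_t, alpha=0.5):
    """Weighted combination of cosine and IoU distances (Appear-IOU sketch)."""
    cos_dist = 1.0 - float(np.dot(emb_d, emb_t)
                           / (np.linalg.norm(emb_d) * np.linalg.norm(emb_t)))
    return alpha * cos_dist + (1.0 - alpha) * iou_distance(box_d, box_t)

# usage: one high-score detection against one predicted trajectory
d = appear_iou_distance((10, 10, 50, 80), np.ones(128),
                        (12, 14, 52, 84), np.ones(128))
```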

3.2.2. Gau-IOU Distance Matching

The Gau-IOU distance matching process is illustrated in Figure 7. Low-score detection boxes often correspond to small objects. To better exploit their features, both the low-score detections and the trajectories to be matched are transformed into Gaussian space, and the matching integrates the Wasserstein distance (WD) between the Gaussian distributions of trajectories and detections with the IOU distance between their boxes.
We first transform the BBoxes of the objects and trajectories into Gaussian space using a matrix transformation. For an object box represented by (x, y, h, w), the parameters of the Gaussian distribution N(x | μ, Σ) are computed as:

\mu = (x, y)^{T}

\Sigma = \begin{pmatrix} \frac{w^2}{4} & 0 \\ 0 & \frac{h^2}{4} \end{pmatrix}
The key to matching detection boxes with trajectories is how to calculate the similarity between the Gaussian distribution N_d(x_d | μ_d, Σ_d) of the detection box and N_t(x_t | μ_t, Σ_t) of the trajectory box. We use the Wasserstein distance to compute the distance between the two Gaussian distributions, defined as:

D_W\left(\mathcal{N}_d, \mathcal{N}_t\right) = \left\| \mu_d - \mu_t \right\|^2 + \mathrm{Tr}\left(\Sigma_d\right) + \mathrm{Tr}\left(\Sigma_t\right) - 2\,\mathrm{Tr}\left(\left(\Sigma_d^{\frac{1}{2}} \Sigma_t \Sigma_d^{\frac{1}{2}}\right)^{\frac{1}{2}}\right)
The Wasserstein distance primarily consists of two components: the distance between the center points, represented by (x, y), and a coupling term related to (h, w). Because these parameters form a chain-like coupling relationship and influence each other, the Wasserstein distance is highly advantageous for achieving high-precision matching.
Next, the WD and the IOU distance between all detections and trajectories are weighted to obtain the Gau-IOU distance:

D^{GI} = \beta D_W + (1 - \beta) D_{ij}^{I}
where β represents the proportion of the WD distance and takes values between 0 and 1. Finally, the Hungarian algorithm is employed to achieve the best matching between detections and trajectories based on all Gau-IOU distances. Unmatched trajectories are converted to inactive trajectories, and unmatched low-scored detections are removed.
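The sketch below converts a box to a 2-D Gaussian and evaluates the Wasserstein term; in practice the WD value is usually normalized into [0, 1] before it is mixed with the IOU distance, which is an assumption noted in the comments rather than part of the paper's stated formulation.

```python
import numpy as np
from scipy.linalg import sqrtm

def box_to_gaussian(x, y, w, h):
    """Map a box (center x, center y, width, height) to a 2-D Gaussian (sketch)."""
    mu = np.array([x, y], dtype=float)
    sigma = np.diag([w ** 2 / 4.0, h ** 2 / 4.0])
    return mu, sigma

def wasserstein_distance(mu_d, sig_d, mu_t, sig_t):
    """Squared 2-Wasserstein distance between two Gaussians (D_W sketch)."""
    root = sqrtm(sqrtm(sig_d) @ sig_t @ sqrtm(sig_d))
    return float(np.sum((mu_d - mu_t) ** 2)
                 + np.trace(sig_d) + np.trace(sig_t) - 2.0 * np.trace(root.real))

def gau_iou_distance(d_w, d_iou, beta=0.5):
    """Weighted Gau-IOU distance; d_w is assumed already normalized into [0, 1]."""
    return beta * d_w + (1.0 - beta) * d_iou

# usage: a small low-score detection against a predicted trajectory box
mu_d, sig_d = box_to_gaussian(100, 50, 12, 20)
mu_t, sig_t = box_to_gaussian(103, 52, 14, 22)
dw = wasserstein_distance(mu_d, sig_d, mu_t, sig_t)
```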

3.2.3. OSPA-IOU Distance Matching

The OSPA distance allows for considering subpattern matching of object trajectories, enabling the model to better capture both the similarities and differences between object trajectories. This, in turn, provides a more accurate assessment of tracking performance. Building upon the foundation of IOU distance matching, we comprehensively consider the OSPA distance and propose the OSPA-IOU trajectory matching model. The process is illustrated in Figure 8.
Assume the object state set is X = {x_1, x_2, …, x_m} and the object trajectory set is Y = {y_1, y_2, …, y_n}, where m, n ∈ N_0 = {0, 1, 2, …} represent the estimated and true numbers of objects, respectively, and m ≤ n. The OSPA distance is expressed as:

D_{p,c}(X, Y) = \left( \frac{1}{n} \left( \min_{\pi \in \Pi_n} \sum_{i=1}^{m} d_c\left(x_i, y_{\pi(i)}\right)^p + (n - m)\, c^p \right) \right)^{\frac{1}{p}}
where Π_n represents the set of permutations formed by selecting elements from {1, 2, …, n}. If p = 1, the OSPA distance can be expressed as:

D_{p,c}(X, Y) = e_{p,c}^{loc}(X, Y) + e_{p,c}^{card}(X, Y)

e_{p,c}^{loc}(X, Y) = \left( \frac{1}{n} \min_{\pi \in \Pi_n} \sum_{i=1}^{m} d_c\left(x_i, y_{\pi(i)}\right)^p \right)^{\frac{1}{p}}

e_{p,c}^{card}(X, Y) = \left( \frac{(n - m)\, c^p}{n} \right)^{\frac{1}{p}}
where e p , c l o c x , y and e p , c c a r d x , y represent the positional difference and cardinality difference between the sets of estimated object states and true object states, respectively. The positional difference signifies the spatial gap, while the cardinality difference encompasses performance metrics like the false track proportion, redundancy, and interruptions. The truncation parameter adjusts the balance between positional and cardinality differences, with smaller values prioritizing positional differences. Treating objects as single-element sets and trajectories as multi-element sets, we compute the OSPA distance between them to optimize matching between individual detections and trajectories.
Subsequently, the OSPA distance and the IOU distance between all detections and trajectories are weighted to derive the OSPA-IOU distance:

D^{OI} = \lambda D_{p,c} + (1 - \lambda) D_{ij}^{I}

where λ represents the proportion of the OSPA distance and takes values between 0 and 1. Finally, the Hungarian algorithm is applied to achieve the best matching between detections and trajectories based on all OSPA-IOU distances. Unmatched high-score detections are converted to new inactive trajectories, and inactive trajectories that remain unmatched for 30 frames are removed.
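Below is a minimal NumPy sketch of the OSPA distance between two point sets, using the Hungarian algorithm to realize the minimization over permutations; the Euclidean base distance, the default cut-off c, and the symmetric swap so that m ≤ n are our own assumptions for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def ospa_distance(X, Y, c=1.0, p=1):
    """OSPA distance between point sets X (m, dim) and Y (n, dim) (sketch).

    c is the cut-off parameter and p the order; with p = 1 the result splits into
    the localization and cardinality terms given above.
    """
    m, n = len(X), len(Y)
    if m > n:                       # OSPA is symmetric; swap so that m <= n
        X, Y, m, n = Y, X, n, m
    if n == 0:
        return 0.0                  # both sets empty
    if m == 0:
        return c                    # only the cardinality error remains
    # pairwise cut-off distances d_c(x_i, y_j) = min(c, ||x_i - y_j||)
    d = np.minimum(c, np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1))
    rows, cols = linear_sum_assignment(d ** p)       # best assignment over Pi_n
    loc = np.sum(d[rows, cols] ** p)
    card = (n - m) * c ** p
    return float(((loc + card) / n) ** (1.0 / p))

# usage: one detection treated as a single-element set against a two-element trajectory set
print(ospa_distance(np.array([[10.0, 20.0]]), np.array([[11.0, 21.0], [40.0, 5.0]])))
```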

3.2.4. Visual Gaussian Mixture Probability Hypothesis Density

The visual Gaussian mixture probability hypothesis density (VGM-PHD) filtering algorithm utilizes the center positions of all trajectories as the measurement input for the random finite set, preserving object ID and size data to reconstruct trajectories. Assumptions include representing both spawned and newly born object PHDs as Gaussian mixtures, independence between object detection and survival probabilities, and modeling state transition density and observation likelihood functions as linear Gaussian models.
Both the motion model and observation model of the VGM-PHD filtering algorithm are set to be linear, and noise and errors follow Gaussian distributions. Using the weights, means, and variances of the PHD Gaussian distribution, the algorithm iteratively propagates the multi-object states. The specific implementation steps of the VGM-PHD filtering algorithm are as follows. Assuming the posterior PHD at a specific time is given by the following Gaussian sum form:
T_{k-1}(x) = \sum_{i=1}^{J_{k-1}} \omega_{k-1}^{i}\, \mathcal{N}\left(x;\, m_{k-1}^{i},\, P_{k-1}^{i}\right)

where ω_k^i, m_k^i, and P_k^i represent the weight, mean, and covariance of the i-th Gaussian component at time k for a single object state x, and J_k represents the number of Gaussian components at time k. The function N(·) denotes a Gaussian density. The predicted intensity function at time k is given by:

T_{k|k-1}(x) = T_{S,k|k-1}(x) + T_{\beta,k|k-1}(x) + \gamma_{k}(x)
The three terms on the right side respectively represent the predicted PHDs of surviving objects, spawned objects, and newly born objects. The intensity function obtained from the GM-PHD filtering algorithm update can be expressed as:
T_{k}(x) = \left(1 - P_{D,k}\right) T_{k|k-1}(x) + \sum_{z \in Z_k} T_{D,k}(x;\, z)
where the first term represents the PHD of missed objects, and the second term represents the updated PHD of detected objects. In the VGM-PHD filtering algorithm, if the PHD at time k−1 is a Gaussian mixture, then the prior distribution generated by the prediction at time k and the posterior distribution obtained by the filtering update can both be represented in Gaussian mixture form. The weights are obtained through PHD filtering, while the means and covariances are recursively obtained through Kalman filtering. During the prediction and update of the object PHD in VGM-PHD, the predicted number of objects N_{k|k-1} and the updated number of objects N_k are given by:
N_{k|k-1} = \sum_{i=1}^{J_{k|k-1}} \omega_{k|k-1}^{i} = N_{k-1} P_{S,k} + \sum_{i=1}^{J_{\beta,k}} \omega_{\beta,k}^{i} + \sum_{j=1}^{J_{\gamma,k}} \omega_{\gamma,k}^{j}

N_{k} = \sum_{n=1}^{J_{k}} \omega_{k}^{n} = N_{k|k-1} \left(1 - P_{D,k}\right) + \sum_{z \in Z_k} \sum_{j=1}^{J_{k|k-1}} \omega_{k}^{j}(z)
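For orientation, the following sketch propagates Gaussian components of a PHD one frame ahead under a linear constant-velocity model and reads off the expected object count as the sum of the weights. The transition matrix, noise level, and survival probability are illustrative assumptions; the VGM-PHD predictor additionally injects visual detections, birth, and spawn components, which are omitted here.

```python
import numpy as np

# Illustrative linear-Gaussian motion model (constant velocity) and parameters.
F = np.block([[np.eye(2), np.eye(2)],
              [np.zeros((2, 2)), np.eye(2)]])   # state transition for [x, y, vx, vy]
Q = 0.1 * np.eye(4)                              # process noise covariance
P_S = 0.99                                       # survival probability P_{S,k}

def predict_components(components):
    """Kalman-predict every Gaussian component (weight, mean, covariance)."""
    return [(P_S * w, F @ m, F @ P @ F.T + Q) for w, m, P in components]

def expected_object_count(components):
    """Expected number of objects = sum of component weights."""
    return sum(w for w, _, _ in components)

# usage: one surviving component at (10, 20) moving with unit velocity
comps = [(1.0, np.array([10.0, 20.0, 1.0, 1.0]), np.eye(4))]
comps = predict_components(comps)
print(expected_object_count(comps))   # about 0.99 after the survival probability
```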

4. Experiments

4.1. Dataset and Evaluation Metrics

The proposed algorithm undergoes comprehensive evaluations on the VisDrone MOT [22] and UAVDT [23] datasets, which encompass diverse drone-captured scenes and facilitate a thorough assessment of the proposed methods’ practical effectiveness. Extensive evaluations compare the algorithm with other leading multi-object trackers across various scenarios and conditions. Established MOT evaluation metrics are utilized to assess performance comprehensively, with the aim of gauging overall effectiveness and pinpointing potential weaknesses in each model. The metrics include:
(1) FP (↓): Number of false positives in the entire video.
(2) FN (↓): Number of false negatives in the entire video.
(3) IDSW (↓): Number of identity switches in the entire video.
(4) FM (↓): Number of ground truth trajectories interrupted during the tracking process.
(5) IDF1 (↑): Ratio of correctly identified detections to the average of the number of computed detections and ground-truth detections.
(6) MOTA (↑): Combines FN, FP, and IDSW, computed as:

\mathrm{MOTA} = 1 - \frac{FN + FP + IDSW}{GT}

where GT is the total number of ground-truth objects.
(7) MOTP (↑): Measures the mismatch between the ground truth and the predicted results, calculated as:

\mathrm{MOTP} = 1 - \frac{\sum_{t,i} d_{t,i}}{\sum_{t} c_t}

where d_{t,i} is the distance for the i-th matched pair in frame t, and c_t is the number of matches in frame t.
These metrics contribute to a comprehensive assessment of MOT algorithm performance in various aspects, providing in-depth insights into system effectiveness.
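The two formulas above reduce to a few lines of arithmetic; the sketch below assumes the aggregate counts have already been accumulated over the whole video, and the numbers in the usage lines are made up purely for illustration.

```python
def mota(fn, fp, idsw, gt):
    """MOTA = 1 - (FN + FP + IDSW) / GT, with GT the number of ground-truth objects."""
    return 1.0 - (fn + fp + idsw) / gt

def motp(distances, matches):
    """MOTP = 1 - (sum of match distances) / (number of matches), accumulated per frame."""
    return 1.0 - sum(distances) / sum(matches)

# usage with illustrative counts only
print(mota(fn=1200, fp=800, idsw=150, gt=10000))        # 0.785
print(motp(distances=[0.20, 0.30, 0.25], matches=[3, 4, 3]))
```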

4.2. Training Preprocessing

Existing MOT methods integrating object detection and appearance embedding often use a single-stage training approach in which the detection and appearance branches are trained simultaneously. While this reduces training time, it can harm detection performance due to the differing learning objectives. In densely populated scenes, fully occluded objects may still have annotated bounding boxes in the training dataset, which can introduce errors when learning appearance embeddings and reduce tracking accuracy. To address this, our proposed model filters highly occluded objects from the training samples before commencing model training. To implement this, we first define a metric variable B_{overlap} ∈ [0, 1] to gauge the overlap between two ground truth Bboxes; the metric is defined as follows:

B_{overlap} = \frac{\mathrm{area}\left(BBox_{GT}^{i} \cap BBox_{GT}^{j}\right)}{\mathrm{area}\left(BBox_{GT}^{i} \cup BBox_{GT}^{j}\right)}

where BBox_{GT}^{i} and BBox_{GT}^{j} represent the i-th and j-th ground truth BBoxes of the input training samples, respectively. A higher value indicates greater overlap between the two ground-truth BBoxes. In object detection, a value of B_{overlap} ≥ 0.75 signifies substantial overlap between two BBoxes. Therefore, in this study, we set the threshold at B_{overlap} ≥ 0.75, treat the smaller of the two BBoxes as an occluded object, and exclude it from the training dataset. We ultimately train the model using the filtered dataset.
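A minimal sketch of this preprocessing filter, under the assumption of corner-format ground-truth boxes and the rule of dropping the smaller box of any pair whose overlap reaches the 0.75 threshold; function and variable names are illustrative.

```python
def filter_occluded(gt_boxes, thresh=0.75):
    """Drop the smaller of any two ground-truth boxes whose overlap >= thresh (sketch).

    gt_boxes: list of (x1, y1, x2, y2) annotations for one training image.
    """
    def area(b):
        return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

    def b_overlap(a, b):
        iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
        ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = iw * ih
        union = area(a) + area(b) - inter
        return inter / union if union > 0 else 0.0

    keep = set(range(len(gt_boxes)))
    for i in range(len(gt_boxes)):
        for j in range(i + 1, len(gt_boxes)):
            if b_overlap(gt_boxes[i], gt_boxes[j]) >= thresh:
                # treat the smaller box as the occluded object and drop it
                keep.discard(i if area(gt_boxes[i]) < area(gt_boxes[j]) else j)
    return [gt_boxes[k] for k in keep]

# usage: the second, smaller box heavily overlaps the first and is removed
print(filter_occluded([(0, 0, 100, 100), (5, 5, 95, 95), (200, 200, 240, 260)]))
```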

4.3. Experimental Settings

The detector is initialized with pre-existing weights obtained from training on the COCO dataset. We train the detector using SGD with the following parameters: 150 epochs, a batch size of 16, a learning rate of 0.02, momentum set to 0.9, and decay set to 0.0001. We train the detector on both the VisDrone and UAVDT datasets and perform validation using the same set of verification images. We execute the testing on hardware (NVIDIA RTX 4090 with 24 GB of memory) and calculate the average of the top-100 most reliable detection results.

4.4. Comparative Experiments

4.4.1. Detection Comparison

To compare the performance of our detector, we select a total of seven excellent detectors: DETR [17], Deformable DETR [24], YOLO-S [25], Swin-JDE [18], VitDet [26], RTD-Net [27], and DN-DETR [28]. They are trained and evaluated on the VisDrone and UAVDT datasets using the experimental settings described in their respective papers. DETR completely discards traditional object detection components such as anchor boxes and non-maximum suppression and utilizes a complete attention mechanism for end-to-end object detection. Deformable DETR is an improved version of DETR that introduces deformable attention to enhance the model’s adaptability to changes in object shape and scale. YOLO-S employs a small feature extractor, skip connections, cascaded skip connections, and a reshaping pass-through layer to facilitate cross-network feature reuse, combining low-level positional information with more meaningful high-level information. The Swin-JDE algorithm adopts a Swin transformer based on windowed self-attention as the backbone network to enhance feature extraction capabilities. ViTDet utilizes ViT as the backbone for a Mask R-CNN object detection model, enhancing competitiveness by optimizing the RPN section. RTD-Net replaces positional linear projection with convolutional projection and uses an efficient convolutional multi-head self-attention algorithm based on convolutional transformer blocks to improve the recognition of occluded objects by extracting contextual information. DN-DETR introduces a novel denoising training approach to address the instability of bipartite graph matching in the DETR decoder during training, doubling the convergence speed and significantly improving the detection results.
The comparative results in Table 1 demonstrate the substantial advantages of our detection performance. AP is the average precision, and AP@0.5 and AP@0.75 denote average precision at intersection-over-union thresholds of 50% and 75%, respectively. APs, APm, and APl are the average precisions for small objects (with an area less than 32 × 32 pixels), medium objects (with an area between 32 × 32 and 96 × 96 pixels), and large objects (with an area greater than 96 × 96 pixels), respectively. The visual comparison results in Figure 9 and Figure 10 show that our results exhibit excellent performance under various lighting conditions and crowded environments.

4.4.2. Tracking Comparison

We compared DeepSORT [29], ByteTrack [30], BoT-SORT [31], UAVMOT [32], DCMOT [33], TFAM [34], MTTJDT [35], and SimpleTrack [36] as well as transformer-based methods including TransTrack [37], TrackFormer [38], TransCenter [39], MOTR [20], MeMOT [21], GTR [40], TR-MOT [41], GCEVT [42], STN-Track [43], and STDFormer [19]. These comparisons were conducted on the VisDrone MOT and UAVDT datasets.
To ensure consistent comparisons despite variations in object distributions across datasets, we employed the holistic trans-detector to produce uniform detection results for all tracking comparison methods. This approach mitigates evaluation bias stemming from uneven category distributions, fostering fairer and more reliable tracking method comparisons. To maintain detection accuracy across categories during evaluation, distinct thresholds were applied: 0.3 for cars, 0.1 for trucks, and 0.4 for pedestrians, with a lower threshold of 0.05 for buses, which present greater visual variability.
Table 2 and Table 3 comprehensively compare GAO-Tracker with other popular trackers on the VisDrone MOT and UAVDT datasets. The evaluation includes critical metrics such as MOTA, MOTP, IDF1, and IDSW and comparisons with other methods. GAO-Tracker demonstrates excellent performance by effectively utilizing position and appearance information. DeepSORT associates categories independently using positional information. ByteTrack utilizes low-scoring detection for similarity tracking and background noise filtering. BoT-SORT incorporates camera motion compensation for improved matching. UAVMOT enhances object feature association with an ID feature update module. SimpleTrack merges object embedding cosine and GIOU distances to create a new association matrix. Transformer-based methods like TransTrack employ a query–key mechanism for existing object tracking and new object detection. TrackFormer considers position, occlusion, and object recognition features simultaneously. TransCenter predicts the association’s heatmap of object centers globally. MOTR models the entire trajectory of an object using a tracking query. MeMOT uses information from previous frames for tracking clues. GTR extends the window length for matching and utilizes interaction information fully. TR-MOT achieves reliable associations using visual temporal features. STDFormer utilizes the transformer’s remote modeling capability for intent and decision information extraction. However, these methods apply a single matching rule for all detection classes, leading to inaccurate tracking of various object classes and poorer performance.
Combining the data from Table 2 and Table 3, we observe that transformer-based methods outperform motion-based methods. This trend reflects the effectiveness and superiority of transformer-based methods for multi-object tracking in drone aerial videos: they better capture long-distance dependencies between objects in complex environments and better handle challenges such as object occlusion and scale changes.
Figure 11 and Figure 12 show time-order frames with bounding boxes and different-colored identities. In the initial images (left), bounding boxes may appear inconsistent due to occlusion. However, in the final images (right), GAO-Tracker maintains consistent bounding boxes, reducing the identity switching of pedestrians. The center images show intermediate steps where identities might temporarily switch due to occlusions or overlaps. The final images (right) demonstrate GAO-Tracker’s ability to preserve identities throughout the sequence, even in crowded scenarios. By utilizing object motion information, GAO-Tracker’s trajectory association technology effectively solves the problems of missed detection and incorrect detection caused by occlusion, especially in the case of short-term overlapping objects. Compared with previous algorithms based on bounding box connections, GAO-Tracker reduces pedestrian identity switching. The results indicate that GAO-Tracker performs well in crowded scenarios of drone aerial videos and ensures consistent bounding boxes and identities throughout the entire sequence.

4.5. Ablation Experiments

To demonstrate the effectiveness of the designed method, we conducted multiple sets of ablation experiments on training preprocessing strategies, the GAO module, the sequence of various matching strategies, and VGM-PHD on the VisDrone and UAVDT datasets.

4.5.1. Effect of Backbone

To validate the effectiveness of our holistic trans as the backbone network, we compared it with ResNet50, DLA-34, ViT, and Swin-L in ablation experiments. Table 4 presents the performance evaluation results of the proposed GAO-Tracker combined with different backbone networks. This experiment used the proposed data association method as the post-processing module and evaluated the UAVDT and VisDrone test datasets. Based on the results in Table 4, we have the following findings. In the evaluation results on UAVDT, using DLA-34 as the backbone network yielded the best performance, with MOTA, MOTP, and IDF1 scores reaching 61.9%, 75.1%, and 66.4%, respectively. Additionally, using the holistic trans backbone network resulted in the lowest IDSW count. In the evaluation results on VisDrone, compared to ResNet50, DLA-34, ViT, and Swin-L, the holistic trans backbone network achieved 38.8% MOTA, 76.3% MOTP, and 54.3% IDF1 as well as a significant reduction in FP. Since VisDrone contains many congested scenes, the experimental results indicate that the holistic trans backbone network can improve MOT performance in crowded scenarios. The tracking performance of the DLA-34 backbone network was the best on UAVDT but significantly worse on VisDrone; in contrast, the holistic trans backbone network performed worse on UAVDT but best on VisDrone. The MOTA increase and the FP decrease with the holistic trans backbone network indicate that our model significantly enhances the detection of correct objects.
Based on the observations above, it can be concluded that the backbone network significantly impacts the tracking performance of multi-object trackers depending on the density of tracking objects in the scene. Therefore, improving the feature extraction capability of the backbone network model is a crucial factor affecting the tracking performance of multi-object trackers.

4.5.2. Impact of Pre-Processing and Detection Results Classification

During the training process of the multi-object tracking model, we attempted to train the network after removing highly overlapped objects in order to provide efficient and accurate appearance embedding information for multi-object tracking matching. We also explored the impact of classifying high- and low-score detection boxes. As shown in Table 5, we verified the effectiveness of each choice by adding or omitting the training set optimization and the detection score branch. "Pre" indicates training with highly overlapped objects removed, while "Grade" indicates that the model distinguished between high- and low-score detection boxes before inputting them into the GAO trajectory association pattern.
The results indicate that removing ground truth Bbox annotations for occluded objects can reduce errors in learning appearance embeddings, thereby improving the accuracy of tracked object identification. By differentiating between low- and high-scoring detection boxes, it is possible to effectively reduce trajectory fragmentation and IDSW, thus enhancing the effectiveness and performance of object tracking. Additionally, by using preprocessing and detection result classification, the MOTA, MOTP, and IDF1 on VisDrone improved by 2.6%, 5.4%, and 1.8%, respectively, while on UAVDT, they improved by 2.6%, 5.4%, and 1.8% respectively.

4.5.3. Impact of Matching Strategies

We validated the individual contributions of each component by combining different association strategies, as shown in Table 6. The baseline uses IOU matching for all associations, and we gradually replace it with Appear-IOU, Gau-IOU, and OSPA-IOU on top of the baseline. The results indicate that all three proposed association strategies effectively enhance the accuracy of tracking associations. The baseline model shows significantly higher FP and more IDSW, indicating a higher number of false detections introduced by the model, which results in poor trajectory matching quality and increased identity switching. After replacing the high-score detection box matching strategy with Appear-IOU, MOTA and IDF1 showed noticeable improvements; FP increased slightly, while the strong detection capability significantly reduced FN. After replacing the low-score detection box matching strategy with Gau-IOU, MOTA and MOTP improved significantly, and IDSW decreased substantially, demonstrating the effectiveness of matching smaller low-score detection boxes in Gaussian space. After substituting the OSPA-IOU distance-based object-to-trajectory matching method, in which high-score detection boxes are treated as collections of individual trajectories and matched against trajectory collections, all metrics improved. These results indicate that our various strategies contribute to better overall tracking performance.

4.5.4. Impact of VGM-PHD

We designed ablation experiments to validate the effectiveness of the VGM-PHD method. We compared this method against no trajectory prediction and the use of a Kalman filter. The results are presented in Table 7. The findings indicate that VGM-PHD exhibits higher prediction accuracy and robustness compared to no trajectory prediction and the Kalman filter across multiple scenarios. In complex environments particularly, the new method successfully overcomes the limitations of traditional approaches, enhancing the accuracy of predicting future positions of moving objects. Moreover, the decrease in IDSW and the increase in IDF1 suggest improved stability in trajectory tracking. Consequently, overall tracking performance is enhanced.

5. Discussion

In this paper, we have integrated the strengths of joint detection and visual multi-object tracking algorithms with transformer-based visual multi-object tracking algorithms to address the unique challenges posed by drone aerial videos. Our proposed GAO-Tracker, which models object motion information, has demonstrated significant improvements in tracking performance, particularly in complex real-world scenarios.

5.1. Performance Analysis

GAO-Tracker’s performance on the VisDrone and UAVDT datasets has shown remarkable results, surpassing existing state-of-the-art methods in terms of both accuracy and robustness. The integration of the transformer model’s global context capturing capabilities with the joint detection and tracking methods’ handling of occlusions and scale variations has proven effective. The results indicate that our approach can maintain high tracking accuracy even in challenging environments that are characterized by rapid object motion, complex backgrounds, and varying object scales.

5.2. Strengths

(1) Enhanced accuracy: By leveraging the transformer model’s self-attention mechanism, GAO-Tracker effectively captures long-range dependencies and global contexts, which are crucial for accurately tracking multiple objects in aerial videos.
(2) Robustness to occlusions and scale variations: The joint detection and tracking methods integrated into GAO-Tracker enable it to handle occlusions and significant scale variations efficiently, ensuring continuous and reliable tracking.
(3) Practical solutions: GAO-Tracker provides practical solutions to real-world multi-object tracking problems, making it highly applicable in various domains such as surveillance, rescue operations, and urban planning.

5.3. Limitations

(1) Computational complexity: Despite its accuracy and robustness, the computational expense associated with the transformer model’s fine-grained self-attention mechanism remains a challenge. This could potentially limit the real-time applicability of GAO-Tracker in resource-constrained environments.
(2) Scalability: While GAO-Tracker performs well on benchmark datasets, its scalability to handle extremely large-scale datasets or highly crowded scenes requires further exploration and optimization.

5.4. Future Directions

To further enhance the performance and applicability of GAO-Tracker, several future research directions are proposed:
(1) Algorithm optimization: Efforts will focus on optimizing the algorithm to reduce computational complexity and improve real-time performance. This includes exploring more efficient implementations of the transformer model and refining the integration with detection and tracking components.
(2) Broader application areas: Extending the research findings to benefit more diverse fields is a key future direction. Improvements in drone-based multi-object tracking can be adapted for use in autonomous driving, security systems, wildlife monitoring, and other domains requiring accurate and reliable tracking of multiple objects.
(3) Handling complex scenarios: Further research is needed to enhance GAO-Tracker’s performance in highly dynamic and crowded environments. This includes developing methods to better handle dense object interactions and rapidly changing scenes.
(4) Long-term tracking: Enhancing the system’s ability to maintain long-term tracking stability and accuracy, particularly in scenarios with frequent object disappearances and reappearances, is another important area for future work.

6. Conclusions

This paper integrates the strengths of joint detection-and-tracking algorithms with those of transformer-based visual multi-object tracking to improve multi-object tracking performance in drone aerial videos. Building on this combination, we propose a more comprehensive, robust, and efficient integrated multi-object tracking algorithm that explicitly models object motion information.
By leveraging the advanced capabilities of transformer models to capture global contexts and the strengths of joint detection and tracking methods in handling occlusions and scale variations, our approach addresses the unique challenges posed by drone aerial videos, such as rapid object motion, complex environmental conditions, and data noise. This integration allows for more accurate and reliable tracking of multiple objects, enhancing the overall performance and robustness of tracking systems in various real-world scenarios.
GAO-Tracker achieves a series of new results in drone aerial multi-object tracking, performing strongly on the VisDrone and UAVDT datasets, two widely used benchmarks in the field. On both, our method significantly outperforms existing state-of-the-art methods in terms of accuracy and robustness, indicating its strong potential for practical applications in surveillance, rescue operations, agriculture, and urban planning, among others.
The practical solutions provided by GAO-Tracker to multi-object tracking problems in real-world scenarios offer new ideas and methods for the development of drone visual tracking. Our approach not only contributes to the current body of knowledge but also paves the way for future research in this area. In the future, efforts will focus on improving and optimizing algorithms to enhance multi-object tracking performance further. This includes refining the integration of detection and tracking components, enhancing the efficiency of the transformer model, and exploring new ways to handle challenging scenarios such as crowded environments and dynamic backgrounds.
Additionally, endeavors will be made to extend research findings to broader application areas to benefit more diverse fields. For instance, improvements in drone-based multi-object tracking can be adapted for use in autonomous driving, security systems, wildlife monitoring, and other areas where real-time, accurate tracking of multiple objects is critical. By expanding the applicability of our research, we aim to contribute to the advancement of technology across various domains, ultimately enhancing the capabilities and reliability of multi-object tracking systems.

Author Contributions

Conceptualization, Y.Y.; methodology, Y.Y., Y.W. and Y.L.; software, Y.P. and Y.L.; formal analysis, Y.W.; investigation, Y.W.; resources, Y.P.; data curation, L.Z., Y.P. and Y.L.; writing—original draft, Y.Y.; writing—review and editing, Y.W. and L.Z.; visualization, Y.L.; supervision, L.Z.; project administration, Y.P. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by Funding for Outstanding Doctoral Dissertation in NUAA under grant BCXJ24-10, the Postgraduate Research and Practice Innovation Program of Jiangsu Province under grant KYCX24_0583, the National Natural Science Foundation of China under grant 61573183, and the Natural Science Foundation of Shaanxi Province of China under grant 2024JC-YBQN-0695.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
MOT: Multiple object tracking
GAO: Gaussian, appearance, and optimal subpattern assignment
IOU: Intersection over union
OSPA: Optimal subpattern assignment
VGM-PHD: Visual Gaussian mixture probability hypothesis density
MOTA: Multiple object tracking accuracy
MOTP: Multiple object tracking precision
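
For completeness, the tracking accuracy and precision metrics reported in Tables 2–7 follow the standard CLEAR-MOT definitions (the conventional formulas, not equations reproduced from this paper):

\mathrm{MOTA} = 1 - \frac{\sum_t \left(\mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDSW}_t\right)}{\sum_t \mathrm{GT}_t}, \qquad \mathrm{MOTP} = \frac{\sum_{t,i} d_{t,i}}{\sum_t c_t},

where GT_t is the number of ground-truth objects in frame t, c_t is the number of matched track–object pairs in frame t, and d_{t,i} is the bounding-box overlap of the i-th match, so a higher MOTP indicates better localization, consistent with the ↑ direction used in the tables.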

References

  1. Wu, X.; Li, W.; Hong, D.; Tao, R.; Du, Q. Deep learning for unmanned aerial vehicle-based object detection and tracking: A survey. IEEE Geosci. Remote Sens. Mag. 2021, 10, 91–124. [Google Scholar] [CrossRef]
  2. Li, Y.; Zhang, H.; Yang, Y.; Liu, H.; Yuan, D. RISTrack: Learning Response Interference Suppression Correlation Filters for UAV Tracking. IEEE Geosci. Remote Sens. Lett. 2023, 20, 1–5. [Google Scholar] [CrossRef]
  3. Dai, M.; Hu, J.; Zhuang, J.; Zheng, E. A transformer-based feature segmentation and region alignment method for UAV-view geo-localization. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 4376–4389. [Google Scholar] [CrossRef]
  4. Yi, S.; Liu, X.; Li, J.; Chen, L. UAVformer: A composite transformer network for urban scene segmentation of UAV images. Pattern Recogn. 2023, 133, 109019. [Google Scholar] [CrossRef]
  5. Yongqiang, X.; Zhongbo, L.; Jin, Q.; Zhang, K.; Zhang, B.; Feng, Q. Optimal video communication strategy for intelligent video analysis in unmanned aerial vehicle applications. Chin. J. Aeronaut. 2020, 33, 2921–2929. [Google Scholar]
  6. Bochinski, E.; Eiselein, V.; Sikora, T. High-speed tracking-by-detection without using image information. In Proceedings of the 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Lecce, Italy, 29 August–1 September 2017; pp. 1–6. [Google Scholar]
  7. Chen, G.; Wang, W.; He, Z.; Wang, L.; Yuan, Y.; Zhang, D.; Zhang, J.; Zhu, P.; Van, G.; Han, J.; et al. VisDrone-MOT2021: The Vision Meets Drone Multiple Object Tracking Challenge Results. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 2839–2846. [Google Scholar]
  8. Bisio, I.; Garibotto, C.; Haleem, H.; Lavagetto, F.; Sciarrone, A. Vehicular/Non-Vehicular Multi-Class Multi-Object Tracking in Drone-based Aerial Scenes. IEEE Trans. Veh. Technol. 2023, 73, 4961–4977. [Google Scholar] [CrossRef]
  9. Lin, Y.; Wang, M.; Chen, W.; Gao, W.; Li, L.; Liu, Y. Multiple Object Tracking of Drone Videos by a Temporal-Association Network with Separated-Tasks Structure. Remote Sens. 2022, 14, 3862. [Google Scholar] [CrossRef]
  10. Al-Shakarji, N.; Bunyak, F.; Seetharaman, G.; Palaniappan, K. Multi-object tracking cascade with multi-step data association and occlusion handling. In Proceedings of the 2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Auckland, New Zealand, 27–30 November 2018; pp. 1–6. [Google Scholar]
  11. Yu, H.; Li, G.; Zhang, W.; Yao, H.; Huang, Q. Self-balance motion and appearance model for multi-object tracking in uav. In Proceedings of the 1st ACM International Conference on Multimedia in Asia, Beijing, China, 15–18 December 2019; pp. 1–6. [Google Scholar]
  12. Wang, Z.; Zheng, L.; Liu, Y.; Li, Y.; Wang, S. Towards real-time multi-object tracking. In Proceedings of the 16th European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 107–122. [Google Scholar]
  13. Wu, H.; Nie, J.; He, Z.; Zhu, Z.; Gao, M. One-shot multiple object tracking in UAV videos using task-specific fine-grained features. Remote Sens. 2022, 14, 3853. [Google Scholar] [CrossRef]
  14. Shi, L.; Zhang, Q.; Pan, B.; Zhang, J.; Su, Y. Global-Local and Occlusion Awareness Network for Object Tracking in UAVs. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 8834–8844. [Google Scholar] [CrossRef]
  15. Zhou, X.; Koltun, V.; Krähenbühl, P. Tracking objects as points. In Proceedings of the 16th European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 474–490. [Google Scholar]
  16. Peng, J.; Wang, C.; Wan, F.; Wu, Y.; Wang, Y.; Tai, Y.; Wang, C.; Li, J.; Huang, F.; Fu, Y. Chained-tracker: Chaining paired attentive regression results for end-to-end joint multiple-object detection and tracking. In Proceedings of the 16th European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 145–161. [Google Scholar]
  17. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the 16th European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 213–229. [Google Scholar]
  18. Tsai, C.; Shen, G.; Nisar, H. Swin-JDE: Joint detection and embedding multi-object tracking in crowded scenes based on swin-transformer. Eng. Appl. Artif. Intel. 2023, 119, 105770. [Google Scholar] [CrossRef]
  19. Hu, M.; Zhu, X.; Wang, H.; Cao, S.; Liu, C.; Song, Q. STDFormer: Spatial-Temporal Motion Transformer for Multiple Object Tracking. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 6571–6594. [Google Scholar] [CrossRef]
  20. Zeng, F.; Dong, B.; Zhang, Y.; Wang, T.; Zhang, X.; Wei, Y. Motr: End-to-end multiple-object tracking with transformer. In Proceedings of the 17th European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 659–675. [Google Scholar]
  21. Cai, J.; Xu, M.; Li, W.; Xiong, Y.; Xia, W.; Tu, Z.; Soatto, S. Memot: Multi-object tracking with memory. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8090–8100. [Google Scholar]
  22. Zhu, P.; Wen, L.; Du, D.; Bian, X.; Hu, Q.; Ling, H. Vision meets drones: Past, present and future. arXiv 2020, arXiv:2001.06303. [Google Scholar]
  23. Du, D.; Qi, Y.; Yu, H.; Yang, Y.; Duan, K.; Li, G.; Zhang, W.; Huang, Q.; Tian, Q. The unmanned aerial vehicle benchmark: Object detection and tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 370–386. [Google Scholar]
  24. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable detr: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159. [Google Scholar]
  25. Fang, Y.; Liao, B.; Wang, X.; Fang, J.; Qi, J.; Wu, R.; Niu, J.; Liu, W. You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection. Adv. Neural Inf. Process. Syst. 2021, 34, 26183–26197. [Google Scholar]
  26. Li, Y.; Mao, H.; Girshick, R.; He, K. Exploring plain vision transformer backbones for object detection. In Proceedings of the 17th European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 280–296. [Google Scholar]
  27. Ye, T.; Qin, W.; Zhao, Z.; Gao, X.; Deng, X.; Ouyang, Y. Real-Time Object Detection Network in UAV-Vision Based on CNN and Transformer. IEEE Trans. Instrum. Meas. 2023, 72, 2505713. [Google Scholar] [CrossRef]
  28. Li, F.; Zhang, H.; Liu, S.; Guo, J.; Ni, L.; Zhang, L. Dn-detr: Accelerate detr training by introducing query denoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 13619–13627. [Google Scholar]
  29. Wojke, N.; Bewley, A.; Paulus, D. Simple online and realtime tracking with a deep association metric. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 3645–3649. [Google Scholar]
  30. Zhang, Y.; Sun, P.; Jiang, Y.; Yu, D.; Weng, F.; Yuan, Z.; Luo, P.; Liu, W.; Wang, X. Bytetrack: Multi-object tracking by associating every detection box. In Proceedings of the 17th European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 1–21. [Google Scholar]
  31. Aharon, N.; Orfaig, R.; Bobrovsky, B. BoT-SORT: Robust associations multi-pedestrian tracking. arXiv 2022, arXiv:2206.14651. [Google Scholar]
  32. Liu, S.; Li, X.; Lu, H.; He, Y. Multi-Object Tracking Meets Moving UAV. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8876–8885. [Google Scholar]
  33. Deng, K.; Zhang, C.; Chen, Z.; Hu, W.; Li, B.; Lu, F. Jointing Recurrent Across-Channel and Spatial Attention for Multi-Object Tracking With Block-Erasing Data Augmentation. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 4054–4069. [Google Scholar] [CrossRef]
  34. Xiao, C.; Cao, Q.; Zhong, Y.; Lan, L.; Zhang, X.; Cai, H.; Luo, Z. Enhancing Online UAV Multi-Object Tracking with Temporal Context and Spatial Topological Relationships. Drones 2023, 7, 389. [Google Scholar] [CrossRef]
  35. Keawboontan, T.; Thammawichai, M. Toward Real-Time UAV Multi-Target Tracking Using Joint Detection and Tracking. IEEE Access 2023, 11, 65238–65254. [Google Scholar] [CrossRef]
  36. Li, J.; Ding, Y.; Wei, H.; Zhang, Y.; Lin, W. Simpletrack: Rethinking and improving the jde approach for multi-object tracking. Sensors 2022, 22, 5863. [Google Scholar] [CrossRef]
  37. Sun, P.; Cao, J.; Jiang, Y.; Zhang, R.; Xie, E.; Yuan, Z.; Wang, C.; Luo, P. Transtrack: Multiple object tracking with transformer. arXiv 2020, arXiv:2012.15460. [Google Scholar]
  38. Meinhardt, T.; Kirillov, A.; Leal-Taixe, L.; Feichtenhofer, C. Trackformer: Multi-object tracking with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8844–8854. [Google Scholar]
  39. Xu, Y.; Ban, Y.; Delorme, G.; Gan, C.; Rus, D.; Alameda-Pineda, X. TransCenter: Transformers with dense representations for multiple-object tracking. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 7820–7835. [Google Scholar] [CrossRef] [PubMed]
  40. Zhou, X.; Yin, T.; Koltun, V.; Krähenbühl, P. Global Tracking Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8771–8780. [Google Scholar]
  41. Chen, M.; Liao, Y.; Liu, S.; Wang, F.; Hwang, J. TR-MOT: Multi-Object Tracking by Reference. arXiv 2022, arXiv:2203.16621. [Google Scholar]
  42. Wu, H.; He, Z.; Gao, M. GCEVT: Learning Global Context Embedding for Vehicle Tracking in Unmanned Aerial Vehicle Videos. IEEE Geosci. Remote Sens. Lett. 2023, 20, 1–5. [Google Scholar] [CrossRef]
  43. Xu, X.; Feng, Z.; Cao, C.; Yu, C.; Li, M.; Wu, Z.; Ye, S.; Shang, Y. STN-Track: Multiobject Tracking of Unmanned Aerial Vehicles by Swin Transformer Neck and New Data Association Method. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 8734–8743. [Google Scholar] [CrossRef]
Figure 1. GAO-Tracker framework.
Figure 2. Holistic trans-detector.
Figure 3. Holistic transformer.
Figure 4. Holistic self-attention. We initially partition the feature map into 4 × 4 grids. While the central 4 × 4 grid serves as the query window, we extract tokens at three granularity levels of 1 × 1, 2 × 2, and 4 × 4, respectively, from surrounding regions to serve as its keys and values. This results in tokens with dimensions of 8 × 8, 6 × 6, and 5 × 5. Ultimately, these tokens from the three levels are concatenated to compute the keys and values for the 4 × 4 = 16 tokens (queries) within the window.
Figure 5. GAO trajectory matching module.
Figure 6. Appear-IOU trajectory matching.
Figure 7. Gau-IOU trajectory matching.
Figure 8. OSPA-IOU trajectory matching.
Figure 9. Comparison of detection results on the VisDrone dataset.
Figure 10. Comparison of detection results on the UAVDT dataset.
Figure 11. Tracking results of GAO-Tracker on the VisDrone dataset.
Figure 12. Tracking results of GAO-Tracker on the UAVDT dataset.
Table 1. The detection results of the detectors on the datasets.

Dataset | Detector | AP | AP@0.5 | AP@0.75 | APs | APm | APl
VisDrone | DETR [17] | 34.8 | 63.4 | 32.2 | 12.8 | 38.5 | 55.6
VisDrone | Deformable DETR [24] | 36.9 | 60.4 | 35.2 | 9.9 | 38.1 | 52.7
VisDrone | YOLOS [25] | 36.6 | 63.1 | 38.7 | 15.4 | 39.9 | 54.9
VisDrone | Swin-JDE [18] | 38.2 | 60.5 | 34.8 | 11.1 | 41.4 | 57.6
VisDrone | VitDet [26] | 38.9 | 64.7 | 38.7 | 19.6 | 40.5 | 57.8
VisDrone | RTD-Net [27] | 38.1 | 64.6 | 40.2 | 17.6 | 42.8 | 57.6
VisDrone | DN-DETR [28] | 39.4 | 63.4 | 36.5 | 16.8 | 42.5 | 59.2
VisDrone | Holistic Trans-Det | 39.6 | 67.9 | 40.8 | 18.6 | 40.3 | 59.4
UAVDT | DETR [17] | 48.8 | 69.3 | 49.3 | 28.0 | 47.5 | 57.1
UAVDT | Deformable DETR [24] | 47.2 | 69.2 | 50.3 | 29.0 | 53.2 | 59.4
UAVDT | YOLOS [25] | 49.3 | 71.1 | 51.4 | 32.3 | 50.4 | 58.9
UAVDT | Swin-JDE [18] | 49.6 | 69.9 | 52.8 | 33.9 | 54.8 | 59.7
UAVDT | VitDet [26] | 54.6 | 68.9 | 59.5 | 37.5 | 57.9 | 61.0
UAVDT | RTD-Net [27] | 52.2 | 71.4 | 55.6 | 36.3 | 57.2 | 60.9
UAVDT | DN-DETR [28] | 56.7 | 68.6 | 60.2 | 38.7 | 59.8 | 62.9
UAVDT | Holistic Trans-Det | 57.5 | 69.0 | 60.5 | 38.8 | 61.5 | 67.9
Table 2. Comparison between GAO-Tracker and the latest multiple trackers tested on the VisDrone dataset.

Type | Tracker | MOTA↑ | MOTP↑ | IDF1 (%)↑ | IDSW↓ | MT (%)↑ | ML (%)↑ | FP↓ | FN↓
Motion-based | DeepSORT [29] | 19.4 | 69.8 | 33.1 | 6387 | 38.8 | 52.2 | 15,181 | 44,830
Motion-based | ByteTrack [30] | 25.1 | 72.6 | 40.8 | 4590 | 42.8 | 50.3 | 10,722 | 24,376
Motion-based | BoT-SORT [31] | 23.0 | 71.6 | 41.4 | 7014 | 51.9 | 73.6 | 10,701 | 47,922
Motion-based | UAVMOT [32] | 25.0 | 72.3 | 40.5 | 6644 | 52.6 | 49.6 | 10,134 | 55,630
Motion-based | DCMOT [33] | 33.5 | 76.1 | 45.5 | 1139 | - | - | 12,594 | 64,856
Motion-based | TFAM [34] | 30.9 | 74.4 | 42.7 | 3998 | - | - | 27,732 | 126,811
Motion-based | MTTJDT [35] | 31.2 | 73.2 | 43.6 | 2415 | - | - | 25,976 | 183,381
Transformer-based | TransTrack [37] | 27.3 | 62.1 | 28.3 | 2523 | 33.5 | 59.7 | 15,028 | 51,396
Transformer-based | TrackFormer [38] | 24 | 77.3 | 38 | 4724 | 39 | 46.3 | 11,731 | 32,807
Transformer-based | TransCenter [39] | 29.9 | 66.6 | 46.8 | 3446 | 33.4 | 61.8 | 15,104 | 20,894
Transformer-based | MOTR [20] | 13.1 | 72.4 | 47.1 | 2997 | 52.9 | 72 | 12,216 | 42,186
Transformer-based | MeMOT [21] | 29.4 | 73 | 48.7 | 3755 | 46.7 | 47.9 | 9963 | 30,062
Transformer-based | GTR [40] | 28.1 | 76.8 | 54.5 | 2000 | 61.3 | 57.6 | 8165 | 10,553
Transformer-based | TR-MOT [41] | 29.9 | 64.3 | 46 | 1005 | 42.8 | 59.9 | 7593 | 17,352
Transformer-based | GCEVT [42] | 34.5 | 73.8 | 50.6 | 841 | 520 | 612 | - | -
Transformer-based | STN-Track [43] | 38.6 | - | 73.7 | 668 | 31.4 | 51.2 | 7385 | 76,006
Transformer-based | STDFormer [19] | 35.9 | 74.5 | 59.9 | 1441 | 52.7 | 60.3 | 8527 | 20,558
Transformer-based | GAO-Tracker | 38.8 | 76.3 | 54.3 | 972 | 55.9 | 52.4 | 6883 | 10,204
Table 3. Comparison between GAO-Tracker and the latest multiple trackers tested on the UAVDT dataset.

Type | Tracker | MOTA↑ | MOTP↑ | IDF1 (%)↑ | IDSW↓ | MT (%)↑ | ML (%)↑ | FP↓ | FN↓
Motion-based | DeepSORT [29] | 35.9 | 71.5 | 58.3 | 698 | 43.4 | 25.7 | 50,513 | 59,733
Motion-based | ByteTrack [30] | 39.1 | 74.3 | 44.7 | 2341 | 43.8 | 28.1 | 14,468 | 87,485
Motion-based | BoT-SORT [31] | 37.2 | 72.1 | 53.1 | 1692 | 40.8 | 27.3 | 42,286 | 64,494
Motion-based | UAVMOT [32] | 43.0 | 73.5 | 61.5 | 641 | 45.3 | 22.7 | 27,832 | 65,467
Motion-based | SimpleTrack [36] | 45.3 | 73.9 | 57.1 | 1404 | 43.6 | 22.5 | 21,153 | 53,448
Motion-based | TFAM [34] | 47.0 | 72.9 | 67.8 | 506 | - | - | 68,282 | 111,959
Transformer-based | TransTrack [37] | 33.2 | 72.4 | 67.6 | 1122 | 38.9 | 23.8 | 50,746 | 54,938
Transformer-based | TrackFormer [38] | 53.4 | 74.2 | 46.3 | 2247 | 43.7 | 23.3 | 13,719 | 91,061
Transformer-based | TransCenter [39] | 48.9 | 73.9 | 51.3 | 2287 | 32.6 | 35.1 | 27,995 | 93,013
Transformer-based | MOTR [20] | 35.6 | 72.5 | 56.1 | 1759 | 39.8 | 29.3 | 39,733 | 56,368
Transformer-based | MeMOT [21] | 45.6 | 74.6 | 62.8 | 2118 | 34.9 | 26.5 | 38,933 | 59,156
Transformer-based | GTR [40] | 46.5 | 75.3 | 61.1 | 1482 | 42.7 | 18.6 | 21,676 | 52,617
Transformer-based | TR-MOT [41] | 57.7 | 74.1 | 55.7 | 2461 | 33.9 | 21.3 | 32,217 | 50,838
Transformer-based | GCEVT [42] | 47.6 | 73.4 | 68.6 | 1801 | 618 | 363 | - | -
Transformer-based | STN-Track [43] | 60.6 | - | 73.1 | 1420 | 57.0 | 17.0 | 12,825 | 61,760
Transformer-based | STDFormer [19] | 60.6 | 74.8 | 61.7 | 1642 | 44.6 | 20.3 | 20,258 | 41,895
Transformer-based | GAO-Tracker | 61.7 | 75.2 | 67.9 | 1216 | 45.3 | 24.6 | 24,915 | 59,640
Table 4. Performance evaluation of the proposed GAO-Tracker model combined with different backbone networks.

Dataset | Detector Backbone | MOTA↑ | MOTP↑ | IDF1 (%)↑ | IDSW↓ | MT (%)↑ | ML (%)↑ | FP↓ | FN↓
VisDrone | ResNet-50 | 19.6 | 59.9 | 36.7 | 4287 | 35.3 | 31.3 | 9078 | 18,764
VisDrone | DLA-34 | 34.9 | 68.5 | 50.3 | 2198 | 46.3 | 43.5 | 8818 | 13,070
VisDrone | ViT | 35.2 | 69.7 | 51.0 | 2019 | 48.9 | 45.9 | 8009 | 12,897
VisDrone | Swin-L | 35.5 | 70.2 | 52.3 | 1509 | 51.9 | 47.6 | 6832 | 12,223
VisDrone | Holistic Trans | 38.8 | 76.3 | 54.3 | 972 | 55.9 | 52.4 | 6883 | 10,204
UAVDT | ResNet-50 | 56.2 | 70.3 | 62.1 | 2252 | 40.4 | 22.6 | 32,743 | 72,629
UAVDT | DLA-34 | 61.9 | 75.1 | 66.4 | 1798 | 42.4 | 23.4 | 28,705 | 65,616
UAVDT | ViT | 60.1 | 74.0 | 65.9 | 1504 | 42.8 | 23.7 | 26,937 | 62,348
UAVDT | Swin-L | 59.6 | 74.4 | 66.0 | 1264 | 43.9 | 23.8 | 25,822 | 61,324
UAVDT | Holistic Trans | 61.7 | 75.2 | 67.9 | 1216 | 45.3 | 24.6 | 24,915 | 59,640
Table 5. Comparison between detection and classification with or without preprocessing.

Dataset | Method | MOTA↑ | MOTP↑ | IDF1 (%)↑ | IDSW↓ | MT (%)↑ | ML (%)↑ | FP↓ | FN↓
VisDrone | Baseline | 36.2 | 70.9 | 52.5 | 1344 | 53.1 | 49.3 | 9117 | 11,987
VisDrone | B+Pre | 37.6 | 71.2 | 52.8 | 1320 | 54.3 | 50.1 | 9135 | 11,499
VisDrone | B+Grade | 37.3 | 74.2 | 52.7 | 1138 | 54.7 | 51.2 | 9627 | 11,060
VisDrone | B+Pre+Grade | 38.8 | 76.3 | 54.3 | 972 | 55.9 | 52.4 | 6883 | 10,204
UAVDT | Baseline | 57.8 | 72.0 | 64.0 | 1841 | 42.4 | 23.3 | 29,057 | 67,373
UAVDT | B+Pre | 59.3 | 74.4 | 65.6 | 1398 | 43.8 | 23.8 | 25,836 | 62,429
UAVDT | B+Grade | 60.4 | 74.7 | 66.1 | 1221 | 44.5 | 23.9 | 25,418 | 60,828
UAVDT | B+Pre+Grade | 61.7 | 75.2 | 67.9 | 1216 | 45.3 | 24.6 | 24,915 | 59,640
Table 6. Comparison of different association strategies.

Dataset | Method | MOTA↑ | MOTP↑ | IDF1 (%)↑ | IDSW↓ | MT (%)↑ | ML (%)↑ | FP↓ | FN↓
VisDrone | Baseline | 36.2 | 70.9 | 52.5 | 1552 | 53.1 | 49.3 | 9117 | 11,987
VisDrone | B+Appear-IOU | 37.1 | 74.4 | 53.2 | 1334 | 53.2 | 49.2 | 9209 | 11,027
VisDrone | B+Appear-IOU+Gau-IOU | 38.3 | 75.6 | 53.9 | 1052 | 53.8 | 49.9 | 7343 | 10,946
VisDrone | B+Appear-IOU+Gau-IOU+OSPA-IOU | 38.8 | 76.3 | 54.3 | 972 | 55.9 | 52.4 | 6883 | 10,204
UAVDT | Baseline | 57.8 | 72.0 | 64.0 | 1841 | 42.4 | 23.3 | 29,057 | 67,373
UAVDT | B+Appear-IOU | 58.2 | 73.8 | 64.9 | 1536 | 43.0 | 23.7 | 29,133 | 63,781
UAVDT | B+Appear-IOU+Gau-IOU | 60.9 | 74.9 | 66.3 | 1297 | 45.0 | 24.0 | 25,011 | 60,369
UAVDT | B+Appear-IOU+Gau-IOU+OSPA-IOU | 61.7 | 75.2 | 67.9 | 1216 | 45.3 | 24.6 | 24,915 | 59,640
Table 7. Comparison with and without trajectory prediction.

Dataset | Method | MOTA↑ | MOTP↑ | IDF1 (%)↑ | IDSW↓ | MT (%)↑ | ML (%)↑ | FP↓ | FN↓
VisDrone | No trajectory prediction | 29.9 | 64.4 | 49.3 | 2497 | 42.8 | 42.8 | 8719 | 15,226
VisDrone | Kalman Filter | 35.3 | 69.9 | 50.6 | 1727 | 51.4 | 47.5 | 8998 | 12,302
VisDrone | VGM-PHD | 38.8 | 76.3 | 54.3 | 972 | 55.9 | 52.4 | 6883 | 10,204
UAVDT | No trajectory prediction | 43.1 | 61.2 | 46.4 | 4437 | 32.3 | 17.4 | 49,018 | 99,620
UAVDT | Kalman Filter | 52.8 | 68.0 | 56.1 | 3069 | 37.4 | 21.0 | 38,389 | 82,471
UAVDT | VGM-PHD | 61.7 | 75.2 | 67.9 | 1216 | 45.3 | 24.6 | 24,915 | 59,640
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
