Visual Object Tracking in First Person Vision
https://doi.org/10.1007/s11263-022-01694-6
Received: 3 December 2021 / Accepted: 22 September 2022 / Published online: 18 October 2022
© The Author(s) 2022
Abstract
The understanding of human-object interactions is fundamental in First Person Vision (FPV). Visual tracking algorithms which
follow the objects manipulated by the camera wearer can provide useful information to effectively model such interactions. In
recent years, the computer vision community has significantly improved the performance of tracking algorithms for a large
variety of target objects and scenarios. Despite a few previous attempts to exploit trackers in the FPV domain, a methodical
analysis of the performance of state-of-the-art trackers is still missing. This research gap raises the question of whether current
solutions can be used “off-the-shelf” or more domain-specific investigations should be carried out. This paper aims to provide
answers to such questions. We present the first systematic investigation of single object tracking in FPV. Our study extensively
analyses the performance of 42 algorithms including generic object trackers and baseline FPV-specific trackers. The analysis
is carried out by focusing on different aspects of the FPV setting, introducing new performance measures, and in relation to
FPV-specific tasks. The study is made possible through the introduction of TREK-150, a novel benchmark dataset composed
of 150 densely annotated video sequences. Our results show that object tracking in FPV poses new challenges to current visual
trackers. We highlight the factors causing such behavior and point out possible research directions. Despite these difficulties,
we prove that trackers bring benefits to FPV downstream tasks requiring short-term object tracking. We expect that generic
object tracking will gain popularity in FPV as new and FPV-specific methodologies are investigated.
Keywords First person vision · Egocentric vision · Visual object tracking · Single object tracking
Communicated by Yoichi Sato.

Matteo Dunnhofer (corresponding author)
matteo.dunnhofer@uniud.it
Antonino Furnari
furnari@dmi.unict.it
Giovanni Maria Farinella
gfarinella@dmi.unict.it
Christian Micheloni
christian.micheloni@uniud.it

1 Machine Learning and Perception Lab, University of Udine, Via delle Scienze 206, 33100 Udine, Italy
2 Image Processing Laboratory, University of Catania, Viale A. Doria 6, 95125 Catania, Italy

1 Introduction

First Person Vision (FPV) refers to the study and development of computer vision techniques considering images and videos acquired from a camera mounted on the head of a person—which is referred to as the camera wearer. This setting allows machines to perceive the surrounding environment from a point of view that is the most similar to the one of human beings. In the FPV domain, understanding the interactions between a camera wearer and the surrounding objects is a fundamental problem (Bertasius et al., 2017a, 2017b; Cao et al., 2020; Cai et al., 2016; Damen et al., 2018; Damen et al., 2016; Furnari & Farinella 2020; Grauman 2022; Liu et al., 2020; Ragusa et al., 2020; Wang et al., 2020). To model such interactions, the continuous knowledge of where an object of interest is located inside the video frame is advantageous. Indeed, keeping track of object locations over time allows to understand which objects are moving, which of them are passively captured while not interacted, and how the user relates to the scene.

The benefits of tracking in FPV have been explored by a few previous works in the literature. For example, visual trackers have been exploited in solutions to comprehend social interactions through faces (Aghaei et al., 2016a, 2016b; Grauman et al., 2022), to improve the performance of hand detection for rehabilitation purposes (Visee et al., 2020), to capture hand movements for action recognition (Kapidis et al., 2019), and to forecast human-object interactions through the analysis of hand trajectories (Liu et al., 2020).
Such applications have been made possible through the development of customized tracking approaches to track specific target categories like people (Alletto et al., 2015; Nigam & Rameshan, 2017), people's faces (Aghaei et al., 2016a; Grauman et al., 2022), or hands (Kapidis et al., 2019; Han et al., 2020; Liu et al., 2020; Mueller et al., 2017; Sun et al., 2010; Visee et al., 2020) from a first person perspective.

Despite the aforementioned attempts to leverage tracking in egocentric vision pipelines, the standard approach to generic-object continuous localisation in FPV tasks still relies on detection models that evaluate video frames independently (Damen et al., 2018, 2021; Furnari & Farinella, 2020; Ma et al., 2016; Rodin et al., 2021; Sener et al., 2020; Wang et al., 2020; Wu et al., 2019). This paradigm has the drawback of ignoring all the temporal information coming from the object appearance and motion contained in consecutive video frames. Also, it generally requires a higher computational cost due to the need to repeat the detection process in every frame. In contrast, visual object tracking aims to exploit past information about the target to infer its position and shape in the next frames of a video (Maggio & Cavallaro, 2011; Smeulders et al., 2014). This process can improve the efficiency of algorithmic pipelines because of the reduced computational resources needed, but most importantly because it allows to maintain the spatial and temporal reference to specific object instances.

Visually tracking a generic object in an automatic way introduces several different challenges that include occlusions, pose or scale changes, appearance variations, and fast motion. The computer vision community has made significant progress in the development of algorithms capable of tracking arbitrary objects in unconstrained scenarios affected by those issues. The advancements have been possible thanks to the development of new and effective tracking principles (Bolme et al., 2010; Bertinetto et al., 2016b; Bhat et al., 2019; Dai et al., 2020; Danelljan et al., 2017a; Henriques et al., 2015; Guo et al., 2021; Zhang et al., 2020; Yan et al., 2021), and to the careful design of benchmark datasets (Fan et al., 2019; Galoogahi et al., 2017; Huang et al., 2019; Li et al., 2016; Mueller et al., 2016; Wu et al., 2015) and competitions (Kristan et al., 2017, 2019, 2020, 2021) that well represent the aforementioned challenging situations. However, all these research endeavours have taken into account mainly the classic third person scenario in which the target objects are passively observed from an external point of view and where they do not interact with the camera wearer. It is a matter of fact that the nature of images and videos acquired from the first person viewpoint is inherently different from the type of image captured from video cameras set on an external point of view. As we will show in this paper, the particular characteristics of FPV, such as the interaction between the camera wearer and the objects as well as the proximity of the scene and the camera's point of view, cause the aforementioned challenges to occur with a different nature and distribution, resulting in persistent occlusions, significant scale and state changes of objects, as well as an increased presence of motion blur and fast motion (see Fig. 1).

Fig. 1 In this paper, we study the problem of visual object tracking in the context of FPV. To achieve such a goal, we introduce a new benchmark dataset named TREK-150, of which some qualitative examples of sequences are represented in this Figure. In each frame, the white rectangle represents the ground-truth bounding box of the target object. The orange and yellow boxes localize left and right hands respectively (plain lines indicate the interaction between the hand and the target). Each number in the top left corner reports the frame index. For each sequence, the action performed by the camera wearer is also reported (verb in orange, noun in blue). As can be noted, objects undergo significant appearance and state changes due to the manipulation by the camera wearer, which makes the proposed setting challenging for current trackers.
While the use cases of object tracking in egocentric vision are manifold and the benefit of tracking generic objects is clear as previously discussed, it is evident that visual object tracking is still not a dominant technology in FPV. Only very recent FPV pipelines are starting to employ generic object trackers (Grauman et al., 2022; Rai et al., 2021), but a solution specifically designed to track generic objects in first person videos is still missing. We think this lack of interest towards visual object tracking in FPV is mainly due to the limited amount of knowledge present in the literature about the capabilities of current visual object trackers in FPV videos. Indeed, this gap in the research opens many questions about the impact of the first person viewpoint on visual trackers: can the trackers available nowadays be used "off-the-shelf"? How does FPV impact current methodologies? Which tracking approaches work better in FPV scenarios? What factors influence the most the tracking performance? What is the contribution of trackers in FPV? We believe that the particular setting offered by FPV deserves a dedicated analysis that is still missing in the literature, and we argue that further research on this problem cannot be pursued without a thorough study on the impact of FPV on tracking.

In this paper, we aim to extensively analyze the problem of visual object tracking in the FPV domain in order to answer the aforementioned questions. Given the lack of suitable benchmarks, we follow the standard practice of the visual tracking community that suggests to build a curated dataset for evaluation (Galoogahi et al., 2017; Kristan et al., 2019; Liang et al., 2015; Li et al., 2016; Lukezic et al., 2019; Mueller et al., 2016; Wu et al., 2015). Hence, we propose a novel visual tracking benchmark, TREK-150 (TRacking-Epic-Kitchens-150), which is obtained from the large and challenging FPV dataset EPIC-KITCHENS (EK) (Damen et al., 2018, 2021). TREK-150 provides 150 video sequences which we densely annotated with the bounding boxes of a single target object the camera wearer interacts with. The dense localization of the person's hands and the interaction state between those and the target are also provided. Additionally, each sequence has been labeled with attributes that identify the visual changes the object is undergoing, the class of the target object, as well as the action he/she is performing. By exploiting the dataset, we present an extensive and in-depth study of the accuracy and speed performance of 38 established generic object trackers and of 4 newly introduced baseline FPV trackers. We leverage standard evaluation protocols and metrics and propose new ones. This is done in order to evaluate the capabilities of the trackers in relation to specific FPV scenarios. Furthermore, we assess the trackers' performance by evaluating their impact on the FPV-specific downstream task of human-object interaction detection.

In sum, the main contribution of this manuscript is the first systematic analysis of visual object tracking in FPV. In addition to that, our study brings additional innovations:

(i) The description and release of the new TREK-150 dataset, which offers new challenges and complementary features with respect to existing visual tracking benchmarks;
(ii) A new measure to assess the tracker's ability to maintain temporal reference to targets;
(iii) A protocol to evaluate the performance of trackers with respect to a downstream task;
(iv) Four FPV baseline trackers, two based on FPV object detectors and two combining such detectors with a state-of-the-art generic object tracker.

Our results show that FPV offers new and challenging tracking scenarios for the most recent and accurate trackers (Dai et al., 2020; Danelljan et al., 2019, 2017a; Song et al., 2018; Wang et al., 2021) and even for FPV trackers. We study the factors causing such performance and highlight possible future research directions. Despite the difficulties introduced by FPV, we prove that trackers bring benefits to FPV downstream tasks requiring short-term object tracking such as hand-object interaction. Given our results and considering the potential impact in FPV, we expect that generic object tracking will gain popularity in this domain as new and FPV-specific methodologies are investigated.1

1 Annotations, trackers' results, and code are available at https://machinelearning.uniud.it/datasets/trek150/.

2 Related Work

2.1 Visual Tracking in FPV

There have been some attempts to tackle visual tracking in FPV. Alletto et al. (2015) improved the TLD tracker (Kalal et al., 2012) with a 3D odometry-based module to track people. For a similar task, Nigam and Rameshan (2017) proposed EgoTracker, a combination of the Struck (Hare et al., 2016) and MEEM (Zhang et al., 2014) trackers with a person re-identification module. Face tracking was tackled by Aghaei et al. (2016a) through a multi-object tracking approach termed extended-bag-of-tracklets. Hand tracking was studied in several works (Han et al., 2020; Kapidis et al., 2019; Mueller et al., 2017; Visee et al., 2020; Sun et al., 2010). Sun et al. (2010) developed a particle filter framework for hand pose tracking. Mueller et al. (2017) instead proposed a solution based on an RGB camera and a depth sensor, while Kapidis et al. (2019) and Visee et al. (2020) combined the YOLO (Redmon et al., 2016) detector trained for hand detection with a visual tracker.
The former work used the multi-object tracker DeepSORT (Wojke et al., 2018), whereas the latter employed the KCF (Henriques et al., 2015) single object tracker. Han et al. (2020) exploited a detection-by-tracking approach on video frames acquired with 4 fisheye cameras.

All the aforementioned solutions focused on tracking specific targets (i.e., people, faces, or hands), and thus they are likely to fail in generalizing to arbitrary target objects. Moreover, they have been validated on custom designed datasets, which limits the reproducibility of the works and the ability to compare them to other solutions. In contrast, we focus on the evaluation of algorithms for the generic object tracking task. We design our evaluation to be reproducible and extendable by releasing TREK-150, a set of 150 videos of different objects, which we believe will be useful to study object tracking in FPV. To the best of our knowledge, ours is the first attempt to evaluate systematically and in-depth generic object tracking in FPV.

2.2 Visual Tracking for Generic Settings

In recent years, there has been an increased interest in developing accurate and robust tracking algorithms for generic objects and domains. Preliminary trackers were based on mean shift algorithms (Comaniciu et al., 2000), key-point (Maresca & Petrosino, 2013), part-based methods (Čehovin et al., 2013; Nam et al., 2014), or SVM (Hare et al., 2016) and incremental (Ross et al., 2008) learning. Later, solutions based on correlation filters gained popularity thanks to their processing speed (Bolme et al., 2010; Bertinetto et al., 2016a; Danelljan et al., 2017b; Henriques et al., 2015; Kiani Galoogahi et al., 2017). More recently, algorithms based on deep learning have been proposed to extract efficient image and object features. This kind of representation has been used in deep regression networks (Dunnhofer et al., 2021; Held et al., 2016), online tracking-by-detection methods (Nam & Han, 2016; Song et al., 2018), approaches based on reinforcement learning (Dunnhofer et al., 2019; Yun et al., 2017), deep discriminative correlation filters (Bhat et al., 2019, 2020; Danelljan et al., 2017a, 2019, 2020; Lukežič et al., 2020), trackers based on siamese networks (Bertinetto et al., 2016b; Guo et al., 2021; Li et al., 2019; Wang et al., 2019; Zhang et al., 2020), and more recently in trackers built up on transformer architectures (Chen et al., 2021; Wang et al., 2021; Yan et al., 2021). All these methods have been designed for tracking arbitrary target objects in unconstrained domains. However, no solution has been studied and validated on a number of diverse FPV sequences as we propose in this paper.

Different datasets are currently available in the FPV community for the study of particular tasks. The CMU dataset (De la Torre et al., 2009) was introduced for studying the recognition of the actions performed by the camera wearer. Videos belonging to this dataset are annotated with labels expressing only the actions performed (up to 31) by the person, and they comprise around 200K frames. The EGTEA Gaze+ dataset (Li et al., 2018) extended the FPV scenarios represented in the previous dataset by providing 2.4 M frames. Similarly as (De la Torre et al., 2009), only labels for the actions performed by the camera wearer have been associated to the videos. In addition to the action labels, the ADL dataset (Pirsiavash & Ramanan, 2012) introduced around 137K annotations in the form of bounding boxes for the localization of the objects involved in the actions. Other than for the action recognition task, the MECCANO dataset (Ragusa et al., 2020) was aimed to study active object detection and recognition as well as hand-object interaction. The dataset is designed to represent an industrial-like scenario and provides 299K frames, 64K bounding-boxes, 60 action labels, and 20 object categories. The EPIC-KITCHENS dataset (Damen et al., 2018, 2021) is currently one of the largest and most representative datasets available for vision-based tasks based on an egocentric point of view. It is composed of 20 M frames and provides annotations for action recognition, action anticipation, and object detection.

Despite the extensive amount of labels for different FPV tasks, all the aforementioned datasets (Damen et al., 2018, 2021; Pirsiavash & Ramanan, 2012; Ragusa et al., 2020) do not offer annotations to study object tracking. This is because the available bounding boxes for the localization of objects are not relative to the specific instances of the objects but only to their categories. Such kind of annotations does not allow to distinguish different objects of the same category when these appear together in the images. Furthermore, such datasets provide only sparse annotations (typically at 1/2 FPS) and they do not provide tracking-specific annotations (Müller et al., 2018; Kristan et al., 2017; Wu et al., 2015). Hence, they cannot be used for an accurate and in-depth evaluation of trackers in FPV. To the best of our knowledge, our proposed TREK-150 dataset is the first tool that provides the chance of studying in-depth the visual object tracking task in the context of first-person viewpoint egocentric videos. In addition, with the release of dense annotations for the position of the camera wearer's hands, for the state of interaction between hands and the target object, and for the action performed by the camera wearer, TREK-150 is suitable to analyze the visual tracking task in relation to all those FPV-specific tasks that require continuous and dense object localization (e.g. human-object interaction).

Disparate bounding-box level benchmarks are available today to evaluate the performance of single-object visual tracking algorithms.
The Object Tracking Benchmarks (OTB) OTB-50 (Wu et al., 2013) and OTB-100 (Wu et al., 2015) are two of the most popular benchmarks in the visual tracking community. They provide 51 and 100 sequences respectively, including generic target objects like vehicles, people, faces, toys, characters, etc. The Temple-Color 128 (TC-128) dataset (Liang et al., 2015) comprises 128 videos that were acquired for the evaluation of color-enhanced trackers. The UAV123 dataset (Mueller et al., 2016) was constructed to benchmark the tracking progress on videos captured by unmanned aerial vehicle (UAV) cameras. The 123 videos included in this benchmark represent 9 different classes of target. The NUS-PRO dataset (Li et al., 2016) contains 365 sequences and aims to benchmark human and rigid object tracking with targets belonging to one of 8 categories. The Need for Speed (NfS) dataset (Galoogahi et al., 2017) provides 100 sequences with a frame rate of 240 FPS. The aim of the authors was to benchmark the effects of frame rate variations on the tracking performance. The VOT2019 benchmark (Kristan et al., 2019) was the last iteration of the annual Visual Object Tracking challenge that required bounding-boxes as target object representation. This dataset contains 60 highly challenging videos, with generic target objects belonging to 30 different categories. The Color and Depth Tracking Benchmark (CDTB) dataset (Lukezic et al., 2019) offers 80 RGB sequences paired with a depth channel. This benchmark aims to explore the use of depth information to improve tracking. The Transparent Object Tracking Benchmark (TOTB) (Fan et al., 2021) provides 225 videos of transparent target objects, and has been introduced to study the robustness of trackers to the particular appearance of such kind of objects.

Following the increased development of deep learning-based trackers, large-scale generic-domain tracking datasets have been recently released (Müller et al., 2018; Huang et al., 2019; Fan et al., 2021). These include more than a thousand videos normally split into training and test subsets. The evaluation protocol associated with these sets requires the evaluation of the trackers after they have been trained on the provided training set.

Even though all the presented benchmarks offer various tracking scenarios, and some of them may include videos acquired from a first person point of view, none was specifically designed for tracking in FPV. Moreover, since in this paper we aim to benchmark the performance of visual object trackers regardless of their approach, we follow the practice of previous works (Fan et al., 2021; Galoogahi et al., 2017; Kristan et al., 2019; Li et al., 2016; Liang et al., 2015; Lukezic et al., 2019; Mueller et al., 2016; Wu et al., 2015) and set up a well representative and described dataset for evaluation. We believe that TREK-150 is useful for the tracking community because it offers different tracking situations and new target object categories that are not present in other tracking benchmarks.

3 The TREK-150 Benchmark

In this section, we describe TREK-150, the novel dataset proposed for the study of the visual object tracking task in FPV. TREK-150 is composed of 150 video sequences. In each sequence, a single target object is labeled with a bounding box which encloses the appearance of the object in each frame in which the object is visible (as a whole or in part). Every sequence is additionally labeled with attributes describing the visual variability of the target and the scene in the sequence. To study the performance of trackers in the setting of human-object interaction, we provide bounding box localization of hands and labels for their state of interaction with the target object. Moreover, two additional verb and noun attributes are provided to indicate the action performed by the person and the class of the target, respectively. Some qualitative examples of the video sequences with the relative annotations are shown in Fig. 1. Table 1 reports key statistics of our dataset in comparison with existing tracker evaluation benchmarks. It is worth noticing that the proposed dataset is competitive in terms of size with respect to the evaluation benchmarks available in the visual (single) object tracking community. We remark that TREK-150 has been designed for the evaluation of visual tracking algorithms in FPV regardless of their methodology. Indeed, in this paper, we do not aim to provide a large-scale dataset for the development of deep learning-based trackers. Instead, our goal is to assess the impact of the first-person viewpoint on current trackers. To achieve this goal we follow the standard practice in the visual object tracking community (Fan et al., 2021; Galoogahi et al., 2017; Kristan et al., 2019; Liang et al., 2015; Li et al., 2016; Lukezic et al., 2019; Mueller et al., 2016; Wu et al., 2015) that suggests to set up a small but well described dataset to benchmark the tracking progress.

3.1 Data Collection

3.1.1 Video Collection

The videos contained in TREK-150 have been sampled from EK (Damen et al., 2018, 2021), which is a public, large-scale, and diverse dataset of egocentric videos focused on human-object interactions in kitchens. This is currently one of the largest datasets for understanding human-object interactions in FPV. Thanks to its dimension, EK provides a significant amount of diverse interaction situations between various people and several different types of objects. Hence, it allows us to select suitable disparate tracking sequences that reflect the common scenarios tackled in FPV tasks. EK offers videos annotated with the actions performed by the camera wearer in the form of temporal bounds and verb-noun labels. The subset of EK known as EK-55 (Damen et al., 2018) also contains sparse bounding box references of manipulated objects annotated at 2 frames per second in a temporal window around each action.
Table 1 Statistics of the proposed TREK-150 benchmark compared with other benchmarks designed for single visual object tracking evaluation

Benchmark | # Videos | # Frames | Min frames | Mean frames | Median frames | Max frames | Frame rate | # Target object classes | # Sequence attributes
OTB-50 (Wu et al., 2013) | 51 | 29K | 71 | 578 | 392 | 3872 | 30 FPS | 10 | 11
OTB-100 (Wu et al., 2015) | 100 | 59K | 71 | 590 | 393 | 3872 | 30 FPS | 16 | 11
TC-128 (Liang et al., 2015) | 128 | 55K | 71 | 429 | 365 | 3872 | 30 FPS | 27 | 11
UAV123 (Mueller et al., 2016) | 123 | 113K | 109 | 915 | 882 | 3085 | 30 FPS | 9 | 12
NUS-PRO (Li et al., 2016) | 365 | 135K | 146 | 371 | 300 | 5040 | 30 FPS | 8 | 12
NfS (Galoogahi et al., 2017) | 100 | 383K | 169 | 3830 | 2448 | 20,665 | 240 FPS | 17 | 9
VOT2019 (Kristan et al., 2019) | 60 | 20K | 41 | 332 | 258 | 1500 | 30 FPS | 30 | 6
CDTB (Lukezic et al., 2019) | 80 | 102K | 406 | 1274 | 1179 | 2501 | 30 FPS | 23 | 13
TOTB (Fan et al., 2021) | 225 | 86K | 126 | 381 | 389 | 500 | 30 FPS | 15 | 12
GOT-10k* (Huang et al., 2019) | 180 | 23K | 51 | 127 | 100 | 920 | 10 FPS | 84 | 6
LaSOT* (Fan et al., 2019) | 280 | 685K | 1000 | 2448 | 2102 | 9999 | 30 FPS | 70 | 14
TREK-150 | 150 | 97K | 161 | 649 | 484 | 4640 | 60 FPS | 34 | 17

Target absent labels are not available in six of the compared benchmarks and are provided by TREK-150. Labels for the interaction with the target and a first person (FPV) acquisition setting are provided only by TREK-150. Action verb labels (20) are provided only by TREK-150. For the datasets marked with * we report the statistics of their test set.
We exploited such a feature to obtain a suitable pool of video sequences interesting for object tracking. Particularly, we cross-referenced the original verb-noun temporal annotations of EK-55 to the sparse bounding box labels. This allowed us to select sequences in which the camera wearer manipulates an object during an action. Each sequence is composed of the video frames contained within the temporal bounds of the action, extracted at the original 60 FPS frame rate and at the original full HD frame size (Damen et al., 2018, 2021). From the initial pool, we selected 150 video sequences which were characterized by attributes such as scale changes, partial/full occlusion and fast motion, which are commonly considered in standard tracking benchmarks (Fan et al., 2019; Kristan et al., 2019; Mueller et al., 2016; Müller et al., 2018; Wu et al., 2015). The top part of Table 2 reports the 13 attributes considered for the selection.

Table 2 Selected sequence attributes. The first block of rows describes attributes commonly used by the visual tracking community; the last four rows describe additional attributes introduced in this paper to characterize FPV tracking sequences.
SC (Scale change): the ratio of the bounding-box area of the first and the current frame is outside the range [0.5, 2]
ARC (Aspect ratio change): the ratio of the bounding-box aspect ratio of the first and the current frame is outside the range [0.5, 2]
IV (Illumination variation): the area of the target bounding-box is subject to light variation
SOB (Similar objects): there are objects in the video of the same object category or with similar appearance to the target
RIG (Rigid object): the target is a rigid object
DEF (Deformable object): the target is a deformable object
ROT (Rotation): the target rotates in the video
POC (Partial occlusion): the target is partially occluded in the video
FOC (Full occlusion): the target is fully occluded in the video
OUT (Out of view): the target completely leaves the video frame
MB (Motion blur): the target region is blurred due to target or camera motion
FM (Fast motion): the target bounding-box has a motion change larger than its size
LR (Low resolution): the area of the target bounding-box is less than 1000 pixels in at least one frame
HR (High resolution): the area of the target bounding-box is larger than 250,000 pixels in at least one frame
HM (Head motion): the person moves their head significantly thus causing camera motion
1H (1-Hand interaction): the person interacts with the target object with one hand for consecutive video frames
2H (2-Hands interaction): the person interacts with the target object with both hands for consecutive video frames

3.2 Data Labeling

3.2.1 Single Object Tracking

In this study, we restricted our analysis to the tracking of a single target object per video. This has been done because in the FPV scenario a person generally interacts through his/her hands with one object at a time (Damen et al., 2018, 2021). If a person interacts with two objects at the same time, those can still be tracked by two single object trackers. Moreover, focusing on a single object allows us to analyze better all the challenging and relevant factors that characterize the tracking problem in FPV. We believe that future work could investigate the employment of multiple object tracking (MOT) (Dendorfer et al., 2021; Luiten et al., 2021) solutions for a general understanding of the position and movement of all objects visible in the scene. We think the in-depth study presented in this paper will give useful insights for the development of such methods.

3.2.2 Frame-Level Annotations

After selection, the 150 sequences were associated to only 3000 bounding boxes, due to the sparse nature of the object annotations in EK-55. Since it has been shown that visual tracking benchmarks require dense and accurate box annotations (Fan et al., 2019; Kristan et al., 2019; Mueller et al., 2016; Valmadre et al., 2018), we re-annotated the bounding boxes of the target objects on the 150 sequences selected. Batches of sequences were delivered to annotators (21 subjects) who were instructed to perform the labeling. Such initial annotations were then carefully checked and refined by a PhD student, and finally revised by an early-stage researcher and by two professors. This process produced 97,296 frames labeled with bounding boxes related to the position and visual presence of objects the camera wearer is interacting with. Following the initial annotations of EK-55, we employed axis-aligned bounding boxes to localize the target objects. This design choice is supported by the fact that such a representation is largely used in many FPV pipelines (Furnari & Farinella, 2020; Furnari et al., 2017; Furnari & Farinella, 2019; Damen et al., 2018; Kapidis et al., 2019; Shan et al., 2020; Visee et al., 2020). Therefore, computing tracking metrics based on such representations allows us to correlate the results with those of object localization pipelines in FPV tasks, ultimately better highlighting the impact of trackers in such contexts. Also, the usage of more sophisticated target representations would have restricted our analysis since the majority of state-of-the-art trackers output just axis-aligned bounding boxes (Bertinetto et al., 2016a, b; Bhat et al., 2019, 2020; Bolme et al., 2010; Chen et al., 2020; Dai et al., 2020; Danelljan et al., 2017a, b, 2019, 2020; Fu et al., 2021; Guo et al., 2021; Held et al., 2016; Henriques et al., 2015; Huang et al., 2020; Kiani Galoogahi et al., 2017; Li et al., 2018, 2019; Nam & Han, 2016; Park & Berg, 2018; Song et al., 2018; Wang et al., 2018, 2021; Xu et al., 2020; Yan et al., 2019, 2021; Zhang & Peng, 2019; Zhang et al., 2020), and their recent progress on various benchmarks using such representation (Wu et al., 2015; Mueller et al., 2016; Galoogahi et al., 2017; Lukezic et al., 2019; Fan et al., 2021; Müller et al., 2018; Fan et al., 2019; Huang et al., 2019) proves that it provides sufficient information for tracker initialization and consistent and reliable performance evaluation. Moreover, we point out that many of the objects commonly appearing in FPV scenarios are difficult to annotate consistently with more sophisticated target representations. We remark that the proposed bounding boxes have been carefully and tightly drawn around the visible parts of the objects. Figure 13 of the supplementary document shows some examples of the quality of the bounding-box annotations of TREK-150 in contrast to the ones available in the popular OTB-100 tracking benchmark.
In addition to the bounding boxes for the object to be tracked, TREK-150 provides per-frame annotations of the location of the left and right hand of the camera wearer and of the state of interaction happening between each hand and the target object. Interaction annotations consist of labels expressing which hand of the camera wearer is currently in contact with the target object (e.g., we used the labels LHI, RHI, BHI to express whether the person is interacting with the target by her/his left or right hand or with both hands). We considered an interaction happening even in the presence of an object acting as a medium between the hand and the target. E.g., we considered the camera wearer to interact with a dish even if a sponge is in between her/his hand and the dish. The fourth row of Fig. 1 shows a visual example of these situations. These kinds of annotations have been obtained by the manual refinement (performed by the four aforementioned subjects) of the output given by the FPV hand-object interaction detector Hands-in-Contact (HiC) (Shan et al., 2020). In total, 166,883 hand bounding boxes (82,678 for the left hand, 84,205 for the right hand) and 77,993 interaction state labels (24,466 for interaction with left hand, 16,171 with right hand, 37,356 with both hands) are present in TREK-150.
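To make the structure of these per-frame labels concrete, the following Python sketch shows one possible way to organize a TREK-150 frame annotation. The `FrameAnnotation` class, its field names, and the example values are illustrative assumptions and do not correspond to the released annotation files.

```python
# Illustrative sketch (not the official TREK-150 file format) of the per-frame labels
# described above: target box, hand boxes, and hand-object interaction state.
from dataclasses import dataclass
from typing import Optional, Tuple

Box = Tuple[float, float, float, float]  # axis-aligned (x, y, width, height)

@dataclass
class FrameAnnotation:
    frame_index: int
    target: Optional[Box]       # None when the target is fully occluded or out of view
    left_hand: Optional[Box]    # camera wearer's left-hand box, if visible
    right_hand: Optional[Box]   # camera wearer's right-hand box, if visible
    interaction: Optional[str]  # "LHI", "RHI", "BHI", or None (no contact)

# Example: both hands in contact with the target object at frame 75 (values are made up).
ann = FrameAnnotation(75, (410.0, 220.5, 180.0, 95.0),
                      (300.0, 400.0, 120.0, 110.0),
                      (620.0, 390.0, 130.0, 115.0), "BHI")
```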
3.2.3 Sequence-Level Annotations

The sequences have been also labeled considering 17 attributes which define the motion and visual appearance changes the target object or the scene is subject to. These are used to analyze the performance of the trackers under different aspects that may influence their execution. The attributes employed in this study include 13 attributes used in standard tracking benchmarks (Fan et al., 2019; Müller et al., 2018; Wu et al., 2015), plus 4 additional new ones (High Resolution, Head Motion, 1-Hand Interaction, 2-Hands Interaction) which have been introduced in this paper to characterize sequences from FPV-specific points of view. The 17 attributes are defined in Table 2. Fig. 2a reports the distributions of the sequences with respect to the 17 attributes, while Fig. 2b compares the distributions of the most common attributes in the field in TREK-150 and in other well-known tracking benchmarks. Our dataset provides a larger number of sequences affected by partial occlusions (POC), changes in scale (SC) and/or aspect ratio (ARC), motion blur (MB), and illumination variation (IV). These peculiarities are due to the particular first person viewpoint and to the human-object interactions which affect the camera motion and the appearance of objects. Based on the verb-noun labels of EK, sequences were also associated to 20 verb labels (e.g., "wash"—see Fig. 1) and 34 noun labels indicating the category of the target object (e.g., "box"). Fig. 3a–b report the distributions of the videos with respect to verb and target object labels. As can be noted, our benchmark reflects the long-tail distribution of labels in EK (Damen et al., 2018).
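To illustrate how the box-based attributes of Table 2 translate into concrete checks, the sketch below evaluates the SC, ARC, FM, LR, and HR conditions on a sequence of axis-aligned (x, y, w, h) boxes. The thresholds follow the definitions in Table 2; interpreting the FM "motion change" as the center displacement between consecutive frames, and flagging a sequence when a condition holds in at least one frame, are our assumptions, and the function names are hypothetical.

```python
# Minimal sketch of the box-based attribute checks of Table 2 (assumed aggregation:
# an attribute is assigned if its condition holds in at least one annotated frame).
def area(b):
    return b[2] * b[3]

def aspect_ratio(b):
    return b[2] / b[3]

def center(b):
    return (b[0] + b[2] / 2.0, b[1] + b[3] / 2.0)

def sequence_attributes(boxes):
    """Return the subset of {SC, ARC, FM, LR, HR} triggered by a list of (x, y, w, h) boxes."""
    first, attrs = boxes[0], set()
    for prev, cur in zip(boxes[:-1], boxes[1:]):
        if not 0.5 <= area(cur) / area(first) <= 2:                  # SC: scale change w.r.t. first frame
            attrs.add("SC")
        if not 0.5 <= aspect_ratio(cur) / aspect_ratio(first) <= 2:  # ARC: aspect ratio change
            attrs.add("ARC")
        dx = center(cur)[0] - center(prev)[0]
        dy = center(cur)[1] - center(prev)[1]
        if (dx * dx + dy * dy) ** 0.5 > max(cur[2], cur[3]):         # FM: motion larger than box size (our reading)
            attrs.add("FM")
        if area(cur) < 1000:                                         # LR: low resolution
            attrs.add("LR")
        if area(cur) > 250_000:                                      # HR: high resolution
            attrs.add("HR")
    return attrs
```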
4 Trackers

4.1 Generic Object Trackers

Among the examined trackers, 38 have been selected to represent different popular approaches to generic-object visual tracking. Specifically, in the analysis we have included short-term trackers (Lukezic et al., 2020) based on both correlation filters with hand-crafted features (MOSSE (Bolme et al., 2010), DSST (Danelljan et al., 2017b), KCF (Henriques et al., 2015), Staple (Bertinetto et al., 2016a), BACF (Kiani Galoogahi et al., 2017), DCFNet (Wang et al., 2017), STRCF (Li et al., 2018), MCCTH (Wang et al., 2018)) and deep features (ECO (Danelljan et al., 2017a), ATOM (Danelljan et al., 2019), DiMP (Bhat et al., 2019), PrDiMP (Danelljan et al., 2020), KYS (Bhat et al., 2020), KeepTrack (Mayer et al., 2021)). We also considered deep siamese networks (SiamFC (Bertinetto et al., 2016b), GOTURN (Held et al., 2016), DSLT (Lu et al., 2018), SiamRPN++ (Li et al., 2019), SiamDW (Zhang & Peng, 2019), UpdateNet (Zhang et al., 2019), SiamFC++ (Xu et al., 2020), SiamBAN (Chen et al., 2020), Ocean (Zhang et al., 2020), SiamGAT (Guo et al., 2021), STMTrack (Fu et al., 2021)), tracking-by-detection methods (MDNet (Nam & Han, 2016), VITAL (Song et al., 2018)), as well as trackers based on target segmentation representations (SiamMask (Wang et al., 2019), D3S (Lukežič et al., 2020)), meta-learning (MetaCrest (Park & Berg, 2018)), fusion of trackers (TRASFUST (Dunnhofer et al., 2020)), neural architecture search (LightTrack (Yan et al., 2021)), and transformers (TrDiMP (Wang et al., 2021), TransT (Chen et al., 2021), STARK (Yan et al., 2021)). The long-term (Lukezic et al., 2020) trackers SPLT (Yan et al., 2019), GlobalTrack (Huang et al., 2020), and LTMU (Dai et al., 2020) have been also taken into account in the study. These kinds of trackers are designed to address longer target occlusion and out of view periods by exploiting an object re-detection module. All of the selected trackers are state-of-the-art approaches published between the years 2010 and 2021. Table 3 reports detailed information about the 38 considered generic-object trackers regarding the: venue and year of publication; type of image representation used; type of target matching strategy; employment of target model updates; and category of tracker according to the classification of (Lukezic et al., 2020). For each tracker, we used the code publicly available and adopted default parameters in order to have a fair comparison between the different tracking methodologies (i.e., to avoid comparisons between trackers specifically optimized for TREK-150 and non-optimized trackers). The original hyper-parameter values lead to the best and most likely generalizable instances of all the trackers. The code was run on a machine with an Intel Xeon E5-2690 v4 @ 2.60GHz CPU, 320 GB of RAM, and an NVIDIA TITAN V GPU.

Table 3 Characteristics of the generic object trackers considered in our evaluation (venue and year of publication, CNN backbone of the image representation when one is used, and offline training data)
Tracker | Venue | CNN backbone | Offline training data
MOSSE (Bolme et al., 2010) | CVPR 2010 | - | -
DSST (Danelljan et al., 2017b) | BMVC 2014 | - | -
KCF (Henriques et al., 2015) | TPAMI 2015 | - | -
MDNet (Nam & Han, 2016) | CVPR 2016 | VGG-M | I, O, IV
Staple (Bertinetto et al., 2016a) | CVPR 2016 | - | -
SiamFC (Bertinetto et al., 2016b) | ECCVW 2016 | AlexNet | G
GOTURN (Held et al., 2016) | ECCV 2016 | AlexNet | ID, A
ECO (Danelljan et al., 2017a) | CVPR 2017 | VGG-M | -
BACF (Kiani Galoogahi et al., 2017) | ICCV 2017 | - | -
DCFNet (Wang et al., 2017) | ArXiv 2017 | VGG-M | -
VITAL (Song et al., 2018) | CVPR 2018 | VGG-M | I, O, IV
STRCF (Li et al., 2018) | CVPR 2018 | - | -
MCCTH (Wang et al., 2018) | CVPR 2018 | - | -
DSLT (Lu et al., 2018) | ECCV 2018 | VGG-16 | ID, IV, C
MetaCrest (Park & Berg, 2018) | ECCV 2018 | VGG-M | I, ID, V
SiamRPN++ (Li et al., 2019) | CVPR 2019 | ResNet-50 | I, C, ID, IV, Y
SiamMask (Wang et al., 2019) | CVPR 2019 | ResNet-50 | I, ID, YV
SiamDW (Zhang & Peng, 2019) | CVPR 2019 | ResNet-22 | I, ID, Y
ATOM (Danelljan et al., 2019) | CVPR 2019 | ResNet-18 | I, C, L, T
DiMP (Bhat et al., 2019) | ICCV 2019 | ResNet-50 | I, C, L, T, G
SPLT (Yan et al., 2019) | ICCV 2019 | ResNet-50 | IV, ID
UpdateNet (Zhang et al., 2019) | ICCV 2019 | AlexNet | L
SiamFC++ (Xu et al., 2020) | AAAI 2020 | AlexNet | Y, ID, IV, C, G
GlobalTrack (Huang et al., 2020) | AAAI 2020 | ResNet-50 | C, G, L
PrDiMP (Danelljan et al., 2020) | CVPR 2020 | ResNet-50 | I, C, L, T, G
SiamBAN (Chen et al., 2020) | CVPR 2020 | ResNet-50 | IV, ID, C, G, L, Y

4.2 FPV Trackers

Since there are no public implementations of the FPV trackers described in Sect. 2.1, we introduce 4 new FPV-specific tracking baselines.

4.2.1 TbyD-F/H

The first two FPV baselines build up on FPV-specific object detectors (Damen et al., 2018; Shan et al., 2020). Considering that they are a popular approach for object localization in FPV and off-the-shelf FPV-trained instances are publicly available, we tested whether they can be used as naïve tracking baselines. To this end, we define a simple processing procedure which we found to work surprisingly well. At the first frame of a tracking sequence, the initial bounding box is memorized as current information about the target object's position. Then, at every other frame, an FPV object detector is run to provide the boxes of all object instances present in the frame. As output for the current frame, the bounding-box having the largest intersection-over-union (IoU) with the previously memorized box is given. If the detector does not output detections for a particular frame or none of its predicted boxes has IoU greater than 0, then the previously memorized box is given as output for the current frame. As object detectors, we used the EK-55 trained Faster-R-CNN (Damen et al., 2018; Ren et al., 2015) and the Faster-R-CNN-based hand-object interaction detector HiC (Shan et al., 2020). The tracking baseline built upon the first detector is referred to as TbyD-F, while the one built on the second as TbyD-H.
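The association rule of TbyD-F/H can be summarized with the following Python sketch. The `detect` callable stands for the FPV detector (the EK-55-trained Faster R-CNN for TbyD-F, HiC for TbyD-H) and is assumed to return a list of (x, y, w, h) boxes; updating the memorized box with the selected detection is our reading of the procedure, and the function names are not those of the released code.

```python
# Sketch of the TbyD-F/H baseline: associate per-frame detections to the memorized box via IoU.
def iou(a, b):
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def track_by_detection(frames, init_box, detect):
    memory, outputs = init_box, [init_box]
    for frame in frames[1:]:
        detections = detect(frame)
        best = max(detections, key=lambda d: iou(d, memory), default=None)
        if best is not None and iou(best, memory) > 0:
            memory = best        # assumption: the selected detection becomes the new memorized box
        outputs.append(memory)   # otherwise the previously memorized box is re-reported
    return outputs
```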
4.2.2 LTMU-F/H

We developed 2 other FPV-specific trackers in addition to the aforementioned ones. In this case, we wanted to combine the capabilities of generic object trackers with the FPV-specific object localization abilities of detectors (Damen et al., 2018; Shan et al., 2020). Particularly, the baselines combine the LTMU tracker (Dai et al., 2020) with FPV-specific object detectors. The first solution, referred to as LTMU-F, employs the Faster-R-CNN object detector trained on EK-55 (Damen et al., 2018), while the second, denoted as LTMU-H, uses the hand-object detector HiC (Shan et al., 2020). These two trackers exploit the respective detectors as re-detection modules according to the LTMU scheme (Dai et al., 2020). For a better understanding, we briefly recap the processing procedure of the LTMU tracker (Dai et al., 2020). After being initialized with the target in the first frame of a sequence, at every other frame LTMU first executes a short-term tracker that tracks the target in a local area of the frame based on the target's last position. The patch extracted from the box prediction of the tracker is evaluated by an online-learned verification module based on MDNet (Nam & Han, 2016), which outputs a probability estimate of the target being contained in the patch. Such an estimate, together with the tracker's predicted target presence, is used to decide if the short-term tracker is tracking the target or not. If it is, its predicted box is given as output for the current frame. In the other case, a re-detection module is executed to look for the target in the whole frame. The re-detector returns some candidate locations which may contain the target and each of these is checked by the verification module. The candidate patch with the highest confidence is given as output and used as a new target location to re-initialize the short-term tracker. The verifier's output as well as the tracker's confidence are used to decide when to update the parameters of the former. Based on experiments, we used STARK (Yan et al., 2021) as short-term tracker and the aforementioned FPV-based detectors as re-detection modules. For LTMU-F, such a module has been set to retain the first 10 among the many detections given as output, considering a ranking based on the scores attributed by the detector to each detection. If no detection is given for a frame, the last available position of the target is considered as a candidate location. For LTMU-H, we used the object localizations of the hand-object interaction detections given by the FPV version of HiC (Shan et al., 2020) as target candidate locations. HiC is implemented as an improved Faster R-CNN which is set to provide, at the same time, the localization of hands and interacted objects, as well as their state of interaction. As for LTMU-F, if no detection is given for a frame, the last available position of the target is considered as a candidate location. For both detection methods, the original pre-trained models provided by the authors have been used. The described setups, the common scheme of which is presented in Fig. 4, give rise to two new FPV trackers that implement conceptually different strategies for FPV-based object localization and tracking. Indeed, the first solution aims to just look for objects in the scene, while the second one reasons in terms of the interaction happening between the camera wearer and the objects.

Fig. 4 Scheme of execution of the proposed FPV baseline trackers LTMU-F and LTMU-H based on LTMU (Dai et al., 2020)

The choice of using LTMU (Dai et al., 2020) as a baseline methodology stems from its highly modular scheme which makes it the most easily configurable tracker with state-of-the-art performance available today. We took advantage of the convenience of such a framework to insert the FPV-specific modules described before.
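A minimal sketch of the FPV-specific re-detection modules plugged into the LTMU scheme is given below. The `detector` callable is assumed to return (box, score) pairs: all object detections for LTMU-F, or the boxes of the objects involved in a hand-object interaction for LTMU-H. The candidates returned here would then be scored by LTMU's verification module; the top-10 ranking is the one described for LTMU-F, and the function and parameter names are ours.

```python
# Sketch of the candidate-generation step used by LTMU-F/H when LTMU triggers re-detection.
def fpv_candidates(frame, detector, last_target_box, top_k=10):
    detections = detector(frame)  # assumed list of ((x, y, w, h), score) pairs
    if not detections:
        # Fallback described for both baselines: re-propose the last known target position.
        return [last_target_box]
    ranked = sorted(detections, key=lambda d: d[1], reverse=True)
    return [box for box, _ in ranked[:top_k]]
```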
5 Evaluation Settings

5.1 Evaluation Protocols

The protocols used to execute the trackers are described in the following.

5.1.1 One-Pass Evaluation

We employed the one-pass evaluation (OPE) protocol detailed in (Wu et al., 2015) which implements the most realistic way to run a tracker in practice. The protocol consists of two main stages: (i) initializing a tracker with the ground-truth bounding box of the target in the first frame; (ii) letting the tracker run on every subsequent frame until the end of the sequence and recording predictions to be considered for the evaluation. To obtain performance scores for each sequence, predictions and ground-truth bounding boxes are compared according to some distance measure only in frames where ground-truths are present (ground-truth bounding boxes are not given for frames in which the target is fully occluded or out of the field of view). The overall scores are obtained by averaging the scores achieved for every sequence.

The tracker initialization with the ground-truth is performed to evaluate the trackers in the best possible conditions, i.e. when accurate information about the target is given. In practical applications, such user-defined information is generally unavailable. We expect this scenario to occur especially in FPV applications where object localization is obtained via detectors (Damen et al., 2018; Shan et al., 2020). Detectors predict bounding boxes with spatial noise (in the position and/or in the scale), and the initialization of trackers with such noisy information could influence the tracking performance. Hence, to understand the impact of the initial box given by an object detector, we consider a version of the OPE protocol, referred to as OPE-D, where each tracker is initialized in the first frame in which the detector's prediction has IoU ≥ 0.5 with the ground-truth box. From such a frame (that could be delayed in time with respect to the beginning of the sequence), each tracker is also run with the ground-truth box. The change in the metric values obtained after running the two modalities is used to quantify the impact of the initialization box.

5.1.2 Multi-Start Evaluation

To obtain a more robust evaluation (Kristan et al., 2016), especially for the analysis over sequence attributes and action verbs, we employed the recent protocol proposed in (Kristan et al., 2020), which defines different points of initialization along a video. In more detail, for each sequence, different initialization points—called anchors—separated by 2 s are defined. Anchors are always set in the first and last frames of a sequence. Some of the inner anchors are shifted forward by a few frames in order to avoid frames in which the target is not visible. A tracker is run on each of the sub-sequences yielded by the anchor either forward or backward in time depending on the longest sub-sequence the anchor generates. The tracker is initialized with the ground-truth annotation in the first frame of the sub-sequence and let run until its end. Then, as for the OPE, predicted and ground-truth boxes are compared to obtain performance scores for each sub-sequence. Scores for a single sequence are computed by averaging the scores of each sub-sequence weighted by their length in number of frames. Similarly, the overall scores for the whole dataset are obtained by averaging each sequence's score weighted by its number of frames. We refer to this protocol as multi-start evaluation (MSE). It allows a tracker to better cover all the situations happening in the sequences, ultimately leading to more robust evaluation scores.
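The length-weighted aggregation used by the MSE protocol can be sketched as follows. Here `sub_scores` maps each sequence to the (score, number of frames) pairs of its anchor-generated sub-sequences, with each score computed by comparing predictions and ground-truths as in the OPE; the data layout and function names are assumptions for illustration.

```python
# Sketch of the MSE aggregation: length-weighted averages per sequence and over the dataset.
def weighted_average(pairs):
    """pairs: iterable of (score, num_frames)."""
    total = sum(n for _, n in pairs)
    return sum(s * n for s, n in pairs) / total if total > 0 else 0.0

def mse_scores(sub_scores):
    per_sequence = {seq: weighted_average(pairs) for seq, pairs in sub_scores.items()}
    # Overall score: each sequence weighted by its total number of frames.
    overall = weighted_average([(score, sum(n for _, n in sub_scores[seq]))
                                for seq, score in per_sequence.items()])
    return per_sequence, overall
```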
272 International Journal of Computer Vision (2023) 131:259–283
Fig. 5 Schematic visualization of the protocol designed to execute and tracking is run. The HOI detector HiC (Shan et al., 2020) is first
trackers in the context of a hand-object interaction (HOI) detection executed in every frame to obtain a valid HOI (in this example the first
task. The HOI labels provided for TREK-150 are used to consider sub- valid detection is obtained at frame 75). Once such an event is deter-
sequences of frames in which the camera wearer is interacting with the mined, the tracker is initialized with the bounding box given by HiC for
target object. In this picture, the labels BHI are employed to indicate the object involved in the interaction. The tracker is then run on all the
that an interaction by both hands is happening in the frame range [74, subsequent frames to provide the reference to such an object
120]. On such sub-sequences, a systematic pipeline for HOI detection
fashion as we did to compute score in the MSE. To evalu- from the ground-truth (Fig. 6b). As summary measures, we
ate the impact of visual trackers on this task, we switch the report the success score (SS) (Wu et al., 2015) and normal-
pipeline’s tracker with each of the ones studied in this work. ized precision scores (NPS) (Müller et al., 2018), which are
This experimental procedure gives us an estimate of the accu- computed as the Area Under the Curve (AUC) of the success
racy of the HOI detection system under configurations with plot and normalized precision plot respectively.
different trackers. More interestingly, the proposed evalua- Along with these standard metrics, we employ a novel
tion protocol allows also to build a ranking of the trackers plot which we refer to as generalized success robustness plot
based on the results of a downstream application. To the best (Fig. 6c). For this, we take inspiration from the robustness
of our knowledge, this setup brings a new way to assess the metric proposed in Kristan et al. (2020) which measures the
performance of visual object trackers. normalized extent of a tracking sequence before a failure.
We believe this aspect to be especially important in FPV as a
5.1.4 Real-Time Evaluation superior ability of a tracker to maintain longer references to
targets can lead to the better modeling of actions and inter-
Since many FPV tasks such as object interaction (Damen et actions. The original metric proposed in Kristan et al. (2020)
al., 2016) and early action recognition (Furnari & Farinella, uses a fixed threshold of 0.1 on the bounding box overlap to
2019), or action anticipation (Damen et al., 2018), require detect a collapse of the tracker. Such a value was determined
real-time computation, we evaluate trackers in such a setting mainly to reduce the chance of cheating in the VOT2020
by following the instructions given in (Kristan et al., 2017; competition and it is not necessarily the case that such a
Li et al., 2020). Explanations and results are given in the value could work well for different tracking applications. To
supplementary document. generalize the metric, we take inspiration from the success
and normalized precision plots and propose to use differ-
5.2 Performance Measures ent box overlap thresholds ranging in [0, 0.5] to determine
the collapse. We consider 0.5 as the maximum threshold as
To quantify the performance of the trackers, we used different higher overlaps are usually associated to positive predictions
measures that compare trackers’ predicted bounding boxes in many computer vision tasks. Overall, our proposed plot
with the temporally aligned ground-truth boxes. To evaluate allows to assess the length of tracking sequences in a more
the overall localization accuracy of the trackers, we employ general way that is better aligned with the requirements of dif-
the success plot (Wu et al., 2015), which shows the percent- ferent application scenarios including FPV ones. Similarly to
age of predicted boxes whose IoU with the ground-truth is Wu et al. (2015); Müller et al. (2018), we use the AUC of the
larger than a threshold varied from 0 to 1 (Fig. 6a). We also generalized success robustness plot to obtain an aggregate
use the normalized precision plot (Müller et al., 2018), that score which we refer to as generalized success robustness
reports, for a variety of thresholds, the percentage of boxes (GSR).
whose center points are within a given normalized distance
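For clarity, the sketch below shows one possible way to compute SS, NPS, and GSR from per-frame IoU values and normalized center errors, together with the length-weighted aggregation used for sub-sequence scores. It is only an illustrative implementation under our own assumptions (uniform threshold sampling, strict inequalities); the actual evaluation code may differ in such details.

```python
import numpy as np

def success_score(ious, thresholds=np.linspace(0, 1, 101)):
    """SS: AUC of the success plot, i.e. the average fraction of frames
    whose IoU with the ground-truth exceeds each overlap threshold."""
    ious = np.asarray(ious, dtype=float)
    return float(np.mean([(ious > t).mean() for t in thresholds]))

def normalized_precision_score(norm_center_errors, thresholds=np.linspace(0, 0.5, 101)):
    """NPS: AUC of the normalized precision plot, computed on center
    distances normalized by the ground-truth box size."""
    errs = np.asarray(norm_center_errors, dtype=float)
    return float(np.mean([(errs < t).mean() for t in thresholds]))

def generalized_success_robustness(ious, thresholds=np.linspace(0, 0.5, 51)):
    """GSR: AUC of the generalized success robustness plot. For each
    overlap threshold in [0, 0.5], measure the normalized extent of the
    sequence tracked before the first collapse (IoU below the threshold)."""
    ious = np.asarray(ious, dtype=float)
    n = len(ious)
    extents = []
    for t in thresholds:
        failures = np.nonzero(ious < t)[0]
        first_failure = failures[0] if failures.size > 0 else n
        extents.append(first_failure / n)
    return float(np.mean(extents))

def length_weighted_average(scores, lengths):
    """Aggregate per-sub-sequence scores weighting by their length in
    frames, as done for the MSE and HOI evaluations."""
    scores = np.asarray(scores, dtype=float)
    lengths = np.asarray(lengths, dtype=float)
    return float(np.sum(scores * lengths) / np.sum(lengths))
```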
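Regarding the real-time setting of Sect. 5.1.4, the following sketch conveys the spirit of the protocols we follow (Kristan et al., 2017; Li et al., 2020): a tracker that does not deliver its output within the frame period simply keeps its latest prediction for the frames it missed. This is our own simplified reading of those protocols, with a hypothetical init/update tracker interface; the exact rules and results are reported in the supplementary document.

```python
import time

def run_real_time(tracker, frames, init_box, fps=30.0):
    """Run a tracker under a simple real-time constraint: if it falls
    behind the frame rate, the last available prediction is reused for
    the frames that arrive in the meantime."""
    frame_period = 1.0 / fps
    tracker.init(frames[0], init_box)          # hypothetical tracker API
    predictions = [init_box]
    last_box = init_box
    deadline = time.perf_counter() + frame_period
    for frame in frames[1:]:
        if time.perf_counter() <= deadline:
            last_box = tracker.update(frame)   # hypothetical tracker API
        # otherwise the tracker is late: propagate the previous box
        predictions.append(last_box)
        deadline += frame_period
    return predictions
```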
6 Results
longer periods of time. Trackers that aim to build robust target models via online methods (e.g. STMTrack, ECO, TrDiMP, VITAL, MDNet, ATOM) result in better solutions for keeping a longer temporal reference to objects. Particularly, the results achieved by STMTrack tell that a strategy based on memory networks building a highly dynamic representation of the template during tracking is beneficial to maintain a longer reference to the target.

By comparing the performance of the selected trackers with the results they achieve on standard benchmarks such as OTB-100 (Wu et al., 2015), as reported in Fig. 18 of the supplementary document, it can be noticed that the overall performance of all the trackers decreases across all measures when considering the FPV scenario. Considering the extended usage of data-driven approaches (e.g. deep learning) in visual tracking nowadays, we assessed the impact of leveraging large-scale FPV object localization data for training. An in-depth discussion and results are provided in Section 11.3 of the supplementary document. In short, some methodologies such as deep discriminative trackers (Bhat et al., 2019) benefit from FPV-specific data, but the overall tracking performance still does not reach the quality that is observed on more common tracking benchmarks (Wu et al., 2015; Mueller et al., 2016; Galoogahi et al., 2017; Kristan et al., 2019). Other methodologies such as siamese network-based trackers (Li et al., 2019) and transformer-based trackers (Yan et al., 2021) are not able to exploit the context of FPV from still FPV images. This weakness could be addressed by yet-to-come large-scale FPV tracking datasets. Overall, these outcomes demonstrate that, given the current availability of tracking data as well as the current knowledge on how to exploit it, the FPV setting poses new challenges to present trackers. It is worth mentioning that our conclusions are consistent with the demonstrated performance drop of other object localization models (e.g. object detectors) when moving from classical domains (Everingham et al., 2015; Lin et al., 2014) to FPV domains (Damen et al., 2018).

6.2 Performance of the FPV-Specific Trackers

The results achieved by the proposed TbyD-F and TbyD-H FPV-based tracking-by-detection baselines are compared with the generic object trackers in Figs. 6, 16 and 7, 17. As can be noticed, the baselines achieve results competitive with the best trackers in the SS and NPS metrics, but they struggle in the GSR. This means that they are not able to maintain reference to the objects even though the other scores suggest they provide spatially accurate localizations. By comparing TbyD-F with TbyD-H, we observe that the second is better in an OPE-like execution scenario, while the first achieves higher scores in the MSE experiments. Table 4 reports the performance of such two trackers with other strategies (details are given in Section 9.1 of the supplementary document) that implement target association on top of object detection (Bewley et al., 2016; Dave et al., 2020). A simple application of SORT (Bewley et al., 2016) does not work as well as demonstrated in other domains (Dave et al., 2020), and applying such a method in combination with the strategy described in Sect. 4.2.1 brings little benefit.

Figures 6, 16 and 7, 17 also show the performance of the other FPV baselines, LTMU-F and LTMU-H, in comparison with the different trackers. In both the OPE and MSE experiments, the proposed trackers achieve the top spots in the SS and NPS rankings, while they lose some performance in the GSR score. Table 5 shows the performance gain obtained by applying the LTMU-F/H scheme over different generic object trackers (Dai et al., 2020; Fu et al., 2021; Yan et al., 2021). Overall, both LTMU-F and LTMU-H increase the SS and NPS metrics of the underlying tracker, with the second presenting a generally larger improvement. In the versions with STARK and STMTrack, the GSR scores are decreased. However, looking at the DiMP-MU version (as used in Dai et al. (2020)) we see that the performance is improved by a good margin in all the metrics, including the GSR. Considering that such an underlying tracker uses a MetaUpdater (Dai et al., 2020) to better assess the consistency of the tracker in triggering re-detection and model update, we hypothesize that such a module could bring benefit to the other versions if properly customized. Fig. 20 of the supplementary document presents some qualitative examples of the performance of the LTMU-F/H trackers in contrast to the baseline ones. Overall, the message to take from these outcomes is that adapting a state-of-the-art method with FPV-specific components allows to increase the tracking performance. Combining hand and object tracking, as the baseline LTMU-H naïvely does, is a promising direction. We hence expect significant performance improvements to be achievable by a tracker accurately designed to exploit FPV-specific cues such as the characteristics of the interaction between the target and the camera wearer.

6.3 Initialization by an Object Detector

Figures 9 and 19 report the SS, NPS, and GSR performance change when the EK-55 Faster-R-CNN (Damen et al., 2018) or the HiC (Shan et al., 2020) detection bounding box is used to initialize the trackers. In general, such a process causes a drop in the tracking performance. This can be explained by the noise in the position and scale of the initial target state, which consequently affects the construction of the models that are used for tracking during the video (Wu et al., 2015). By computing the average delta across the trackers for each of the metrics, we obtain that Faster-R-CNN causes SS, NPS, GSR drops of −5.3%, −5.1%, −3.1%. HiC leads to slightly larger drops of −5.9%, −5.7%, −4.4%. It is worth mentioning that Faster-R-CNN provided 149 valid detections out of 150 with an average delay of 14 frames from the start of the
sequence, while HiC gave 146 valid detections with a delay of 28 frames. Hence, HiC is a weaker object detector. Overall, we consider the average performance drop quite limited, thus making the trackers usable even in cases of noisy initialization. TbyD-F/H are among the trackers losing the least accuracy, but despite this their performance does not surpass trackers more susceptible to noise, such as LTMU-F/H, STARK, and TransT. Indeed, when initialized by Faster-R-CNN, TbyD-H achieves SS 0.440, while LTMU-H, STARK, and TransT achieve SS 0.478, 0.470, and 0.466, respectively.

Fig. 8 Qualitative results of some of the generic object trackers benchmarked on the proposed TREK-150 dataset

6.4 Attribute Analysis

Figure 10 reports the SS, NPS, and GSR scores, computed with the MSE protocol, of the 20 representative trackers with respect to the attributes introduced in Table 2. We do not report results for the POC attribute as it is present in almost all the sequences.
Fig. 11 SS, NPS, and GSR performance achieved under the MSE protocol by 20 of the 42 selected trackers with respect to the action verbs performed by the camera wearer and available in TREK-150. The red plain line highlights the average performance

Fig. 12 SS, NPS, and GSR performance achieved under the MSE protocol by 20 of the 42 selected trackers with respect to the target noun categories available in TREK-150. The red plain line highlights the average tracker performance
6.5 Action Analysis

The plot in Fig. 11 reports the MSE protocol results of SS, NPS, and GSR with respect to the action verb labels associated to the actions performed by the camera wearer in each video sequence. We think that the results presented in the following can give cues about the exploitation of trackers for action recognition tasks. In general, we observe that the actions mainly causing a spatial displacement of the target (e.g. "move", "store", "check") have less impact on the performance of the trackers. Instead, actions that change the state, shape, or aspect ratio of the target object (e.g. "remove", "squeeze", "cut", "attach") generate harder tracking scenarios. The sequences characterized by the "wash" action verb also lead trackers to poor performance. Indeed, such an action makes the object harder to track because of the many occlusions caused by the persistent and severe manipulation washing involves. It can be noted from the plots that no tracker prevails overall, but LTMU-F/H, STARK, and TransT occupy the top spots, especially in the plots relative to SS and NPS. In general, the performance of the trackers varies considerably across the different actions, showing that various approaches are suitable to track under the different conditions generated.

The plots in Fig. 12 present the performance scores of the trackers with respect to the target noun labels, i.e. the categories of the target objects. Rigid, regular-sized objects such as "pan", "kettle", "bowl", "plate", and "bottle" are among the ones associated with a higher average SS, greater than or around 0.5, but some of them (e.g. "plate" and "bottle") lead to lower GSR scores, meaning that trackers provide a spatially accurate but short temporal reference to such kinds of objects. In contrast, other rigid objects such as "knife", "spoon", "fork" and "can" are more difficult to track from the point of view of all the considered measures (the scores are around 0.3 or lower). This is probably due to the particularly thin shape of these objects and the light reflectance they are easily subject to. Deformable objects such as "sponge", "onion", "cloth" and "rubbish" are in general also difficult to track.

6.6 Hand-Object Interaction Evaluation

Tables 6 and 7 present the results of the evaluation of the HOI task described in Sect. 5.1.3 in relation to the considered trackers. Although we have shown that FPV introduces challenges for current trackers, with this experiment we want to assess whether they can still be exploited in the FPV domain to obtain information about the objects' locations and movements in the scene (Furnari et al., 2017; Furnari & Farinella, 2020; Sener et al., 2020; Shan et al., 2020; Wang et al., 2020). The results given in the first column of the table report the Recall of the proposed video-based HOI detection pipeline in which each tracker is included. The values in the brackets of the second column report the SS, NPS, and GSR results achieved by the tracker run in an OPE-like fashion on the same sub-sequences on which the pipeline is executed. It can be noticed how the performance difference between the trackers is reduced with respect to what is shown in Figs. 6 and 16. This demonstrates that, when deployed for HOI, the different tracking methodologies lead to an overall similar pipeline. In particular, STARK emerges as a better-suited methodology for tracking objects starting from an initialization given by an object detection algorithm in this context. By comparing the Recall with the tracker performance scores (SS, NPS, GSR), it can be noted that there is a correlation between the first and the SS, since the ranking of the trackers according to the first measure is very similar to that of the second measure.

In Table 8 of the supplementary document, the results of an oracle-based solution that gives the optimal bounding box for the interacted object at the first frame of HOI are presented. The first thing that stands out is the performance gap with respect to what is reported in Tables 6 and 7. This is due to the performance of HiC, which struggles to find a valid HOI detection in the proposed video-based pipeline. This issue delays the initialization of the tracker, leaving the overall pipeline unable to detect and localize the HOI in many frames. These outcomes show that, if initialized with a proper bounding box for the object involved in the interaction, the trackers are able to maintain the spatial and temporal reference to such an object for all the interaction period with promising accuracy. Indeed, the Recall value achieved by the proposed HOI system with LTMU-H reaches 0.754. It is also worth observing that the SS, NPS, GSR scores achieved in this experiment reflect the performance achieved by the trackers with the OPE protocol on the full sequences of TREK-150, as reported in Figs. 6 and 16. These results demonstrate that the evaluation of the trackers' performance on the original sequences of TREK-150 can lead to conclusions about the behavior of the trackers in particular FPV application scenarios. Furthermore, the reader might wonder why there is such a large absolute difference in the values of the SS, NPS, and GSR present in Table 8 and those in the brackets of Fig. 16. This can be explained by the fact that in the considered HOI evaluation the lengths of the video sequences are very short (the average length is of 81 frames). In contrast, the average length of the full video sequences present in TREK-150 is 649 frames, which is much higher than the previously discussed number. Such a shorter duration of the videos simplifies the job of the trackers, since the variations of the target object and the scene are less significant in these conditions than in longer sequences. A justification to this explanation is also given by the GSR results of Figs. 6 and 16. For example, on such a measure, STARK achieves 0.395, which means that such an algorithm tracks successfully until 39.5% of a sequence length. In number of frames, such a fraction is 256 on average. This value is much higher than the length of the sub-sequences considered in this experiment.
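The frame-count figure mentioned above follows directly from the reported averages; as a quick check (values taken from the text):

```python
avg_sequence_length = 649    # average length of a full TREK-150 sequence (frames)
avg_subsequence_length = 81  # average length of an HOI sub-sequence (frames)
stark_gsr = 0.395            # GSR achieved by STARK under the OPE protocol

frames_tracked = stark_gsr * avg_sequence_length
print(round(frames_tracked))                            # ~256 frames
print(round(frames_tracked) > avg_subsequence_length)   # True: longer than the sub-sequences
```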
Table 6 Results of the experiment in which 20 of the considered trackers are evaluated by the Recall of an FPV HOI detection pipeline where trackers are used as the localization method for the object involved in the interaction

Tracker        Recall (SS, NPS, GSR)
STARK          0.248 (0.211, 0.221, 0.222)
LTMU-H         0.246 (0.210, 0.222, 0.217)
LTMU-F         0.245 (0.210, 0.221, 0.216)
TbyD-H         0.238 (0.205, 0.223, 0.163)
LightTrack     0.233 (0.197, 0.212, 0.228)
KeepTrack      0.232 (0.201, 0.214, 0.212)
SiamRPN++      0.227 (0.191, 0.206, 0.209)
TbyD-F         0.220 (0.184, 0.202, 0.179)
STMTrack       0.216 (0.196, 0.202, 0.219)
D3S            0.211 (0.187, 0.199, 0.208)
ECO            0.211 (0.181, 0.196, 0.217)
DiMP           0.210 (0.186, 0.198, 0.211)
ATOM           0.207 (0.186, 0.198, 0.213)
VITAL          0.198 (0.178, 0.192, 0.213)
SiamFC         0.195 (0.171, 0.180, 0.195)
GlobalTrack    0.195 (0.170, 0.180, 0.144)
BACF           0.188 (0.170, 0.189, 0.206)
Staple         0.182 (0.164, 0.179, 0.204)
MOSSE          0.158 (0.151, 0.154, 0.188)
GOTURN         0.139 (0.138, 0.147, 0.162)

The first column presents the results of the proposed system in which each tracker is initialized with the bounding box given by HiC in its first valid HOI detection. The last column reports the SS, NPS, and GSR results achieved by each tracker with the OPE protocol on the sub-sequences yielded by the HOI labels. Best results, per measure, are highlighted in bold, second-best in bold italic, third-best in italic

We also compare with a solution based only on the hand-object interaction detector HiC (Shan et al., 2020), which processes the frames independently. This solution achieves a Recall of 0.113, which is very low when compared to the 0.248, 0.246, and 0.245 achieved by the pipelines exploiting STARK, LTMU-H, and TransT, respectively.

In addition, we compared the performance of the EK-55-trained Faster R-CNN (Damen et al., 2018) and HiC (Shan et al., 2020) when used as pure object detectors (not exploiting temporal information for tracking as in the TbyD-F/H baselines). In this case, for Faster-R-CNN, at every frame, we consider as output the bounding box having the highest score associated to the category of the target object in the video, while for HiC we just take the object bounding box having the largest score (HiC provides class-agnostic object detections). On the sequences of TREK-150 the first solution achieves an OPE-based SS, NPS, and GSR of 0.323, 0.369, 0.044 respectively, and runs at 1 FPS, while the second reaches SS 0.411, NPS 0.438, GSR 0.007, at 8 FPS. Comparing these results with those of the TbyD-F/H baselines, we see the advantage of performing tracking, since all the metric scores are improved. Moreover, if we compare the detectors' results with the ones presented in the overall study, we clearly notice that trackers, even when initialized by a detection module, can deliver faster, more accurate, and much temporally longer object localization than detectors.

Overall, these outcomes demonstrate that visual object trackers can bring benefits to FPV application pipelines. In addition to the ability of maintaining reference to specific object instances, the advantages of tracking are achieved in terms of better object localization and efficiency. We hence expect that trackers will likely gain more importance in FPV as new methodologies explicitly considering the first person point of view are investigated.
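To make the structure of the evaluated HOI pipeline (Sect. 5.1.3, Fig. 5) explicit, the sketch below couples a per-frame HOI detector with a visual tracker: the detector is run until it produces a valid hand-object interaction, the tracker is initialized with the detected object box, and the tracker then provides the reference to the object in the subsequent frames. The detector and tracker interfaces are illustrative placeholders rather than the actual HiC or tracker APIs.

```python
def run_hoi_pipeline(detector, tracker, frames):
    """Couple a per-frame HOI detector with a visual tracker.

    detector(frame) is assumed to return the bounding box of the object
    involved in a hand-object interaction, or None when no valid HOI is
    found; tracker is assumed to expose an init/update interface.
    """
    object_boxes = [None] * len(frames)
    for t, frame in enumerate(frames):
        box = detector(frame)
        if box is None:
            continue
        # First valid HOI detection: hand localization over to the tracker.
        object_boxes[t] = box
        tracker.init(frame, box)
        for k in range(t + 1, len(frames)):
            object_boxes[k] = tracker.update(frames[k])
        break
    return object_boxes
```

In the experiments, the per-frame outputs of such a pipeline are compared against the HOI labels of TREK-150 to obtain the Recall values reported in Table 6.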
visual object tracking benchmarks. This is explained by the different nature of the images and the particular characteristics introduced by FPV, which offer new and challenging conditions for the current knowledge in the visual tracking domain, and by the lack of tracking-specific FPV data. The analysis revealed that deep learning-based trackers employing online adaptation techniques achieve better performance than trackers based on siamese neural networks or on handcrafted features. Among the different methodologies based on this approach, the transformer-based ones worked the best and hence are a promising future direction. This exploration could involve the curation of large-scale diverse tracking-specific data. The introduction of FPV-specific object localization modules, such as HOI models, in a tracking pipeline increased its performance, demonstrating that particular cues about the domain influence the tracking accuracy. These results highlighted the potential direction of joint hand-object tracking, and we expect successful methodologies to also take into account cues about the camera wearer's surroundings. The performance of the trackers was then studied in relation to specific attributes characterising the visual appearance of the target and the scene. It turned out that the most challenging factors for trackers are the target's out of view, its full occlusions, its low resolution, as well as the presence of similar objects or of fast motion in the scene. Trackers were also analyzed based on the action performed by the camera wearer as well as the object category the target belongs to. We found that actions causing a change of state, shape, or aspect ratio of the target affected the trackers more than actions causing only spatial changes. We think that trackers incorporating semantic information about the person's action could be an interesting direction of investigation. We observed that rigid thin-shaped objects are among the hardest ones to track. Finally, we evaluated the trackers in the context of the FPV-specific application of video-based hand-object interaction detection. We included each tracker in a pipeline to tackle such a problem, and evaluated the performance of the system to quantify the tracker's contribution. We observed that the trackers demonstrate a behavior that is consistent with their overall performance on the sequences of TREK-150. Even though FPV introduced challenging factors for trackers, the results in such a specific task demonstrated that current trackers can be used successfully if the video sequences in which tracking is required are not too long. We also demonstrated that trackers bring advantages in terms of object referral, localization, and efficiency over object detection. We think that an effective and efficient integration of tracking methodologies with those of FPV downstream applications is a relevant problem to study. In conclusion, we believe that there is potential in improving FPV pipelines by employing visual trackers, as well as there is room for the improvement of the performance of visual object trackers in this new domain.

Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/s11263-022-01694-6.

Funding Open access funding provided by Universitá degli Studi di Udine within the CRUI-CARE Agreement.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References

Aghaei, M., Dimiccoli, M., & Radeva, P. (2016). With whom do I interact? Detecting social interactions in egocentric photo-streams. In ICPR.
Aghaei, M., Dimiccoli, M., & Radeva, P. (2016). Multi-face tracking by extended bag-of-tracklets in egocentric photo-streams. Computer Vision and Image Understanding, 149, 146–156.
Alletto, S., Serra, G., & Cucchiara, R. (2015). Egocentric object tracking: An odometry-based solution. In ICIAP.
Bertasius, G., Park, H. S., Yu, S. X., & Shi, J. (2017a). First-person action-object detection with EgoNet. In Robotics: Science and Systems.
Bertasius, G., Soo Park, H., Yu, S. X., & Shi, J. (2017). Unsupervised learning of important objects from first-person videos. In ICCV.
Bertinetto, L., Valmadre, J., Golodetz, S., Miksik, O., & Torr, P. H. (2016). Staple: Complementary learners for real-time tracking. In CVPR.
Bertinetto, L., Valmadre, J., Henriques, J. F., Vedaldi, A., & Torr, P. H. (2016). Fully-convolutional siamese networks for object tracking. In ECCVW.
Bewley, A., Ge, Z., Ott, L., Ramos, F., & Upcroft, B. (2016). Simple online and realtime tracking. In ICIP.
Bhat, G., Danelljan, M., Van Gool, L., & Timofte, R. (2020). Know your surroundings: Exploiting scene information for object tracking. In ECCV.
Bhat, G., Danelljan, M., Van Gool, L., & Timofte, R. (2019). Learning discriminative model prediction for tracking. In ICCV.
Bolme, D. S., Beveridge, J. R., Draper, B. A., & Lui, Y. M. (2010). Visual object tracking using adaptive correlation filters. In CVPR.
Cai, M., Kitani, K. M., & Sato, Y. (2016). Understanding hand-object manipulation with grasp types and object attributes. In Robotics: Science and Systems.
Cao, Z., Radosavovic, I., Kanazawa, A., & Malik, J. (2020). Reconstructing hand-object interactions in the wild. arXiv.
Čehovin, L., Kristan, M., & Leonardis, A. (2013). Robust visual tracking using an adaptive coupled-layer visual model. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(4), 941–953.
Chen, X., Yan, B., Zhu, J., Wang, D., Yang, X., & Lu, H. (2021). Transformer tracking. In CVPR.
Chen, Z., Zhong, B., Li, G., Zhang, S., & Ji, R. (2020). Siamese box adaptive network for visual tracking. In CVPR.
Comaniciu, D., Ramesh, V., & Meer, P. (2000). Real-time tracking of non-rigid objects using mean shift. In CVPR.
Dai, K., Zhang, Y., Wang, D., Li, D., Lu, H., & Yang, X. (2020). High-performance long-term tracking with meta-updater. In CVPR.
Damen, D., Doughty, H., Farinella, G. M., Fidler, S., Furnari, A., Kazakos, E., Moltisanti, D., Munro, J., Perrett, T., Price, W., & Wray, M. (2018). Scaling egocentric vision: The EPIC-KITCHENS dataset. In ECCV.
Damen, D., Doughty, H., Farinella, G. M., Furnari, A., Kazakos, E., Ma, J., et al. (2021). Rescaling egocentric vision: Collection, pipeline and challenges for EPIC-KITCHENS-100. International Journal of Computer Vision, 130(1), 33–55.
Damen, D., Leelasawassuk, T., & Mayol-Cuevas, W. (2016). You-do, I-learn: Egocentric unsupervised discovery of objects and their modes of interaction towards video-based guidance. Computer Vision and Image Understanding, 149, 98–112.
Danelljan, M., Bhat, G., Khan, F. S., & Felsberg, M. (2017a). ECO: Efficient convolution operators for tracking. In CVPR.
Danelljan, M., Bhat, G., Khan, F. S., & Felsberg, M. (2019). ATOM: Accurate tracking by overlap maximization. In CVPR.
Danelljan, M., Hager, G., Khan, F. S., & Felsberg, M. (2017b). Discriminative scale space tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(8), 1561–1575.
Danelljan, M., Van Gool, L., & Timofte, R. (2020). Probabilistic regression for visual tracking. In CVPR.
Dave, A., Khurana, T., Tokmakov, P., Schmid, C., & Ramanan, D. (2020). TAO: A large-scale benchmark for tracking any object. In ECCV.
De la Torre, F., Hodgins, J. K., Montano, J., & Valcarcel, S. (2009). Detailed human data acquisition of kitchen activities: The CMU-multimodal activity database (CMU-MMAC). In Workshop on Developing Shared Home Behavior Datasets to Advance HCI and Ubiquitous Computing Research, in conjunction with CHI.
Dendorfer, P., Osep, A., Milan, A., Schindler, K., Cremers, D., Reid, I., et al. (2021). MOTChallenge: A benchmark for single-camera multiple target tracking. International Journal of Computer Vision, 129(4), 845–881.
Deng, J., Dong, W., Socher, R., Li, L., Kai, L., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In CVPR.
Dunnhofer, M., Martinel, N., Foresti, G. L., & Micheloni, C. (2019). Visual tracking by means of deep reinforcement learning and an expert demonstrator. In ICCVW.
Dunnhofer, M., Martinel, N., & Micheloni, C. (2020). Tracking-by-trackers with a distilled and reinforced model. In ACCV.
Dunnhofer, M., Martinel, N., & Micheloni, C. (2021). Weakly-supervised domain adaptation of deep regression trackers via reinforced knowledge distillation. IEEE Robotics and Automation Letters, 6(3), 5016–5023.
Everingham, M., Eslami, S., Van Gool, L., Williams, C. K., Winn, J., & Zisserman, A. (2015). The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1), 98–136.
Fan, H., Lin, L., Yang, F., Chu, P., Deng, G., Yu, S., Bai, H., Xu, Y., Liao, C., & Ling, H. (2019). LaSOT: A high-quality benchmark for large-scale single object tracking. In CVPR.
Fan, H., Miththanthaya, H. A., Harshit, S. R. Rajan, L. X., Zou, Z., Lin, Y., & Ling, H. (2021). Transparent object tracking benchmark. In ICCV.
Fan, H., Bai, H., Lin, L., Yang, F., Chu, P., Deng, G., et al. (2021). LaSOT: A high-quality large-scale single object tracking benchmark. International Journal of Computer Vision, 129(2), 439–461.
Fu, Z., Liu, Q., Fu, Z., & Wang, Y. (2021). STMTrack: Template-free visual tracking with space-time memory networks. In CVPR.
Furnari, A., & Farinella, G. M. (2019). What would you expect? Anticipating egocentric actions with rolling-unrolling LSTMs and modality attention. In ICCV.
Furnari, A., Battiato, S., Grauman, K., & Farinella, G. M. (2017). Next-active-object prediction from egocentric videos. Journal of Visual Communication and Image Representation, 49, 401–411.
Furnari, A., & Farinella, G. (2020). Rolling-unrolling LSTMs for action anticipation from first-person video. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(11), 4021–4036.
Galoogahi, H. K., Fagg, A., Huang, C., Ramanan, D., & Lucey, S. (2017). Need for speed: A benchmark for higher frame rate object tracking. In ICCV.
Grauman, K., Westbury, A., Byrne, E., Chavis, Z., Furnari, A., Girdhar, R., Hamburger, J., Jiang, H., Liu, M., Liu, X., Martin, M., Nagarajan, T., Radosavovic, I., Ramakrishnan, S. K., Ryan, F., Sharma, J., et al. (2022). Ego4D: Around the world in 3000 h of egocentric video. In CVPR.
Guo, D., Shao, Y., Cui, Y., Wang, Z., Zhang, L., & Shen, C. (2021). Graph attention tracking. In CVPR.
Han, S., Liu, B., Cabezas, R., Twigg, C. D., Zhang, P., Petkau, J., et al. (2020). MEgATrack: Monochrome egocentric articulated hand-tracking for virtual reality. ACM Transactions on Graphics, 39(4), Article 87.
Hare, S., Golodetz, S., Saffari, A., Vineet, V., Cheng, M. M., Hicks, S. L., & Torr, P. H. (2016). Struck: Structured output tracking with kernels. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(10), 2096–2109.
Held, D., Thrun, S., & Savarese, S. (2016). Learning to track at 100 FPS with deep regression networks. In ECCV.
Henriques, J. F., Caseiro, R., Martins, P., & Batista, J. (2015). High-speed tracking with kernelized correlation filters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(3), 583–596.
Huang, L., Zhao, X., & Huang, K. (2020). GlobalTrack: A simple and strong baseline for long-term tracking. In AAAI.
Huang, L., Zhao, X., & Huang, K. (2019). GOT-10k: A large high-diversity benchmark for generic object tracking in the wild. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(5), 1562–1577.
Kalal, Z., Mikolajczyk, K., & Matas, J. (2012). Tracking-learning-detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(7), 1409–1422.
Kapidis, G., Poppe, R., Van Dam, E., Noldus, L., & Veltkamp, R. (2019). Egocentric hand track and object-based human action recognition. In IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computing, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation.
Kiani Galoogahi, H., Fagg, A., & Lucey, S. (2017). Learning background-aware correlation filters for visual tracking. In CVPR.
Kristan, M., Leonardis, A., Matas, J., Felsberg, M., Pflugfelder, R., Kämäräinen, J. K., Danelljan, M., Zajc, L. C., Lukezic, A., Drbohlav, O., He, L., Zhang, Y., Yan, S., Yang, J., Fernández, G., et al. (2020). The eighth visual object tracking VOT2020 challenge results. In ECCVW.
Kristan, M., Leonardis, A., Matas, J., Felsberg, M., Pflugfelder, R., Zajc, L. C., Vojir, T., Hager, G., Lukezic, A., Eldesokey, A., Fernandez, G., et al. (2017). The visual object tracking VOT2017 challenge results. In ICCVW.
Kristan, M., Matas, J., Leonardis, A., Felsberg, M., Pflugfelder, R., Kämäräinen, J. K., Chang, H. J., Danelljan, M., Cehovin,
L., Lukezic, A., Drbohlav, O., Käpylä, J., Häger, G., Yan, S., Yang, J., Zhang, Z., & Fernández, G. (2021). The ninth visual object tracking VOT2021 challenge results. In ICCVW.
Kristan, M., Matas, J., Leonardis, A., Felsberg, M., Pflugfelder, R., Kämäräinen, J. K., Zajc, L., Drbohlav, O., Lukežič, A., Berg, A., Eldesokey, A., Käpylä, J., Fernández, G., et al. (2019). The seventh visual object tracking VOT2019 challenge results. In ICCVW.
Kristan, M., Matas, J., Leonardis, A., Vojíř, T., Pflugfelder, R., Fernández, G., et al. (2016). A novel performance evaluation methodology for single-target trackers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(11), 2137–2155.
Li, M., Wang, Y. X., & Ramanan, D. (2020). Towards streaming perception. In ECCV.
Li, Y., Liu, M., & Rehg, J. M. (2018). In the eye of beholder: Joint learning of gaze and actions in first person video. In ECCV.
Li, F., Tian, C., Zuo, W., Zhang, L., & Yang, M. H. (2018). Learning spatial-temporal regularized correlation filters for visual tracking. In CVPR.
Li, B., Wu, W., Wang, Q., Zhang, F., Xing, J., & Yan, J. (2019). SiamRPN++: Evolution of siamese visual tracking with very deep networks. In CVPR.
Liang, P., Blasch, E., & Ling, H. (2015). Encoding color information for visual tracking: Algorithms and benchmark. IEEE Transactions on Image Processing, 24(12), 5630–5644.
Li, A., Lin, M., Wu, Y., Yang, M. H., & Yan, S. (2016). NUS-PRO: A new visual tracking challenge. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(2), 335–349.
Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. In ECCV.
Liu, M., Tang, S., Li, Y., & Rehg, J. (2020). Forecasting human object interaction: Joint prediction of motor attention and actions in first person video. In ECCV.
Lu, X., Ma, C., Ni, B., Yang, X., Reid, I., & Yang, M. H. (2018). Deep regression tracking with shrinkage loss. In ECCV.
Luiten, J., Osep, A., Dendorfer, P., Torr, P., Geiger, A., Leal-Taixé, L., & Leibe, B. (2021). HOTA: A higher order metric for evaluating multi-object tracking. International Journal of Computer Vision, 129(2), 548–578.
Lukezic, A., Zajc, L. C., Vojir, T., Matas, J., & Kristan, M. (2020). Performance evaluation methodology for long-term single-object tracking. IEEE Transactions on Cybernetics.
Lukezic, A., Kart, U., Kapyla, J., Durmush, A., Kamarainen, J. K., Matas, J., & Kristan, M. (2019). CDTB: A color and depth visual object tracking dataset and benchmark. In ICCV.
Lukežič, A., Matas, J., & Kristan, M. (2020). D3S: A discriminative single shot segmentation tracker. In CVPR.
Ma, M., Fan, H., & Kitani, K. M. (2016). Going deeper into first-person activity recognition. In CVPR.
Maggio, E., & Cavallaro, A. (2011). Video tracking: Theory and practice. Wiley Publishing.
Maresca, M. E., & Petrosino, A. (2013). MATRIOSKA: A multi-level approach to fast tracking by learning. In ICIAP.
Mayer, C., Danelljan, M., Paudel, D. P., & Gool, L. V. (2021). Learning target candidate association to keep track of what not to track. In ICCV.
Mueller, F., Mehta, D., Sotnychenko, O., Sridhar, S., Casas, D., & Theobalt, C. (2017). Real-time hand tracking under occlusion from an egocentric RGB-D sensor. In ICCVW.
Mueller, M., Smith, N., & Ghanem, B. (2016). A benchmark and simulator for UAV tracking. In ECCV.
Müller, M., Bibi, A., Giancola, S., Alsubaihi, S., & Ghanem, B. (2018). TrackingNet: A large-scale dataset and benchmark for object tracking in the wild. In ECCV.
Nam, H., & Han, B. (2016). Learning multi-domain convolutional neural networks for visual tracking. In CVPR.
Nam, H., Hong, S., & Han, B. (2014). Online graph-based tracking. In ECCV.
Nigam, J., & Rameshan, R. M. (2017). EgoTracker: Pedestrian tracking with re-identification in egocentric videos. In CVPRW.
Park, E., & Berg, A. C. (2018). Meta-tracker: Fast and robust online adaptation for visual object trackers. In ECCV.
Pirsiavash, H., & Ramanan, D. (2012). Detecting activities of daily living in first-person camera views. In CVPR.
Ragusa, F., Furnari, A., Livatino, S., & Farinella, G. M. (2020). The MECCANO dataset: Understanding human-object interactions from egocentric videos in an industrial-like domain. In WACV.
Rai, A., Sener, F., & Yao, A. (2021). Transformed ROIs for capturing visual transformations in videos. arXiv.
Real, E., Shlens, J., Mazzocchi, S., Pan, X., & Vanhoucke, V. (2017). YouTube-BoundingBoxes: A large high-precision human-annotated data set for object detection in video. In CVPR.
Redmon, J., Divvala, S. K., Girshick, R. B., & Farhadi, A. (2016). You only look once: Unified, real-time object detection. In CVPR.
Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In NeurIPS.
Rodin, I., Furnari, A., Mavroedis, D., & Farinella, G. M. (2021). Predicting the future from first person (egocentric) vision: A survey. Computer Vision and Image Understanding, 211, 103252.
Ross, D. A., Lim, J., Lin, R. S., & Yang, M. H. (2008). Incremental learning for robust visual tracking. International Journal of Computer Vision, 77(1), 125–141.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., et al. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 211–252.
Sener, F., Singhania, D., & Yao, A. (2020). Temporal aggregate representations for long-range video understanding. In ECCV.
Shan, D., Geng, J., Shu, M., & Fouhey, D. F. (2020). Understanding human hands in contact at internet scale. In CVPR.
Smeulders, A. W. M., Chu, D. M., Cucchiara, R., Calderara, S., Dehghan, A., & Shah, M. (2014). Visual tracking: An experimental survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7), 1442–1468.
Song, Y., Ma, C., Wu, X., Gong, L., Bao, L., Zuo, W., Shen, C., Lau, R. W., & Yang, M. H. (2018). VITAL: Visual tracking via adversarial learning. In CVPR.
Sun, L., Klank, U., & Beetz, M. (2010). EYEWATCHME—3D hand and object tracking for inside out activity analysis. In CVPRW.
Valmadre, J., Bertinetto, L., Henriques, J. F., Tao, R., Vedaldi, A., Smeulders, A. W., Torr, P. H., & Gavves, E. (2018). Long-term tracking in the wild: A benchmark. In ECCV.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. In NeurIPS.
Visee, R. J., Likitlersuang, J., & Zariffa, J. (2020). An effective and efficient method for detecting hands in egocentric videos for rehabilitation applications. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 28(3), 748–755.
Wang, X., Wu, Y., Zhu, L., & Yang, Y. (2020). Symbiotic attention with privileged information for egocentric action recognition. In AAAI.
Wang, Q., Gao, J., Xing, J., Zhang, M., & Hu, W. (2017). DCFNet: Discriminant correlation filters network for visual tracking. arXiv.
Wang, Q., Zhang, L., Bertinetto, L., Hu, W., & Torr, P. H. S. (2019). Fast online object tracking and segmentation: A unifying approach. In CVPR.
Wang, N., Zhou, W., Tian, Q., Hong, R., Wang, M., & Li, H. (2018). Multi-cue correlation filters for robust visual tracking. In CVPR.
Wang, N., Zhou, W., Wang, J., & Li, H. (2021). Transformer meets tracker: Exploiting temporal context for robust visual tracking. In CVPR.
Wojke, N., Bewley, A., & Paulus, D. (2018). Simple online and realtime tracking with a deep association metric. In ICIP.
Wu, C. Y., Feichtenhofer, C., Fan, H., He, K., Krahenbuhl, P., & Girshick, R. (2019). Long-term feature banks for detailed video understanding. In CVPR.
Wu, Y., Lim, J., & Yang, M. H. (2013). Online object tracking: A benchmark. In CVPR.
Wu, Y., Lim, J., & Yang, M. H. (2015). Object tracking benchmark. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9), 1834–1848.
Xu, Y., Wang, Z., Li, Z., Yuan, Y., & Yu, G. (2020). SiamFC++: Towards robust and accurate visual tracking with target estimation guidelines. In AAAI.
Xu, N., Yang, L., Fan, Y., Yang, J., Yue, D., Liang, Y., Price, B., Cohen, S., & Huang, T. (2018). YouTube-VOS: Sequence-to-sequence video object segmentation. In ECCV.
Yan, B., Peng, H., Fu, J., Wang, D., & Lu, H. (2021). Learning spatio-temporal transformer for visual tracking. In ICCV.
Yan, B., Peng, H., Wu, K., Wang, D., Fu, J., & Lu, H. (2021). LightTrack: Finding lightweight neural networks for object tracking via one-shot architecture search. In CVPR.
Yan, B., Zhao, H., Wang, D., Lu, H., & Yang, X. (2019). 'Skimming-perusal' tracking: A framework for real-time and robust long-term tracking. In ICCV.
Yun, S., Choi, J., Yoo, Y., Yun, K., & Choi, J. Y. (2017). Action-decision networks for visual tracking with deep reinforcement learning. In CVPR.
Zhang, L., Gonzalez-Garcia, A., Weijer, J. V. D., Danelljan, M., & Khan, F. S. (2019). Learning the model update for siamese trackers. In ICCV.
Zhang, J., Ma, S., & Sclaroff, S. (2014). MEEM: Robust tracking via multiple experts using entropy minimization. In ECCV.
Zhang, Z., Peng, H., Fu, J., Li, B., & Hu, W. (2020). Ocean: Object-aware anchor-free tracking. In ECCV.
Zhang, Z., & Peng, H. (2019). Deeper and wider siamese networks for real-time visual tracking. In CVPR.

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.