Visual Object Tracking
Instructor: Seunghoon Hong
Visual object tracking
Objective: locating the object(s) over time in a video
[Figure: visual tracking example, from the target marked in the initial frame to tracking over the following frames]
Formal definition: given an object state at the initial frame, $z_0 = (x_0, y_0, w_0, h_0)$,
identify $z_{1:T} = \{z_1, z_2, \dots, z_T\}$ over a video of length T.
From a learning perspective:
● Classification problem with a single object class (= target vs. distractors)
● Labeled data is given only at the initial frame
● Optionally requires online learning to adapt to the variations in a video
● Online learning is driven by self-supervision (training data = tracking results)
Two sub-categories:
● Single target tracking
○ Tracking only one object in a video
○ Single-class classification (target vs. distractors)
● Multi target tracking
○ Tracking multiple objects in a video
○ Multi-class classification (target 1 vs. target 2 vs. target 3 vs. … vs. distractors)
Approaches in single object tracking
● Probabilistic tracking
○ Formulate the localization task as a sequential probabilistic inference problem
○ Given a probability of the initial target location, propagate it over the remaining frames
● Discriminative tracking
○ Separate the object from the distractors at every frame
○ Can be viewed as sequential binary object detection (classes: target, background)
Probabilistic tracking
● Tracking as a Bayesian network
Bayes rule: $p(z \mid x) = \frac{p(x \mid z)\, p(z)}{p(x)} \propto p(x \mid z)\, p(z)$
z: object location (state)
x: frame (observation)
Posterior $p(z \mid x)$: the probability of the object state given an observation
Likelihood $p(x \mid z)$: the measurement of how well the observation coincides with the given state
Prior $p(z)$: the belief of the object state without observation
● What each term answers, given a target template:
○ Prior: where is the target likely to exist?
○ Likelihood: which region of the image looks similar to the target?
○ Posterior: where is the object in this frame?
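A minimal sketch of Bayes rule on a discretized state space, assuming the object can only sit at one of five positions along a line. The prior and likelihood values are made up for illustration and are not taken from the lecture.

```python
import numpy as np

# Hypothetical 1-D state space: five candidate object positions.
positions = np.arange(5)

# Prior p(z): belief about the object position before seeing the frame.
prior = np.array([0.1, 0.2, 0.4, 0.2, 0.1])

# Likelihood p(x | z): how well the frame matches the target template
# at each candidate position (illustrative numbers).
likelihood = np.array([0.05, 0.1, 0.2, 0.6, 0.05])

# Posterior p(z | x) is proportional to likelihood * prior;
# dividing by the sum plays the role of p(x).
unnormalized = likelihood * prior
posterior = unnormalized / unnormalized.sum()

# Maximum a posteriori estimate of the object position.
z_map = positions[np.argmax(posterior)]
```

Here the prior prefers position 2, but the stronger likelihood at position 3 moves the posterior maximum to position 3.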
Probabilistic tracking
● Tracking as a Bayesian network
Bayes rule → sequential Bayesian filtering
$z_{1:T}$: object locations in frames 1 to T
$x_{1:T}$: frames 1 to T
Probabilistic tracking
● Hidden Markov Model
● Markovian assumption
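A common way to write the two assumptions of a first-order hidden Markov model in the notation above (z for states, x for observations). This is the standard textbook form, given as a sketch; the lecture's exact equations may differ.

```latex
\begin{align}
p(z_t \mid z_{1:t-1}, x_{1:t-1}) &= p(z_t \mid z_{t-1})
  && \text{(the state depends only on the previous state)} \\
p(x_t \mid z_{1:t}, x_{1:t-1}) &= p(x_t \mid z_t)
  && \text{(the observation depends only on the current state)}
\end{align}
```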
Probabilistic tracking
● Sequential Bayesian filtering
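$p(z_t \mid x_{1:t}) \propto p(x_t \mid z_t)\, p(z_t \mid x_{1:t-1})$ (likelihood × prior)
$= p(x_t \mid z_t) \int p(z_t \mid z_{t-1})\, p(z_{t-1} \mid x_{1:t-1})\, dz_{t-1}$ (likelihood × transition model × posterior up to the previous frame)
The prior requires an integration over all object locations!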
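A minimal sketch of one filtering step on a small discrete state space, where the integration over all object locations becomes a matrix-vector product. The five-cell state space, the motion model, and the likelihood values are assumptions chosen for illustration.

```python
import numpy as np

def bayes_filter_step(prev_posterior, transition, likelihood):
    """One step of sequential Bayesian filtering on a discrete state space.

    transition[i, j] = p(z_t = i | z_{t-1} = j)
    """
    # Prior: sum over all previous locations (the discrete integral).
    prior = transition @ prev_posterior
    # Posterior is proportional to likelihood * prior.
    unnormalized = likelihood * prior
    return unnormalized / unnormalized.sum()

n = 5
eye = np.eye(n)
# Hypothetical transition model: stay with probability 0.6, move one cell
# left or right with probability 0.2 each (with wrap-around at the borders).
transition = 0.6 * eye + 0.2 * np.roll(eye, 1, axis=0) + 0.2 * np.roll(eye, -1, axis=0)

prev_posterior = np.array([0.0, 1.0, 0.0, 0.0, 0.0])  # object was at cell 1
likelihood = np.array([0.1, 0.2, 0.9, 0.1, 0.1])      # frame t looks most like cell 2

posterior = bayes_filter_step(prev_posterior, transition, likelihood)
```

On a real image the state space is far too large for this exact sum, which is what motivates a sampling-based approximation.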
Probabilistic tracking
● Approximation by Monte Carlo sampling: the integral over all object locations is approximated by a sum over a finite set of sampled states
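One common form of this approximation is the bootstrap particle filter, written here as a sketch in the notation above. The number of samples N, the sample index (i), and the weights w are symbols introduced for illustration, and the choice of the transition model as proposal is an assumption rather than something stated in the lecture.

```latex
p(z_t \mid x_{1:t}) \approx \sum_{i=1}^{N} w_t^{(i)}\, \delta\big(z_t - z_t^{(i)}\big),
\qquad
z_t^{(i)} \sim p\big(z_t \mid z_{t-1}^{(i)}\big),
\qquad
w_t^{(i)} \propto p\big(x_t \mid z_t^{(i)}\big),
\qquad
\sum_{i=1}^{N} w_t^{(i)} = 1
```

Here each $z_{t-1}^{(i)}$ is assumed to be drawn from the posterior of the previous frame.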
Probabilistic tracking
● Particle filtering (sequential Monte Carlo)
○ Approximate the prior distribution with a set of samples (particles) drawn from the previous posterior and moved by the transition model
Probabilistic tracking pipeline
From frame t-1 to frame t:
1. Extract samples proportional to the previous posterior
2. Move samples by the transition model
3. Re-evaluate the likelihood using the appearance model
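The three steps can be sketched as a single function. This is a toy version under stated assumptions: states are 2-D positions only, the transition model is a Gaussian random walk, and `toy_likelihood` is a hypothetical stand-in for an appearance model that peaks at a known target position.

```python
import numpy as np

def particle_filter_step(particles, weights, likelihood_fn, motion_std, rng):
    """One resample-move-reweight step for particles of shape (N, 2)."""
    n = len(particles)
    # 1. Extract samples proportional to the previous posterior (resampling).
    idx = rng.choice(n, size=n, p=weights)
    # 2. Move samples by the transition model (here: Gaussian random walk).
    moved = particles[idx] + rng.normal(scale=motion_std, size=particles.shape)
    # 3. Re-evaluate the likelihood using the appearance model, then normalize.
    new_weights = likelihood_fn(moved)
    new_weights = new_weights / new_weights.sum()
    return moved, new_weights

rng = np.random.default_rng(0)
target = np.array([50.0, 30.0])  # hypothetical true target center (x, y)

def toy_likelihood(states, sigma=3.0):
    # Stand-in appearance model: likelihood decays with distance to the target.
    sq_dist = np.sum((states - target) ** 2, axis=1)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

n_particles = 500
particles = np.array([48.0, 28.0]) + rng.normal(scale=2.0, size=(n_particles, 2))
weights = np.full(n_particles, 1.0 / n_particles)

for _ in range(5):
    particles, weights = particle_filter_step(
        particles, weights, toy_likelihood, motion_std=2.0, rng=rng
    )

# Weighted mean of the particles: an estimate of the target location.
estimate = weights @ particles
```

In an actual tracker the likelihood would compare the image patch at each particle with the target appearance model instead of using the known target position.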
Probabilistic tracking pipeline
Tracking procedure (simplified), at frame t:
1. Sample target states near the previous target location
2. Evaluate the likelihood based on the appearance model
3. Select the most probable sample as the target at the current frame
4. Update the target appearance model using the current tracking results
[Figure: example target appearance model]
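The four steps can be sketched with a template as the appearance model. Several simplifications are assumed here: the state is only the top-left corner (x, y) of a fixed-size box, the "samples" are a dense grid of integer offsets (a deterministic stand-in for random sampling), the likelihood is a decreasing function of the mean squared difference to the template, and the update is a running average with a hypothetical rate `update_rate`.

```python
import numpy as np

def track_one_frame(frame, template, prev_xy, radius=4, update_rate=0.1):
    """Locate the template in `frame` near `prev_xy` and update the template."""
    h, w = template.shape
    H, W = frame.shape
    # 1. Candidate states near the previous target location.
    offsets = np.arange(-radius, radius + 1)
    candidates = [
        (int(np.clip(prev_xy[0] + dx, 0, W - w)),
         int(np.clip(prev_xy[1] + dy, 0, H - h)))
        for dy in offsets for dx in offsets
    ]
    # 2. Likelihood of each candidate under the appearance model.
    patches = [frame[y:y + h, x:x + w] for x, y in candidates]
    scores = [np.exp(-np.mean((p - template) ** 2)) for p in patches]
    # 3. The most probable candidate is the target at the current frame.
    best = int(np.argmax(scores))
    # 4. Update the appearance model with the current tracking result.
    new_template = (1.0 - update_rate) * template + update_rate * patches[best]
    return candidates[best], new_template

# Synthetic frame: random background with the template pasted in.
rng = np.random.default_rng(0)
template = rng.uniform(size=(8, 8))
frame = rng.uniform(size=(40, 40))
frame[12:20, 17:25] = template  # target top-left corner at (x, y) = (17, 12)

location, template_updated = track_one_frame(frame, template, prev_xy=(15, 10))
```

Step 4 is where self-supervision enters: if step 3 picks a wrong patch, that patch is blended into the template and the error carries over to later frames.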
Discriminative tracking pipeline
Quick overview: learning tracking-by-detection
● Objective: a ridge regression
$\min_w \sum_i (w^\top x_i - y_i)^2 + \lambda \lVert w \rVert^2$
○ $w$: model parameters
○ $x_i$: training data
○ $y_i$: training labels
How do we solve it?
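Ridge regression has a closed-form solution. Assuming the objective is written as $\min_w \lVert Xw - y \rVert^2 + \lambda \lVert w \rVert^2$, with one training sample per row of $X$, setting the gradient to zero gives $w = (X^\top X + \lambda I)^{-1} X^\top y$. A minimal sketch on synthetic data (the data and the value of `lam` are illustrative):

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve min_w ||X w - y||^2 + lam * ||w||^2 in closed form."""
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)
    # Solving the linear system is preferable to forming an explicit inverse.
    return np.linalg.solve(A, X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))                    # 50 samples, 5 features
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])   # hypothetical ground truth
y = X @ w_true                                  # noise-free labels

w = ridge_closed_form(X, y, lam=1e-3)
```

The linear solve costs on the order of the cube of the feature dimension, which becomes expensive when the classifier is retrained at every frame.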
We must update this classifier at every frame
(i.e. every time we perform tracking and
obtain new positive/negative samples)
Can we make it faster?
Correlation filtering
● We can make it extremely fast for certain positive/negative sets!
○ Base sample: the tracking result
○ Negative samples: translated versions of the base sample (e.g. shifted by +30, +15, -15, -30)
Correlation filtering
● Representing positive/negative images using circulant matrices
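○ Consider the base sample x as an n-dimensional array: $x = [x_1, x_2, \dots, x_n]$
○ Circulant matrix: the rows of $X = C(x)$ are all cyclic shifts of x
$$X = C(x) = \begin{bmatrix} x_1 & x_2 & \cdots & x_n \\ x_n & x_1 & \cdots & x_{n-1} \\ \vdots & \vdots & \ddots & \vdots \\ x_2 & x_3 & \cdots & x_1 \end{bmatrix}$$
○ First row: positive sample (the base sample)
○ Remaining rows: negative samples (translated samples)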
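A minimal sketch of building the circulant data matrix from a 1-D base sample, assuming row i holds the base sample cyclically shifted by i positions. The helper name `circulant_from_base` is introduced here for illustration.

```python
import numpy as np

def circulant_from_base(x):
    """Stack all cyclic shifts of the base sample x as rows of a matrix."""
    n = len(x)
    # Row 0 is the base (positive) sample; rows 1..n-1 are shifted copies.
    return np.stack([np.roll(x, i) for i in range(n)])

x = np.array([1.0, 2.0, 3.0, 4.0])
X = circulant_from_base(x)
# X is:
# [[1. 2. 3. 4.]
#  [4. 1. 2. 3.]
#  [3. 4. 1. 2.]
#  [2. 3. 4. 1.]]
```

Every training sample is thus determined by the single base sample, so the matrix never has to be stored explicitly.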
Correlation filtering
● Any circulant matrix can be made diagonal by the Discrete Fourier Transform (DFT)
$X = F\,\mathrm{diag}(\hat{x})\,F^H$
○ $F$: DFT matrix (constant, independent of x)
○ $\hat{x}$: DFT of the base sample
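The diagonalization can be checked numerically. This sketch assumes the rows of X are cyclic shifts of x produced by `np.roll`, that F is the DFT matrix scaled by $1/\sqrt{n}$ so that it is unitary, and that $\hat{x}$ is the unnormalized DFT returned by `np.fft.fft`.

```python
import numpy as np

n = 8
rng = np.random.default_rng(0)
x = rng.normal(size=n)

# Circulant matrix whose rows are the cyclic shifts of the base sample.
X = np.stack([np.roll(x, i) for i in range(n)])

# Unitary DFT matrix: the FFT of the identity, scaled by 1/sqrt(n).
F = np.fft.fft(np.eye(n)) / np.sqrt(n)

# DFT of the base sample.
x_hat = np.fft.fft(x)

# Rebuild X from its diagonal form: X = F diag(x_hat) F^H.
X_rebuilt = F @ np.diag(x_hat) @ F.conj().T
```

`X_rebuilt` agrees with `X` up to floating-point error, and F does not depend on x, so all information about the training set is carried by `x_hat`.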
Correlation filtering
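● Putting it all together
○ Circulant matrix: $X = F\,\mathrm{diag}(\hat{x})\,F^H$
○ Matrix inner product: $X^H X = F\,\mathrm{diag}(\hat{x}^* \odot \hat{x})\,F^H$
○ Plug into ridge regression: the solution becomes element-wise in the Fourier domain, so no matrix inversion is needed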
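A sketch comparing the direct ridge regression solution with the Fourier-domain one. With rows of X built by `np.roll` and NumPy's forward DFT, the element-wise solution works out to $\hat{w} = \hat{x} \odot \hat{y} / (\hat{x}^* \odot \hat{x} + \lambda)$; other shift or DFT conventions move the complex conjugate, so treat the exact placement as convention-dependent. The Gaussian-shaped labels `y` and the value of `lam` are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam = 16, 0.1
x = rng.normal(size=n)  # base sample

# Regression targets: high for zero shift, decaying for larger cyclic shifts.
shifts = np.minimum(np.arange(n), n - np.arange(n))
y = np.exp(-0.5 * (shifts / 2.0) ** 2)

# Direct solution: build the circulant matrix and solve the linear system.
X = np.stack([np.roll(x, i) for i in range(n)])
w_direct = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

# Fast solution: element-wise division in the Fourier domain.
x_hat = np.fft.fft(x)
y_hat = np.fft.fft(y)
w_hat = x_hat * y_hat / (np.abs(x_hat) ** 2 + lam)
w_fast = np.real(np.fft.ifft(w_hat))
```

Both routes give the same filter, but the fast one needs only FFTs and element-wise operations, O(n log n) instead of O(n^3).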
Kernelized Correlation filtering
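● Easy to extend to a kernelized version
○ Ridge regression: $w = (X^\top X + \lambda I)^{-1} X^\top y$
○ Ridge regression with a kernel: $\alpha = (K + \lambda I)^{-1} y$
● The computation is fast if the kernel matrix K is circulant
● Fortunately, it has been shown that most useful kernels yield a circulant K for cyclically shifted samples [1]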
[1] Henriques et al., High-Speed Tracking with Kernelized Correlation Filters, In TPAMI, 2015
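A sketch of the kernelized case with a Gaussian kernel on a 1-D base sample. When K is circulant, the dual solution reduces to $\hat{\alpha} = \hat{y} / (\hat{k}^{xx} + \lambda)$ as in [1], where $k^{xx}$ holds the kernel values between x and each of its cyclic shifts. The bandwidth `sigma`, the labels `y`, and the helper name are assumptions made for illustration.

```python
import numpy as np

def gaussian_kernel_autocorrelation(x, sigma):
    """Kernel values between x and all of its cyclic shifts, via the FFT."""
    x_hat = np.fft.fft(x)
    # Circular autocorrelation: c[m] = <x, shift(x, m)>.
    c = np.real(np.fft.ifft(x_hat * np.conj(x_hat)))
    # ||x - shift(x, m)||^2 = 2 ||x||^2 - 2 c[m]; clip tiny negatives.
    sq_dist = np.maximum(2.0 * np.dot(x, x) - 2.0 * c, 0.0)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
n, lam, sigma = 16, 0.1, 2.0
x = rng.normal(size=n)
shifts = np.minimum(np.arange(n), n - np.arange(n))
y = np.exp(-0.5 * (shifts / 2.0) ** 2)

# Fast solution: element-wise division in the Fourier domain.
k = gaussian_kernel_autocorrelation(x, sigma)
alpha_fast = np.real(np.fft.ifft(np.fft.fft(y) / (np.fft.fft(k) + lam)))

# Direct solution for comparison: full kernel matrix over all shifted samples.
samples = np.stack([np.roll(x, i) for i in range(n)])
sq = ((samples[:, None, :] - samples[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-sq / (2.0 * sigma ** 2))
alpha_direct = np.linalg.solve(K + lam * np.eye(n), y)
```

The first row of `K` equals `k`, which is what makes the kernel matrix circulant here: the Gaussian kernel depends only on the distance between samples, and shifting both samples by the same amount leaves that distance unchanged.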
Challenges
● Modeling severe appearance variations in a video
figure credit: Li et al., A survey of appearance models in visual object tracking
Modeling appearance for tracking
● Classic: hand-designed features
○ Color histogram
○ Intensity
○ Object Templates
○ Key-points (SIFT)
○ …
● Issue
○ All prone to overfitting
○ Cannot generalize to various appearances
Integrating CNN for appearance modeling
● Benefits
○ Features from a pre-trained CNN can be robust against various appearance changes
○ Especially useful in tracking since we have only one target ground-truth in the initial frame
CNN-based tracking
● CNNTrack: direct application of CNN features for tracking
Discussions
● Limitations?
Better representation learning with videos
● MDNet: learn representation for tracking with a large amount of videos
Challenges in visual object tracking
● Temporal drift (i.e. error propagation through time)
○ Drift in posterior estimation: the error in the posterior propagates through time
○ Drift in appearance model: if the appearance model is updated during a temporary failure, the error will propagate
● But why is it so prone to temporal drift?
Summary: Visual tracking
● Object localization in a video
● Probabilistic vs. discriminative tracking
● Modeling target appearance is important
○ Essential to evaluate the affinity of samples in both tracking frameworks
○ Should be able to handle a wide range of appearance variations
○ Should be able to generalize well from a single ground-truth at the initial frame
● CNN for visual tracking
○ Applying a pre-trained CNN for feature extraction
○ Training CNN with many heterogeneous videos for tracking