Object detection
Outline
• Task definition and evaluation
• Two-stage detectors:
• R-CNN
• Fast R-CNN
• Faster R-CNN
• Single-stage and multi-resolution detectors
• Recent trends
Object detection evaluation
• At test time, predict bounding boxes, class labels, and confidence
scores
• For each detection, determine whether it is a true or false positive
• PASCAL criterion: Area(GT ∩ Det) / Area(GT ∪ Det) > 0.5
• For multiple detections of the same ground truth box, only one is
considered a true positive
[Figure: example image with ground truth (GT) boxes labeled dog and cat, and detections labeled dog: 0.6, dog: 0.55, and cat: 0.8]
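A minimal sketch of the PASCAL criterion and the one-true-positive-per-ground-truth rule, assuming boxes are given as (x1, y1, x2, y2) corners. The function names `iou` and `label_detections` are illustrative, not part of any benchmark toolkit.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_detections(detections, gt_boxes, thresh=0.5):
    """Label detections of ONE class as true or false positives.

    detections: list of (score, box); gt_boxes: list of boxes.
    Returns (score, is_true_positive) pairs, highest score first.
    """
    claimed = [False] * len(gt_boxes)  # each GT box can be matched only once
    labeled = []
    for score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        overlaps = [iou(box, g) for g in gt_boxes]
        best = max(range(len(gt_boxes)), key=lambda k: overlaps[k], default=None)
        is_tp = best is not None and overlaps[best] > thresh and not claimed[best]
        if is_tp:
            claimed[best] = True
        labeled.append((score, is_tp))
    return labeled
```

With one ground truth dog box and two overlapping dog detections, the higher-scoring detection is the true positive and the other counts as a false positive.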
Object detection evaluation
• For each class, sort detections from highest to lowest confidence,
plot Recall-Precision curve and compute Average Precision
(area under the curve)
• Take mean of AP over classes to get mAP
Precision: true positive detections / total detections
Recall: true positive detections / total positive test instances
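A sketch of the AP computation for one class, assuming each detection has already been labeled as a true or false positive. The curve is made monotone before integrating, which is one common convention and an assumption here; benchmark toolkits differ in the exact interpolation. `average_precision` and `mean_ap` are illustrative names.

```python
def average_precision(labeled, num_gt):
    """Area under the precision-recall curve for one class.

    labeled: list of (score, is_true_positive) pairs.
    num_gt: number of ground truth instances of the class (assumed > 0).
    """
    labeled = sorted(labeled, key=lambda d: d[0], reverse=True)
    tp = 0
    precisions, recalls = [], []
    for rank, (_, is_tp) in enumerate(labeled, start=1):
        tp += int(is_tp)
        precisions.append(tp / rank)   # true positives / detections so far
        recalls.append(tp / num_gt)    # true positives / all positives
    # Assumed convention: make precision non-increasing in recall.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    # Sum rectangle areas between consecutive recall values.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_ap(ap_per_class):
    """mAP: mean of per-class AP values, given as a dict class -> AP."""
    return sum(ap_per_class.values()) / len(ap_per_class)
```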
PASCAL VOC Challenge (2005-2012)
• 20 challenge classes:
• Person
• Animals: bird, cat, cow, dog, horse, sheep
• Vehicles: airplane, bicycle, boat, bus, car, motorbike, train
• Indoor: bottle, chair, dining table, potted plant, sofa, tv/monitor
• Dataset size (by 2012): 11.5K training/validation images,
27K bounding boxes, 7K segmentations
http://host.robots.ox.ac.uk/pascal/VOC/
Progress on PASCAL detection
[Figure: PASCAL VOC detection progress plot, with regions labeled "Before CNNs" and "After CNNs"]
More recent benchmark: COCO
http://cocodataset.org/#home
COCO dataset: Tasks
[Figure panels: image classification, object detection, semantic segmentation, instance segmentation]
• Also: keypoint prediction, captioning, question answering…
COCO detection metrics
• Leaderboard: http://cocodataset.org/#detection-leaderboard
• Not updated since 2020
Object detection: Outline
• Task definition and evaluation
• Two-stage detectors
[Figure: a proposal generation step produces region proposals from the input image]
R-CNN: Region proposals + CNN features
1. Input image
2. Region proposals
3. Warped image regions
4. Forward each region through ConvNet
5. Classify regions with SVMs
Source: R. Girshick
R. Girshick, J. Donahue, T. Darrell, and J. Malik, Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, CVPR 2014
R-CNN details
• Regions: ~2000 Selective Search proposals
• Network: AlexNet pre-trained on ImageNet (1000 classes), fine-tuned
on PASCAL (21 classes)
• Final detector: warp proposal regions, extract fc7 network activations
(4096 dimensions), classify with linear SVM
• Bounding box regression to refine box locations
• Performance: mAP of 53.7% on PASCAL 2010
(vs. 35.1% for Selective Search and 33.4% for Deformable Part Models)
R-CNN pros and cons
• Pros
• Much more accurate than previous approaches!
• Any deep architecture can immediately be “plugged in”
• Cons
• Not a single end-to-end system
• Fine-tune network with softmax classifier (log loss)
• Train post-hoc linear SVMs (hinge loss)
• Train post-hoc bounding-box regressions (least squares)
• Training was slow (84h), took up a lot of storage
• 2000 CNN passes per image
• Inference (detection) was slow (47s / image with VGG16)
Fast R-CNN
1. Forward whole image through ConvNet
2. Map region proposals onto the conv5 feature map of the image
3. RoI pooling layer
4. Fully-connected layers (FCs)
5. Two outputs: Linear + softmax (softmax classifier) and Linear (bounding-box regressors)
Source: R. Girshick
R. Girshick, Fast R-CNN, ICCV 2015
RoI pooling
• “Crop and resample” a fixed-size feature representing a
region of interest out of the outputs of the last conv layer
• Use nearest-neighbor interpolation of coordinates, max pooling
[Figure: a region of interest (RoI) on the conv feature map passes through the RoI pooling layer to give a fixed-size RoI feature, which feeds the FC layers]
Source: R. Girshick, K. He
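A minimal single-channel sketch of the crop-and-max-pool step, assuming the RoI has already been rounded to integer feature-map coordinates (the nearest-neighbor step) and lies inside the map. The bin-edge rule (floor for the start, ceiling for the end) is one reasonable choice and is an assumption; `roi_pool` is an illustrative name.

```python
def roi_pool(feature, roi, out_h=2, out_w=2):
    """Max-pool one RoI of a 2D feature map into an out_h x out_w grid.

    feature: list of rows (one channel).
    roi: (x1, y1, x2, y2), inclusive integer feature-map coordinates.
    Returns (pooled, argmax); argmax stores the (row, col) of each max.
    """
    x1, y1, x2, y2 = roi
    h, w = y2 - y1 + 1, x2 - x1 + 1
    pooled, argmax = [], []
    for i in range(out_h):
        # Bin rows: floor of the start edge, ceiling of the end edge.
        r0 = y1 + (i * h) // out_h
        r1 = y1 + ((i + 1) * h + out_h - 1) // out_h
        row_vals, row_idx = [], []
        for j in range(out_w):
            c0 = x1 + (j * w) // out_w
            c1 = x1 + ((j + 1) * w + out_w - 1) // out_w
            cells = [(r, c) for r in range(r0, r1) for c in range(c0, c1)]
            best = max(cells, key=lambda rc: feature[rc[0]][rc[1]])
            row_vals.append(feature[best[0]][best[1]])
            row_idx.append(best)
        pooled.append(row_vals)
        argmax.append(row_idx)
    return pooled, argmax
```

Whatever the RoI size, the output has the same fixed shape, which is what lets the FC layers follow.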
RoI pooling illustration
Prediction
• For each RoI, the network predicts probabilities for 𝐶 + 1 classes
(class 0 is background) and four bounding box offsets for each of the 𝐶
object classes
R. Girshick, Fast R-CNN, ICCV 2015
Fast R-CNN training
[Figure: the trainable ConvNet and FCs feed two branches, Linear + softmax (log loss) and Linear (smooth L1 loss), combined into a multi-task loss]
Source: R. Girshick
R. Girshick, Fast R-CNN, ICCV 2015
Multi-task loss
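• Loss for ground truth class $y$, predicted class probabilities $P(y)$, ground
truth box $b$, and predicted box $\hat{b}$:
$$L(y, P, b, \hat{b}) = -\log P(y) + \lambda \, \mathbb{I}[y \ge 1] \, L_{\mathrm{reg}}(b, \hat{b})$$
(first term: softmax loss; second term: regression loss)
• Regression loss: smooth $L_1$ loss on top of log space offsets relative to
proposal
$$L_{\mathrm{reg}}(b, \hat{b}) = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}(b_i - \hat{b}_i)$$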
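A small sketch of the multi-task loss for a single RoI. The smooth L1 definition (quadratic below 1, linear above) follows the Fast R-CNN paper and is not spelled out on the slide; the default `lam=1.0` and the function names are assumptions for illustration.

```python
import math


def smooth_l1(d):
    """Smooth L1: quadratic for |d| < 1, linear otherwise."""
    return 0.5 * d * d if abs(d) < 1.0 else abs(d) - 0.5


def multitask_loss(y, probs, b, b_hat, lam=1.0):
    """Loss for one RoI.

    y: ground truth class index (0 is background).
    probs: predicted class probabilities, indexed by class.
    b, b_hat: ground truth and predicted offsets (x, y, w, h).
    """
    cls_loss = -math.log(probs[y])
    # The indicator [y >= 1] switches off box regression for background RoIs.
    reg_loss = 0.0
    if y >= 1:
        reg_loss = sum(smooth_l1(bi - bhi) for bi, bhi in zip(b, b_hat))
    return cls_loss + lam * reg_loss
```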
Bounding box regression
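[Figure: bounding box regression]
• Region proposal (a.k.a. default box, prior, reference, anchor): the box that
offsets are measured from
• Ground truth box: defines the target offset to predict*
• Predicted offset: applied to the region proposal to give the predicted box
• Loss: compares the predicted offset with the target offset
*Typically in transformed, normalized coordinates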
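One common choice of transformed, normalized coordinates is the R-CNN-style parameterization: center shifts scaled by the proposal size, and log ratios for width and height. The sketch below assumes boxes are given as (center x, center y, width, height); `encode` and `decode` are illustrative names.

```python
import math


def encode(proposal, gt):
    """Target offsets that map a proposal onto a ground truth box."""
    px, py, pw, ph = proposal
    gx, gy, gw, gh = gt
    return ((gx - px) / pw,        # shift in x, in units of proposal width
            (gy - py) / ph,        # shift in y, in units of proposal height
            math.log(gw / pw),     # log-space width ratio
            math.log(gh / ph))     # log-space height ratio


def decode(proposal, offsets):
    """Apply (predicted) offsets to a proposal to get a box."""
    px, py, pw, ph = proposal
    tx, ty, tw, th = offsets
    return (px + tx * pw, py + ty * ph, pw * math.exp(tw), ph * math.exp(th))
```

Decoding the encoded target recovers the ground truth box, so a perfect offset prediction gives a perfect box.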
ROI pooling: Backpropagation
• Similar to max pooling, but has to take into account overlap of
pooling regions
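[Figure: RoI pooling applied to two regions $r_1$ and $r_2$ on the feature map; input $x_{33}$ lies in both and feeds outputs $z_{1,4}$ (from $r_1$) and $z_{2,1}$ (from $r_2$)]
Source: Ross Girshick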
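Max pooling "switch" (argmax back-pointer): $i^*(1,4) = 33$ and $i^*(2,1) = 33$, i.e. $x_{33}$ is the max for both $z_{1,4}$ in $r_1$ and $z_{2,1}$ in $r_2$.
Backward pass: have $\partial e / \partial z$, want $\partial e / \partial x$:
$$\frac{\partial e}{\partial x_i} = \sum_r \sum_j \frac{\partial e}{\partial z_{rj}} \frac{\partial z_{rj}}{\partial x_i} = \sum_r \sum_j \mathbb{I}[i = i^*(r, j)] \, \frac{\partial e}{\partial z_{rj}}$$
The sums run over regions $r$ and RoI indices $j$; the indicator is 1 if $r, j$ "pooled" input $i$, and 0 otherwise.
Source: Ross Girshick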
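A sketch of the backward pass, assuming the forward pass stored an argmax back-pointer for every output bin. Gradients from all RoIs and bins that selected the same input are summed. The data layout and the name `roi_pool_backward` are assumptions for illustration.

```python
def roi_pool_backward(feature_shape, argmaxes, grad_outputs):
    """Accumulate de/dx from de/dz over all regions r and bins j.

    feature_shape: (rows, cols) of the feature map.
    argmaxes[r][j]: (row, col) index i*(r, j) recorded in the forward pass.
    grad_outputs[r][j]: de/dz_rj for the same bin.
    """
    rows, cols = feature_shape
    grad_x = [[0.0] * cols for _ in range(rows)]
    for idx_r, grad_r in zip(argmaxes, grad_outputs):
        for (row, col), g in zip(idx_r, grad_r):
            # An input picked by several bins (overlapping RoIs) adds them up.
            grad_x[row][col] += g
    return grad_x
```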
Mini-batch sampling
• Sample a few images (e.g., 2)
• Sample many regions from each image (64)
[Figure: sample images, then sample regions from each image to form the SGD mini-batch]
Source: R. Girshick, K. He
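A sketch of the hierarchical sampling, assuming a dict that maps each image id to its list of RoIs. It leaves out any foreground/background balancing, and `sample_minibatch` is an illustrative name. Sampling many regions from few images lets the RoIs of one image share a single ConvNet forward pass.

```python
import random


def sample_minibatch(rois_by_image, num_images=2, rois_per_image=64, seed=None):
    """Pick a few images, then many RoIs from each.

    rois_by_image: dict mapping image id -> list of RoIs.
    Returns a list of (image_id, roi) pairs.
    """
    rng = random.Random(seed)
    image_ids = rng.sample(sorted(rois_by_image), num_images)
    batch = []
    for image_id in image_ids:
        rois = rois_by_image[image_id]
        chosen = rng.sample(rois, min(rois_per_image, len(rois)))
        batch.extend((image_id, roi) for roi in chosen)
    return batch
```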
Fast R-CNN results
                     Fast R-CNN   R-CNN
Train time (h)       9.5          84
  Train speedup      8.8x
Test time / image    0.32s        47.0s
  Test speedup       146x
mAP                  66.9%        66.0% (vs. 53.7% for AlexNet)
Timings exclude object proposal time, which is equal for all methods.
All methods use VGG16.
Source: R. Girshick, K. He
Faster R-CNN
[Figure: a CNN computes a feature map of the image; the Region Proposal Network produces region proposals from that feature map, so the proposal and detection stages share features]
S. Ren, K. He, R. Girshick, and J. Sun, Faster R-CNN: Towards Real-Time Object Detection with
Region Proposal Networks, NIPS 2015
Region proposal network (RPN)
• Idea: put an “anchor box” of fixed size over each position in
the feature map and try to predict whether this box is likely to
contain an object
[Figure: an anchor box over one feature-map position, with the question "Anchor is an object?"]
Figure source: J. Johnson
• Introduce anchor boxes at multiple scales and aspect ratios
to handle a wider range of object sizes and shapes
[Figure: a conv layer over the feature map outputs an "Anchor is object?" score for each of several anchors per position]
Figure source: J. Johnson
Faster R-CNN RPN design
• Slide a small window (3x3) over the conv5 layer
• Predict object/no object
• Regress bounding box coordinates with reference to anchors
(3 scales x 3 aspect ratios)
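A sketch of anchor generation for 3 scales x 3 aspect ratios at every feature-map position. The stride of 16 pixels and the specific scale and ratio values are illustrative defaults and are assumptions, as is the name `make_anchors`.

```python
def make_anchors(feat_h, feat_w, stride=16,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return anchors as (x1, y1, x2, y2) in image pixels.

    Every feature-map position gets len(scales) * len(ratios) anchors
    centered on it; ratio means height / width, with area scale * scale.
    """
    anchors = []
    for row in range(feat_h):
        for col in range(feat_w):
            cx, cy = (col + 0.5) * stride, (row + 0.5) * stride
            for s in scales:
                for r in ratios:
                    w = s / r ** 0.5
                    h = s * r ** 0.5
                    anchors.append((cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2))
    return anchors
```

The RPN then predicts, for each anchor, an object/no object score and four offsets relative to that anchor.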
One network, four losses
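[Figure: image, CNN, feature map, Region Proposal Network, proposals, RoI pooling. Two losses attach to the Region Proposal Network (classification loss, bounding-box regression loss) and two to the output after RoI pooling (classification loss, bounding-box regression loss)]
Source: R. Girshick, K. He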
Faster R-CNN results
Object detection progress
[Figure: detection progress plot with regions labeled "Before CNNs" and "After CNNs", and points labeled R-CNNv1, Fast R-CNN, and Faster R-CNN]
Outline
• Task definition and evaluation
• Two-stage detectors
• R-CNN
• Fast R-CNN
• Faster R-CNN
• Single-stage and multi-resolution detectors
Streamlined detection architectures
• The Faster R-CNN pipeline separates proposal generation
and region classification
[Figure: conv feature map of the entire image, then RPN, then region proposals, then RoI pooling, then RoI features, then classification + regression, then detections]
• Is it possible to do detection in one shot?
[Figure: conv feature map of the entire image, then classification + regression, then detections]
YOLO
• Divide the image into a coarse grid and directly predict class
label and a few candidate boxes for each grid cell
J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, You Only Look Once: Unified, Real-Time
Object Detection, CVPR 2016
YOLO
1. Take conv feature maps at 7x7 resolution
2. Add two FC layers to predict, at each location,
a score for each class and 2 bboxes w/ confidences
• For PASCAL, output is 7 × 7 × 30 (30 = 20 + 2 ∗ (4 + 1))
J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, You Only Look Once: Unified, Real-Time
Object Detection, CVPR 2016
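The output size arithmetic can be checked with a short sketch. The ordering inside each cell's vector (class scores first, then the boxes) is an assumption made for illustration, as are the function names.

```python
def yolo_output_shape(grid=7, boxes_per_cell=2, num_classes=20):
    """Each cell predicts class scores plus (4 coords + 1 confidence) per box."""
    return (grid, grid, num_classes + boxes_per_cell * (4 + 1))


def split_cell(cell, boxes_per_cell=2, num_classes=20):
    """Split one cell's prediction vector into class scores and boxes.

    Assumed layout: [class scores | box 0 (x, y, w, h, conf) | box 1 ...].
    """
    class_scores = cell[:num_classes]
    boxes = []
    for k in range(boxes_per_cell):
        start = num_classes + 5 * k
        x, y, w, h, conf = cell[start:start + 5]
        boxes.append({"box": (x, y, w, h), "confidence": conf})
    return class_scores, boxes
```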
YOLO
• Objective function:
Terms of the objective, with annotations:
• Indicator in the sums: cell i contains object, predictor j is responsible
for it
• Regression: small deviations matter less for larger boxes than for smaller
boxes
• Object/no object confidence: one term for confidence for object, one for
confidence for no object; down-weight loss from boxes that don’t contain
objects (𝜆noobj = 0.5)
• Class prediction: class probability
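For reference, the objective as published in the cited YOLO paper can be written as below. The symbols are the paper's and do not all appear in the slide text: $S$ is the grid size, $B$ the number of boxes per cell, $C_i$ the box confidence, $p_i(c)$ the class probability, hats mark predictions, and $\lambda_{\mathrm{coord}} = 5$ is the paper's value. The square roots on width and height are what make small deviations matter less for larger boxes.

```latex
\begin{aligned}
&\lambda_{\mathrm{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{I}_{ij}^{\mathrm{obj}}
  \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{\mathrm{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{I}_{ij}^{\mathrm{obj}}
  \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{I}_{ij}^{\mathrm{obj}} \left( C_i - \hat{C}_i \right)^2 \\
&+ \lambda_{\mathrm{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{I}_{ij}^{\mathrm{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{I}_{i}^{\mathrm{obj}} \sum_{c \,\in\, \mathrm{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
```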
YOLO: Results
• Each grid cell predicts only two boxes and can only have one class –
this limits the number of nearby objects that can be predicted
• Localization accuracy suffers compared to Fast(er) R-CNN due to
coarser features, errors on small boxes
• 7x speedup over Faster R-CNN (45-155 FPS vs. 7-18 FPS)
Performance on PASCAL 2007
YOLO v2
• Remove FC layer, do convolutional prediction with anchor boxes instead
• Increase resolution of input images and conv feature maps
• Improve accuracy using batch normalization and other tricks
[Figure: VOC 2007 results]
J. Redmon and A. Farhadi, YOLO9000: Better, Faster, Stronger, CVPR 2017
Multi-resolution prediction: SSD
• Predict boxes of different size from different conv maps
• Each level of resolution has its own predictor
W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. Berg, SSD: Single Shot MultiBox Detector, ECCV 2016
Feature pyramid networks
• Improve predictive power of lower-level feature maps by adding contextual
information from higher-level feature maps
• Predict different sizes of bounding boxes from different levels of the
pyramid (but share parameters of predictors)
T.-Y. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, and S. Belongie, Feature pyramid networks for object detection, CVPR 2017
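A toy single-channel sketch of how a higher-level (coarser) map can be added into a lower-level (finer) one: upsample by 2 with nearest neighbor, then add element-wise. It assumes the fine map is exactly twice the coarse map's size and leaves out the convolutions the paper applies around this step; the function names are illustrative.

```python
def upsample2x(fmap):
    """Nearest-neighbor 2x upsampling of a 2D map given as a list of rows."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each column
        out.append(wide)
        out.append(list(wide))                     # repeat each row
    return out


def top_down_merge(fine, coarse):
    """Add the upsampled coarse map to the fine map, element by element."""
    up = upsample2x(coarse)
    return [[f + u for f, u in zip(f_row, u_row)]
            for f_row, u_row in zip(fine, up)]
```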
RetinaNet
• Combine feature pyramid network with focal loss to reduce the standard
cross-entropy loss for well-classified examples
T.-Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollar, Focal loss for dense object detection, ICCV 2017
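A sketch of the binary focal loss next to plain cross-entropy, to show the down-weighting of well-classified examples. The form $-\alpha_t (1 - p_t)^\gamma \log p_t$ and the defaults `gamma=2.0`, `alpha=0.25` follow the cited paper; they are not stated on the slide, and the function names are illustrative.

```python
import math


def cross_entropy(p, y):
    """Binary cross-entropy for foreground probability p and label y in {0, 1}."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Cross-entropy scaled by (1 - p_t)^gamma and a class weight alpha_t."""
    p_t = p if y == 1 else 1.0 - p            # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With `gamma=2.0`, an example classified at $p_t = 0.9$ has its loss scaled by $0.1^2 = 0.01$, while one at $p_t = 0.6$ is scaled by $0.4^2 = 0.16$, before the class weight is applied.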
RetinaNet: Results
T.-Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollar, Focal loss for dense object detection, ICCV 2017
Outline
• Task definition and evaluation
• Two-stage detectors
• R-CNN
• Fast R-CNN
• Faster R-CNN
• Single-stage and multi-resolution detectors
• Recent trends
CornerNet
H. Law and J. Deng, CornerNet: Detecting Objects as Paired Keypoints, ECCV 2018
CenterNet
• Use an additional center point to verify predictions:
K. Duan et al. CenterNet: Keypoint Triplets for Object Detection, ICCV 2019
Detection Transformer (DETR)
N. Carion et al., End-to-end object detection with transformers, ECCV 2020