Deep Learning
Object Detection and Segmentation
Huỳnh Văn Thống
FPT Univ.
Semantic Segmentation
• Label each pixel in the image with
a category label.
• Don’t differentiate instances, only
care about pixels.
2/24/2025 2
Segmentation: Dataset
• Pascal VOC: 16K training natural images divided into 20 classes.
• Cityscapes: 25K urban street-scene images divided into 30 classes.
• ADE20K: 25K scene-parsing images (the "20K" refers to the 20K training images) divided into 150 classes.
• MS COCO: 328K images with 80 "things" categories and 91 "stuff" categories.
Models are often pre-trained on the large MS COCO dataset before being fine-tuned on the specific target dataset.
Semantic Segmentation: FCN
• FCN = Fully Convolutional Network.
• Design a network as a bunch of convolutional layers to make
predictions for pixels all at once.
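The simplest fully convolutional head is a 1x1 convolution: the same linear classifier applied independently at every pixel. A minimal NumPy sketch (the feature and weight shapes here are illustrative assumptions, not the lecture's exact architecture):

```python
import numpy as np

def per_pixel_classify(features, weights):
    """features: (C_in, H, W); weights: (C_out, C_in).
    Applies the same linear classifier at every pixel (a 1x1 conv)
    and returns the (H, W) map of predicted class labels."""
    scores = np.einsum('oc,chw->ohw', weights, features)
    return scores.argmax(axis=0)
```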
Semantic Segmentation: FCN
• Design a network as a bunch of convolutional layers to make
predictions for pixels all at once.
Problem #1: Effective receptive field size is linear in the number of conv layers: with L 3x3 conv layers, the receptive field is 1 + 2L.
Problem #2: Convolution on high-resolution images is expensive! Recall that the ResNet stem aggressively downsamples.
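The 1 + 2L growth can be checked with a tiny helper (a sketch; the kernel size and layer counts are the slide's example):

```python
def receptive_field(num_layers, kernel=3):
    """Receptive field of a stack of stride-1 conv layers:
    each k x k layer adds (k - 1) pixels on top of the initial
    1-pixel field, so L 3x3 layers give 1 + 2L."""
    rf = 1
    for _ in range(num_layers):
        rf += kernel - 1
    return rf
```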
Semantic Segmentation: FCN
• Design network as a bunch of convolutional layers, with
downsampling and upsampling inside the network!
Semantic Segmentation: FCN
• Design network as a bunch of convolutional layers, with
downsampling and upsampling inside the network!
Downsampling: pooling, strided convolution
Upsampling: ?
In-Network Upsampling: “Unpooling”
In-Network Upsampling: Bilinear Interpolation
Use two closest neighbors in 𝑥 and 𝑦
to construct linear approximations
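As a sketch, bilinear upsampling can be written directly in NumPy (the scale factor and the align-corners sampling convention below are my assumptions for illustration):

```python
import numpy as np

def bilinear_upsample(x, scale):
    """Upsample a 2D array by linear interpolation along each axis
    (align_corners=True convention: corner pixels map to corners)."""
    h, w = x.shape
    H, W = h * scale, w * scale
    ys = np.linspace(0, h - 1, H)          # sample coords in input grid
    xs = np.linspace(0, w - 1, W)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]; wx = (xs - x0)[None, :]
    # Interpolate along x on the two neighboring rows, then along y.
    top = (1 - wx) * x[np.ix_(y0, x0)] + wx * x[np.ix_(y0, x1)]
    bot = (1 - wx) * x[np.ix_(y1, x0)] + wx * x[np.ix_(y1, x1)]
    return (1 - wy) * top + wy * bot
```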
In-Network Upsampling: Bicubic Interpolation
Use the four closest neighbors in 𝑥 and 𝑦 to construct cubic approximations.
(This is how we normally resize images)
In-Network Upsampling: “Max Unpooling”
Max Pooling: remember which position had the max.
Max Unpooling: place values into the remembered positions.
Pair each downsampling layer with
an upsampling layer
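A toy NumPy sketch of such a paired pool/unpool (2x2 windows and a single channel are my simplifying assumptions):

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling that also remembers where each max came from."""
    h, w = x.shape
    out = np.zeros((h // 2, w // 2))
    idx = np.zeros((h // 2, w // 2, 2), dtype=int)
    for i in range(h // 2):
        for j in range(w // 2):
            block = x[2*i:2*i+2, 2*j:2*j+2]
            k = np.unravel_index(np.argmax(block), (2, 2))
            out[i, j] = block[k]
            idx[i, j] = (2*i + k[0], 2*j + k[1])
    return out, idx

def max_unpool_2x2(y, idx, shape):
    """Place each value back at its remembered position; zeros elsewhere."""
    out = np.zeros(shape)
    for i in range(y.shape[0]):
        for j in range(y.shape[1]):
            out[tuple(idx[i, j])] = y[i, j]
    return out
```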
Learnable Upsampling: Transposed Convolution
Recall: Normal 3 x 3 convolution, stride 1, pad 1
Learnable Upsampling: Transposed Convolution
Recall: Normal 3 x 3 convolution, stride 2, pad 1
Learnable Upsampling: Transposed Convolution
Recall: Normal 3 x 3 convolution, stride 2, pad 1
Convolution with stride > 1 is “Learnable Downsampling”
Can we use stride < 1 for “Learnable Upsampling”?
Learnable Upsampling: Transposed Convolution
3 x 3 transposed convolution, stride 2
Learnable Upsampling: Transposed Convolution
3 x 3 transposed convolution, stride 2
Sum where outputs overlap.
Transposed Convolution: 1D example
Output has copies of the filter weighted by the input.
Stride 2: move 2 pixels in the output for each pixel in the input.
Sum at overlaps.
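The 1D picture above translates into a few lines of NumPy (a sketch; the kernel and input values are toy examples):

```python
import numpy as np

def transposed_conv1d(x, w, stride=2):
    """The output is a sum of copies of the filter w, each scaled by one
    input value and shifted by `stride`; overlapping copies are summed."""
    out = np.zeros(stride * (len(x) - 1) + len(w))
    for i, xi in enumerate(x):
        out[i * stride : i * stride + len(w)] += xi * w
    return out
```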
Transposed Convolution: 1D example
Many names:
• Deconvolution (bad).
• Upconvolution.
• Fractionally strided
convolution.
• Backward strided
convolution.
• Transposed Convolution
(best).
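The name "transposed" comes from the matrix view: writing a strided convolution as a matrix X so that y = X @ x, the transposed convolution is multiplication by X.T. A small check (the kernel, stride, and sizes are illustrative choices):

```python
import numpy as np

# 1D convolution (kernel [1, 2, 3], stride 2, no padding) of a
# length-5 input, written as a matrix: y = X @ x maps 5 -> 2 samples.
X = np.array([[1., 2., 3., 0., 0.],
              [0., 0., 1., 2., 3.]])

x_low = np.array([4., 5.])   # a length-2 low-resolution signal
up = X.T @ x_low             # transposed convolution: maps 2 -> 5 samples
# Each input value scales a shifted copy of the kernel;
# the copies overlap (and sum) at index 2.
```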
Semantic Segmentation: FCN
• Design network as a bunch of convolutional layers, with
downsampling and upsampling inside the network!
Downsampling: pooling, strided convolution
Upsampling: interpolation, transposed convolution
Semantic Segmentation: FCN
• Combine predictions at different resolutions.
Fully Convolutional Networks for Semantic Segmentation. Long et al., CVPR, 2015
Semantic Segmentation: U-Net
• Incorporate low-level information via skip connections.
U-Net: Convolutional Networks for Biomedical Image
Segmentation, Ronneberger et al., MICCAI 2015
Semantic Segmentation: DeepLabV3+
• Encode multi-scale contextual information by applying atrous convolution at multiple scales.
Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation, Chen et al., ECCV 2018
Atrous Convolution
Sparse feature extraction with
standard convolution on a
low-resolution input feature
map.
Dense feature extraction with
atrous convolution with rate r=2,
applied on a high-resolution input
feature map.
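In 1D, atrous (dilated) convolution simply spaces the filter taps `rate` pixels apart, enlarging the receptive field without extra parameters or downsampling (a sketch with toy values):

```python
import numpy as np

def atrous_conv1d(x, w, rate):
    """1D convolution 'with holes': filter taps are spaced `rate`
    apart, so the effective kernel extent is rate*(len(w)-1)+1
    while the number of weights stays len(w)."""
    span = rate * (len(w) - 1) + 1
    out = np.zeros(len(x) - span + 1)
    for i in range(len(out)):
        out[i] = sum(w[j] * x[i + j * rate] for j in range(len(w)))
    return out
```

With rate=1 this reduces to an ordinary convolution.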
Semantic Segmentation: DeepLabV3+
• Encode multi-scale contextual
information by applying atrous
convolution at multiple scales.
• Refine the segmentation
results along object
boundaries.
Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation, Chen et al., ECCV 2018
Computer Vision Tasks
Object Detection: detects individual object instances, but only gives a box.
Semantic Segmentation: gives per-pixel labels, but merges instances.
Things and Stuff
Things: object categories that can be separated into object instances (e.g. cats, cars, people).
Stuff: object categories that cannot be separated into instances (e.g. sky, grass, water, trees).
Computer Vision Tasks
Object Detection: detects individual object instances, but only gives a box. (Only things)
Semantic Segmentation: gives per-pixel labels, but merges instances. (Both things and stuff)
Computer Vision Tasks
Instance Segmentation: detect all objects in the image and identify the pixels that belong to each object. (Only things!)
Semantic Segmentation: gives per-pixel labels, but merges instances. (Both things and stuff)
Computer Vision Tasks: Instance Segmentation
Instance Segmentation: Detect all
objects in the image, and identify the
pixels that belong to each object.
(Only things!)
Approach: Perform object detection,
then predict a segmentation mask
for each object!
Beyond Instance Segmentation: Panoptic Segmentation
• Label all pixels in the image
(both things and stuff).
• For “thing” categories also
separate into instances.
Beyond Instance Segmentation: Panoptic Segmentation
Panoptic quality (PQ) measure
• Computed per-category and results are averaged
across categories.
• The ground truth and predicted segments are matched with an IoU threshold of 0.5.
• TP (matched pairs), FP (unmatched predicted segments), and FN (unmatched ground-truth segments).
SQ: how close the matched predicted segments are to their ground-truth segments (does not consider bad predictions!)
RQ: just like for detection, we want to know if we are missing any instances (FN) or predicting more instances (FP)
Next
• Visualization and Understanding
• Attention and Transformer
• Foundation Models and Promptable Segmentation.
• ….
Questions?