Instance Segmentation
Riley Simmons-Edler, Berthy Feng
Instance Segmentation Task
● Label each foreground pixel with its object class and instance
● Object detection + semantic segmentation
Slide Credit: Kaiming He
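A minimal sketch of what the task output looks like, assuming a toy 4x6 image with two objects of the same class. The class ids, instance ids, and the helper name `collect_instances` are made up for illustration. Semantic segmentation alone would give only `class_map`; the instance map is what separates the two objects.

```python
from collections import defaultdict

# Toy 4x6 image. 0 = background; class id 1 is a hypothetical "person" class.
class_map = [
    [0, 1, 1, 0, 1, 1],
    [0, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
]
# Same image, but each object gets its own instance id.
instance_map = [
    [0, 1, 1, 0, 2, 2],
    [0, 1, 1, 0, 2, 2],
    [0, 0, 0, 0, 2, 2],
    [0, 0, 0, 0, 0, 0],
]


def collect_instances(class_map, instance_map):
    """Group foreground pixels by (class id, instance id)."""
    pixels = defaultdict(list)
    for y, (class_row, inst_row) in enumerate(zip(class_map, instance_map)):
        for x, (c, i) in enumerate(zip(class_row, inst_row)):
            if c != 0:  # skip background pixels
                pixels[(c, i)].append((y, x))
    return dict(pixels)


instances = collect_instances(class_map, instance_map)
for (c, i), px in sorted(instances.items()):
    print(f"class {c}, instance {i}: {len(px)} pixels")
```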
In This Lecture...
● Microsoft COCO dataset
● Mask R-CNN (fully supervised)
● Mask^X R-CNN (partially supervised)
Microsoft COCO:
Common Objects in Context
Tsung-Yi Lin, Michael Maire, Serge Belongie, et al.
“Microsoft COCO: Common Objects in Context.” arXiv,
2015.
Previous Datasets
● ImageNet: many object categories
● PASCAL VOC: object detection in natural images, small number of classes
● SUN: labeling scene types and commonly occurring objects, but not many instances per category
Image Credit: Tsung-Yi Lin et al.
Goal: Push research in scene understanding
1. Detecting non-iconic views
2. Contextual reasoning between objects
3. Precise 2D localization of objects
MS COCO Dataset
❖ 91 object classes
❖ 328,000 images
❖ 2.5 million labeled instances
Image Credit: Tsung-Yi Lin et al.
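Dataset counts like the ones above can be recomputed from an annotation file. The sketch below assumes the usual COCO JSON layout (`images`, `categories`, `annotations`, with `bbox` as [x, y, width, height]). The sample records and the two helper functions are made up for illustration and are not real COCO data.

```python
from collections import Counter

# Tiny made-up sample in COCO-style layout (not real COCO records).
coco = {
    "images": [
        {"id": 1, "file_name": "kitchen.jpg", "width": 640, "height": 480},
        {"id": 2, "file_name": "street.jpg", "width": 640, "height": 427},
    ],
    "categories": [
        {"id": 1, "name": "person"},
        {"id": 18, "name": "dog"},
    ],
    "annotations": [
        {"id": 101, "image_id": 1, "category_id": 1, "bbox": [12, 30, 80, 200], "iscrowd": 0},
        {"id": 102, "image_id": 1, "category_id": 1, "bbox": [150, 40, 70, 190], "iscrowd": 0},
        {"id": 103, "image_id": 1, "category_id": 18, "bbox": [300, 220, 90, 60], "iscrowd": 0},
        {"id": 104, "image_id": 2, "category_id": 1, "bbox": [5, 10, 60, 180], "iscrowd": 0},
    ],
}


def instances_per_category(dataset):
    """Count labeled instances for each category name."""
    names = {c["id"]: c["name"] for c in dataset["categories"]}
    return Counter(names[a["category_id"]] for a in dataset["annotations"])


def mean_instances_per_image(dataset):
    """Average number of labeled instances per image."""
    return len(dataset["annotations"]) / len(dataset["images"])


print(instances_per_category(coco))
print(mean_instances_per_image(coco))
```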
Image Collection & Annotation
Object Categories
Image Credit: Tsung-Yi Lin et al.
Non-Iconic Image Collection
Image Credit: Tsung-Yi Lin et al.
Annotation
Image Credit: Tsung-Yi Lin et al.
Dataset Evaluation
Statistics
Image Credit: Tsung-Yi Lin et al.
Statistics
Image Credit: Tsung-Yi Lin et al.
COCO Detection Challenge
Image Credit: Tsung-Yi Lin et al.
COCO Keypoint Challenge
Image Credit: Tsung-Yi Lin et al.
COCO Stuff Challenge
Image Credit: Tsung-Yi Lin et al.
COCO Places Challenges
Image Credit: Tsung-Yi Lin et al.
Mask R-CNN
Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross
Girshick. “Mask R-CNN.” ICCV, 2017.
Faster R-CNN
Fast R-CNN
Image Credit: Shaoqing Ren et al.
Image Credit: Tomasz Grel
Insight: Region Proposal and Detection Use the Same Features
Image Credit: Shaoqing Ren et al.
Faster R-CNN = RPN + Fast R-CNN
RPN = Fully Convolutional Network
Extending to Instance Segmentation
Visual Perception Problems
Slide Credit: Kaiming He
Instance Segmentation Methods
Slide Credit: Kaiming He
Insight: Mask Prediction in Parallel
Slide Credit: Kaiming He
RoIPool
Image Credit: Tomasz Grel
RoIPool
Slide Credit: Kaiming He
RoIAlign
Slide Credit: Kaiming He
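RoIPool snaps region coordinates to the integer feature grid before pooling, while RoIAlign keeps fractional coordinates and reads the feature map by bilinear interpolation. The sketch below contrasts the two lookups at a single sampling point. The function names are made up, the snapping uses floor as an illustrative choice, and the full operator (bins, several sampling points per bin, pooling) is left out.

```python
import math


def bilinear_sample(feature, y, x):
    """Interpolate a 2D feature map at a fractional (y, x) location.

    Assumes 0 <= y <= height - 1 and 0 <= x <= width - 1.
    """
    h, w = len(feature), len(feature[0])
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    # Interpolate along x on the two neighboring rows, then along y.
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def quantized_sample(feature, y, x):
    """RoIPool-style lookup: snap the coordinate to the grid first."""
    return feature[int(math.floor(y))][int(math.floor(x))]


feature = [
    [0, 10, 20],
    [30, 40, 50],
    [60, 70, 80],
]
print(bilinear_sample(feature, 0.5, 1.5))   # uses all four neighbors
print(quantized_sample(feature, 0.5, 1.5))  # uses only cell (0, 1)
```

The gap between the two values is the misalignment that quantization introduces at this one point.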
Mask R-CNN
Mask R-CNN Results
Examples
● Mask AP = 35.7
Image Credit: Kaiming He et al.
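Mask AP scores a predicted mask by its intersection-over-union (IoU) with the ground-truth mask, and the COCO metric averages over IoU thresholds from 0.50 to 0.95 in steps of 0.05. The sketch below computes only the mask IoU and counts how many of those thresholds a single prediction would pass. It is not a full AP computation, and the masks are made up.

```python
def mask_iou(pred, gt):
    """Intersection-over-union of two binary masks of equal shape."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    return inter / union if union else 0.0


# IoU thresholds 0.50, 0.55, ..., 0.95 (rounded to avoid float drift).
IOU_THRESHOLDS = [round(0.5 + 0.05 * k, 2) for k in range(10)]

pred = [
    [1, 1, 1],
    [1, 1, 1],
    [0, 0, 0],
]
gt = [
    [1, 1, 1],
    [1, 1, 0],
    [0, 0, 0],
]
iou = mask_iou(pred, gt)
passed = sum(iou >= t for t in IOU_THRESHOLDS)
print(iou, passed)
```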
Comparisons
Image Credit: Kaiming He et al.
Comparisons
Image Credit: Kaiming He et al.
Application: Human Pose Estimation
Image Credit: Kaiming He et al.
Mask R-CNN Recap
● Add a parallel mask prediction head to Faster R-CNN
● RoIAlign allows for precise localization
● Mask R-CNN improves on the AP of the previous state of the art and can be applied to human pose estimation
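The parallel mask head adds one term to the detection loss: L = L_cls + L_box + L_mask, where L_mask is an average per-pixel binary cross-entropy computed only on the mask of the ground-truth class. The sketch below illustrates that decoupling with made-up 2x2 logits for two classes. The function names are hypothetical, and it skips the numerical safeguards a real implementation would need for large logits.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mask_loss(mask_logits, gt_mask, gt_class):
    """Average per-pixel binary cross-entropy on the ground-truth class's mask.

    mask_logits holds one score map per class; the other classes' maps are
    ignored, so masks do not compete across classes.
    """
    total = 0.0
    count = 0
    for logit_row, gt_row in zip(mask_logits[gt_class], gt_mask):
        for z, t in zip(logit_row, gt_row):
            p = sigmoid(z)
            total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
            count += 1
    return total / count


def total_loss(l_cls, l_box, l_mask):
    """Multi-task loss: classification + box regression + mask."""
    return l_cls + l_box + l_mask


mask_logits = [
    [[0.0, 0.0], [0.0, 0.0]],    # class 0: uninformative prediction
    [[5.0, -5.0], [5.0, -5.0]],  # class 1: ignored when gt_class is 0
]
gt_mask = [[1, 0], [1, 0]]
l_mask = mask_loss(mask_logits, gt_mask, gt_class=0)
print(l_mask)
print(total_loss(0.5, 0.25, l_mask))
```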
Learning to Segment Every Thing
Ronghang Hu, Piotr Dollar, Kaiming He, Trevor Darrell, and
Ross Girshick. “Learning to Segment Every Thing.” arXiv,
2017.
Partially Supervised Model
Motivation for a Partially Supervised Model
A = set of object categories with complete mask annotations
B = set of object categories with only bounding boxes (no segmentation annotations)
How can we learn to segment every category in C = A ∪ B?
Image Credit: Ronghang Hu et al.
Transfer Learning
Image Credit: Ronghang Hu et al.
Weight Transfer Function
Image Credit: Ronghang Hu et al.
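The weight transfer function predicts a category's mask weights from its detection weights, w_seg = T(w_det; θ), with θ shared across all categories. The sketch below assumes T is a tiny two-layer MLP with a LeakyReLU. The dimensions, the numeric values, and the names `transfer`, `matvec`, and `leaky_relu` are made up for illustration.

```python
def leaky_relu(v, slope=0.1):
    return [x if x > 0 else slope * x for x in v]


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def transfer(w_det, theta):
    """Predict mask weights from detection weights: w_seg = T(w_det; theta)."""
    w1, w2 = theta
    hidden = leaky_relu(matvec(w1, w_det))
    return matvec(w2, hidden)


# One class-agnostic theta (hypothetical values), reused for every category.
theta = (
    [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]],  # first layer: 2 -> 3
    [[1.0, 1.0, 1.0], [0.0, 1.0, 0.0]],     # second layer: 3 -> 2
)

# Detection weights per category (hypothetical). "zebra" stands in for a
# category in B: it has detection weights but no mask annotations.
w_det = {"dog": [2.0, 1.0], "zebra": [1.0, 3.0]}
w_seg = {name: transfer(w, theta) for name, w in w_det.items()}
print(w_seg)
```

Because θ does not depend on the category, a function fitted on categories in A can still produce mask weights for categories in B.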
Training
● Train the bounding box head using standard box detection losses on all classes in A ∪ B
● Train the mask head and the weight transfer function using the mask loss on classes in A only
Image Credit: Ronghang Hu et al.
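The two training rules above amount to a per-example switch: every example contributes a box loss, but only examples from categories in A contribute a mask loss. The sketch below assumes the per-example loss values are already computed. The category names, the loss numbers, and the function name are made up.

```python
SET_A = {"person", "dog"}  # hypothetical categories with mask annotations
SET_B = {"zebra"}          # hypothetical categories with boxes only


def training_loss(examples):
    """Sum box losses over A and B, and mask losses over A only."""
    total = 0.0
    for ex in examples:
        total += ex["box_loss"]       # box supervision exists for every class
        if ex["category"] in SET_A:   # mask supervision exists only for A
            total += ex["mask_loss"]
    return total


examples = [
    {"category": "person", "box_loss": 0.5, "mask_loss": 0.25},
    {"category": "zebra", "box_loss": 0.75, "mask_loss": None},  # no mask label
]
print(training_loss(examples))
```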
Stage-Wise Training
1. Detection training
2. Segmentation training
● Train detection once and then fine-tune weight transfer function
● Inferior performance
Image Credit: Ronghang Hu et al.
End-to-End Joint Training
● Jointly train detection head and mask head end-to-end
● Want detection weights to stay homogeneous between A and B, so the mask loss gradient is not propagated into the detection weights
Image Credit: Ronghang Hu et al.
End-to-End Training Performs Better
Image Credit: Ronghang Hu et al.
Mask Prediction
Baseline: Class-agnostic FCN mask prediction
Extension: FCN+MLP mask heads
Image Credit: Ronghang Hu et al.
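A rough sketch of how the FCN and MLP mask heads might be combined. It assumes the fusion is a plain addition, with a single class-agnostic M x M score map from the MLP head added to each of the K class-specific score maps from the FCN head. The shapes, the values, and the function name are made up for illustration.

```python
def fuse_mask_scores(fcn_scores, mlp_scores):
    """Add one class-agnostic M x M map to each of K class-specific maps.

    fcn_scores: K maps of size M x M (nested lists)
    mlp_scores: one map of size M x M
    """
    return [
        [
            [f + m for f, m in zip(fcn_row, mlp_row)]
            for fcn_row, mlp_row in zip(class_map, mlp_scores)
        ]
        for class_map in fcn_scores
    ]


fcn_scores = [
    [[1.0, 2.0], [3.0, 4.0]],  # class 0
    [[0.0, 0.0], [0.0, 0.0]],  # class 1
]
mlp_scores = [[0.5, -0.5], [0.5, -0.5]]
fused = fuse_mask_scores(fcn_scores, mlp_scores)
print(fused)
```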
Results
Examples
Image Credit: Ronghang Hu et al.
Comparisons
Image Credit: Ronghang Hu et al.
Segmenting Everything
Image Credit: Ronghang Hu et al.