
Instance and Panoptic Segmentation Using Conditional Convolutions
Zhi Tian, Bowen Zhang, Hao Chen, Chunhua Shen

Abstract—We propose a simple yet effective framework for instance and panoptic segmentation, termed CondInst (conditional
convolutions for instance and panoptic segmentation). In the literature, top-performing instance segmentation methods typically follow
the paradigm of Mask R-CNN and rely on ROI operations (typically ROIAlign) to attend to each instance. In contrast, we propose to
attend to the instances with dynamic conditional convolutions. Instead of using instance-wise ROIs as inputs to the instance mask head
of fixed weights, we design dynamic instance-aware mask heads, conditioned on the instances to be predicted. CondInst enjoys three
advantages: 1) Instance and panoptic segmentation are unified into a fully convolutional network, eliminating the need for ROI cropping
and feature alignment. 2) The elimination of the ROI cropping also significantly improves the output instance mask resolution. 3) Due to
the much improved capacity of dynamically-generated conditional convolutions, the mask head can be very compact (e.g., 3 conv.
layers, each having only 8 channels), leading to significantly faster inference per instance and making the overall inference time largely independent of the number of instances. We demonstrate a simpler method that achieves improved accuracy and inference speed
on both instance and panoptic segmentation tasks. On the COCO dataset, we outperform a few state-of-the-art methods. We hope that
CondInst can be a strong baseline for instance and panoptic segmentation. Code is available at: https://git.io/AdelaiDet

Index Terms—Fully convolutional networks, conditional convolutions, instance segmentation, panoptic segmentation

Fig. 1 – CondInst uses instance-aware mask heads to predict the mask for each instance. K is the number of instances to be predicted. Note that each output map only contains the mask of one instance. The filters in the mask head vary with different instances, as they are dynamically generated and conditioned on the target instance. ReLU is used as the activation function (excluding the last conv. layer).

Accepted to IEEE Trans. on Pattern Analysis and Machine Intelligence (TPAMI), 20 Jan. 2022. Work was done when all authors were with The University of Adelaide, Australia. B. Zhang is with The University of Adelaide. Z. Tian is with Meituan Inc. H. Chen and C. Shen are with Zhejiang University, China. C. Shen is the corresponding author (e-mail: chunhua@me.com).

1 INTRODUCTION

Instance segmentation is a fundamental yet challenging task in computer vision, which requires an algorithm to predict a per-pixel mask with a category label for each instance of interest in an image. Panoptic segmentation further requires the algorithm to segment the stuff (e.g., sky and grass), assigning every pixel in the image a semantic label. Panoptic segmentation is often built on an instance segmentation framework with an extra semantic segmentation branch. Therefore, both instance and panoptic segmentation share the same key challenge: how to efficiently and effectively distinguish individual instances.

Despite a few works being proposed recently, the dominant method tackling this challenge is still the two-stage approach exemplified by Mask R-CNN [2], which casts instance segmentation into a two-stage detection-and-segmentation task. To be specific, Mask R-CNN first employs an object detector (Faster R-CNN) to predict a bounding-box for each instance. Then, for each instance, regions-of-interest (ROIs) are cropped from the network's feature maps using the ROIAlign operation. To predict the final mask for each instance, a compact fully convolutional network (FCN), i.e., the mask head, is applied to these ROIs to perform foreground/background segmentation. However, this ROI-based method may have the following drawbacks. 1) Since ROIs are often axis-aligned bounding-boxes, for objects with irregular shapes they may contain an excessive amount of irrelevant image content, including background and other instances. This issue may be mitigated by using rotated ROIs [3], but at the price of a more complex pipeline. 2) In order to distinguish the foreground instance from the background stuff or other instance(s) with a fixed mask head, the mask head needs a strong capacity and a relatively large receptive field to encode sufficiently large context information. As a result, a stack of 3 × 3 convolutions is used in the mask head (e.g., four 3 × 3 convolutions with 256 channels in Mask R-CNN). This considerably increases the computational complexity of the mask head, with the result that the inference time varies significantly with the number of instances. 3) ROIs are typically of different sizes. In order to use efficient batched computation in modern deep learning frameworks [4], [5], a resizing operation is often required
Fig. 2 – Qualitative comparisons with other methods. We compare the proposed CondInst against YOLACT [1] and Mask R-CNN [2]. Our masks are generally of higher quality (e.g., preserving finer details). Best viewed on screen.

to resize the cropped regions into patches of the same size. For instance, Mask R-CNN resizes all the cropped regions to 14×14 (upsampled to 28×28 using a deconvolution), which restricts the output resolution of instance segmentation, as large instances would require higher resolutions to retain details at the boundary.

In computer vision, the closest task to instance segmentation is semantic segmentation, for which fully convolutional networks (FCNs) have shown dramatic success [6]–[10]. FCNs have also shown excellent performance on many other per-pixel prediction tasks, ranging from low-level image processing such as denoising and super-resolution, to mid-level tasks such as optical flow estimation and contour detection, and high-level tasks including recent single-shot object detection [11], monocular depth estimation [12]–[14] and counting [15]. However, almost all instance segmentation methods based on FCNs(1) lag behind state-of-the-art ROI-based methods. Why do the versatile FCNs perform unsatisfactorily on instance segmentation? The major difficulty is that similar image appearance may require different predictions, but FCNs struggle at achieving this: they tend to yield similar predictions for similar image appearance, and are thus incapable of distinguishing individual instances. For example, if two persons A and B with similar appearance are in an input image, then when predicting the instance mask of A, the FCN needs to predict B as background w.r.t. A, which can be difficult as they look similar in appearance. Therefore, an ROI operation is used to crop the person of interest, i.e., A, and filter out B. Essentially, this is the core operation making the model attend to an instance. Put differently, instance segmentation needs two types of information: 1) appearance information to categorize objects; and 2) location information to distinguish multiple objects belonging to the same category. Almost all top-performing methods rely on ROI cropping, which explicitly encodes the location information of instances.

(1) By FCNs, we mean the vanilla FCNs in [6] that only involve convolutions and pooling.

In this work, we advocate a new solution that uses instance-aware FCNs for instance mask prediction, termed CondInst. Instead of using ROIs, CondInst attends to each instance by using instance-sensitive convolution filters as well as relative coordinates that are appended to the feature maps, thereby exploiting the location information. Specifically, unlike Mask R-CNN, which uses a standard ConvNet with a fixed set of convolutional filters as the mask head for predicting all instances, the network parameters in our mask head are adapted to the instance to be predicted. Inspired by dynamic filter networks [16] and CondConv [17], for each instance, a controller sub-network (see Fig. 3) dynamically generates the filters of the mask head, conditioned on the center area of the instance, and the resulting mask head is then used to predict the mask of this instance. It is expected that the generated parameters can encode the characteristics (e.g., relative position, shape and appearance) of this instance, so that the mask head only fires on the pixels of this instance, which thus bypasses the difficulty of the standard FCNs. These conditional mask heads are applied to the whole high-resolution feature maps, eliminating the need for ROI operations. At first glance, the idea may not work well, as instance-wise mask heads may incur a large number of network parameters, given that some images contain as many as dozens of instances. However, because the filters of each mask head are only asked to predict the mask of a single instance, the learning requirement is largely eased and the load on the filters is reduced. As a result, the mask head can be extremely light-weight. We will show that a very compact mask head with dynamically-generated filters can already outperform the previous ROI-based Mask R-CNN, with much reduced computational complexity per instance compared with the mask head in Mask R-CNN.

We summarize our main contributions as follows.

• We attempt to solve instance segmentation from a new perspective that uses dynamic mask heads. To this end, we propose the CondInst instance segmentation framework, which achieves improved instance segmentation performance over existing methods such as Mask R-CNN while being faster. To our knowledge, this is the first time that a new instance segmentation framework outperforms recent state-of-the-art methods both in accuracy and speed.

• CondInst is fully convolutional and avoids the aforementioned resizing operation used in many existing methods, as CondInst does not rely on ROI operations. Not having to resize the feature maps leads to high-resolution instance masks with more accurate edges, as shown in Fig. 2.

• Since the mask head in CondInst is very compact and light-weight, compared with the box detector FCOS, CondInst needs only ∼10% more computational time (less than 5 milliseconds) to obtain the mask results of all the instances, even when processing the maximum number of instances per image (i.e., 100 instances). As a result, the overall inference time is stable, as it almost does not depend on the number of instances in the image.

• With an extra semantic segmentation branch, CondInst can be easily extended to panoptic segmentation [18], resulting in a unified fully convolutional network for both instance and panoptic segmentation tasks.

• CondInst achieves state-of-the-art performance on both instance and panoptic segmentation tasks while being fast and simple. We hope that CondInst can be a new strong alternative for instance and panoptic segmentation, as well as other instance-level recognition tasks such as keypoint detection.

2 RELATED WORK

Here we review some work that is most relevant to ours.

Conditional Convolutions/Dynamic Filters. Unlike traditional convolutional layers, which have fixed filters once trained, the filters of conditional convolutions are conditioned on the input and are dynamically generated by another network (i.e., a controller). This idea has been explored previously in dynamic filter networks [16] and CondConv [17], mainly for the purpose of increasing the capacity of a classification network. DGMN [19] also employs dynamic filters to generate the node-specific filters for message calculation, which improves the capacity of the networks and thus results in better performance. In this work, we extend this idea to generate the mask head's filters conditioned on each instance, and present a high-performance instance segmentation method without the need for ROIs.

Instance Segmentation. To date, the dominant framework for instance segmentation is still Mask R-CNN. Mask R-CNN first employs an object detector to detect the bounding-boxes of instances (i.e., ROIs). With these bounding-boxes, an ROI operation is used to crop the features of the instance from the feature maps. Finally, an FCN head is used to obtain the desired instance masks. Many works [20]–[22] with top performance are built on Mask R-CNN. Moreover, some works have explored applying the standard FCNs [6] to instance segmentation. InstanceFCN [23] may be the first instance segmentation method that is fully convolutional. InstanceFCN proposes to predict position-sensitive score maps with vanilla FCNs. Afterwards, these score maps are assembled to obtain the desired instance masks. Note that InstanceFCN does not work well with overlapping instances. Others [24]–[26] attempt to first perform image segmentation, and then the desired instance masks are formed by assembling the pixels of the same instance. Deep Watershed [27] models instance segmentation with the classical watershed transform, and object instances can be viewed as the energy basins in the energy map of the watershed transform of an image. SGN [28] uses a sequence of networks to gradually group the raw pixels into line segments, connected components, and finally object instances, achieving impressive performance. The single-shot Box2Pix [29] solves instance segmentation in a bottom-up fashion. Novotny et al. [30] propose semi-convolutional operators to make FCNs applicable to instance segmentation. Arnab et al. [31] propose a dynamically instantiated CRF (Conditional Random Field) for instance segmentation, which is able to produce a variable number of instances per image. To our knowledge, thus far none of these methods can outperform Mask R-CNN both in accuracy and speed on the COCO benchmark dataset.

The recent YOLACT [1] and BlendMask [32] may be viewed as reformulations of Mask R-CNN, which decouple ROI detection from the feature maps used for mask prediction. Wang et al. developed a simple FCN-based instance segmentation method, which segments the instances by their locations, showing competitive performance [33], [34]. PolarMask [35] developed a new simple mask representation for instance segmentation, which extends the bounding-box detector FCOS [11].

Panoptic Segmentation. There are two main approaches to solving this task. The first one is the bottom-up approach. It first tackles the task as semantic segmentation and then uses clustering/grouping methods to assemble the pixels into individual instances or stuff [36], [37]. The authors of [37] also explore weakly- or semi-supervised panoptic segmentation. The second approach is the top-down approach, which is often built on top-down instance segmentation methods. For example, Panoptic-FPN [38] extends Mask R-CNN with an additional semantic segmentation branch and combines its results with the instance segmentation results generated by Mask R-CNN [18]. Moreover, attention-based methods have recently gained much popularity in many computer vision tasks and provide a new approach to panoptic segmentation. Axial-DeepLab [39] used a carefully designed module to enable attention to be applied to large-size images for panoptic segmentation. CondInst can easily be applied to panoptic segmentation following the top-down approaches. We empirically observe that the quality of the instance segmentation results may be the dominant factor for the final performance. Thus, in CondInst, without bells and whistles, by simply applying the same method used by Panoptic-FPN, the panoptic segmentation performance of CondInst is already competitive compared to the state-of-the-art panoptic segmentation methods.

Additionally, AdaptIS [40] recently proposes to solve panoptic segmentation with FiLM [41]. The idea shares some similarity with CondInst in that information about an instance is encoded in the coefficients generated by FiLM. Since only the batch normalization coefficients are dynamically generated, AdaptIS needs a large mask head to achieve good performance. In contrast, CondInst directly encodes the instance information into the conv. filters of the mask head, which is much more straightforward and efficient. Also, as shown in the experiments, CondInst achieves much better panoptic segmentation accuracy than AdaptIS, which suggests that CondInst is much more effective.
Fig. 3 – The overall architecture of CondInst. C3 , C4 and C5 are the feature maps of the backbone network (e.g., ResNet-50).
P3 to P7 are the FPN feature maps as in [11], [42]. Fbottom is the bottom branch’s output, whose resolution is the same as that
of P3 . Following [32], the bottom branch aggregates the feature maps P3 , P4 and P5 . F̃bottom is obtained by concatenating the
relative coordinates to Fbottom . The classification head predicts the class probability p x,y of the target instance at location (x, y),
same as in FCOS. The controller generates the filter parameters θ x,y of the mask head for the instance. Similar to FCOS, there
are also center-ness and box heads in parallel with the controller (not shown in the figure for simplicity). Note that the heads in
the dashed box are repeatedly applied to P3 · · · P7 . The mask head is instance-aware, and is applied to F̃bottom as many times
as the number of instances in the image (refer to Fig. 1).

3 OUR METHODS: INSTANCE AND PANOPTIC SEGMENTATION WITH CONDINST

We first present CondInst for instance segmentation, and then we show how the instance segmentation framework can be easily extended to panoptic segmentation by using a new semantic branch.

3.1 Overall Architecture for Instance Segmentation

Given an input image I ∈ R^(H×W×3), the goal of instance segmentation is to predict the pixel-level mask and the category of each instance of interest in the image. The ground-truths are defined as {(Mi, ci)}, where Mi ∈ {0, 1}^(H×W) is the mask for the i-th instance and ci ∈ {1, 2, ..., C} is the category. C is 80 on MS-COCO [43]. In semantic segmentation, the prediction target of each pixel is well-defined: it is the semantic category of the pixel. In addition, the number of categories is known and fixed. Thus, the outputs of semantic segmentation can be easily represented with the output feature maps of the FCNs, where each channel of the output feature maps corresponds to a class. However, in instance segmentation, the prediction target of each pixel is hard to define because instance segmentation also requires distinguishing individual instances, and the number of instances changes across images. This poses a major challenge when applying traditional FCNs [6] to instance segmentation.

In this work, our core idea is that for an image with K instances, K different mask heads will be dynamically generated, and each mask head will contain the characteristics of its target instance in its filters. As a result, when a mask head is applied to the input features, it will only fire on the pixels of its instance, thus producing the mask prediction of that instance and distinguishing individual instances. We illustrate the process in Fig. 1. The instance-aware filters are generated by modifying an object detector. Specifically, we add a new controller branch to generate the filters for the target instance of each box predicted by the detector, as shown in Fig. 3. Therefore, the number of dynamic mask heads is the same as the number of predicted boxes, which should equal the number of instances in the image if the detector works well. In this work, we build CondInst on the popular object detector FCOS [11] due to its simplicity and flexibility. Also, the elimination of anchor-boxes in FCOS saves parameters and computation.

As shown in Fig. 3, following FCOS [11], we make use of the feature maps {P3, P4, P5, P6, P7} of feature pyramid networks (FPNs) [42], whose down-sampling ratios are 8, 16, 32, 64 and 128, respectively. As shown in Fig. 3, on each feature level of the FPN, some functional layers (in the dashed box) are applied to make instance-aware predictions, for example, the class of the target instance and the dynamically-generated filters for the instance. In this sense, CondInst can be viewed in the same way as Mask R-CNN: both first attend to instances in an image and then predict the pixel-level masks of the instances (i.e., instance-first).

Moreover, recall that Mask R-CNN employs an object detector to predict the bounding-boxes of the instances in the input image. The bounding-boxes are actually the way that Mask R-CNN represents instances. Similarly, CondInst employs the instance-aware filters to represent the instances. In other words, instead of encoding the instance information
with the bounding-boxes, CondInst implicitly encodes it with the parameters of the generated dynamic filters, which is much more flexible. For example, the dynamic filters can easily represent the irregular shapes that are hard to be tightly enclosed by a bounding-box (elaborated in Sec. 4.4). This is one of CondInst's advantages over the previous ROI-based methods.

Besides the detector, as shown in Fig. 3, there is also a bottom branch, which provides the feature maps (denoted by Fbottom) that our generated mask heads take as inputs to predict the desired instance mask. The bottom branch aggregates the FPN feature maps P3, P4 and P5. To be specific, P4 and P5 are upsampled to the resolution of P3 with bilinear interpolation and added to P3. After that, four 3 × 3 convolutions with 128 channels are applied. The resolution of the resulting feature maps is the same as that of P3 (i.e., 1/8 of the input image resolution). Finally, another convolutional layer is used to reduce the number of output channels Cbottom from 128 to 8, resulting in the bottom feature Fbottom. The small output channel count reduces the number of generated parameters. We empirically found that using Cbottom = 8 can already achieve good performance, and as shown in our experiments, a larger Cbottom here (e.g., 16) cannot improve the performance. Even more aggressively, using Cbottom = 1 only degrades the performance by ∼1% in mask AP. It is probably because our mask heads only predict relatively simple class-agnostic instance masks and most of the information of an instance has been encoded in the dynamically generated filters.
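For concreteness, the aggregation just described can be sketched in PyTorch as below. This is an illustrative re-implementation based on the description in this section, not the authors' released code; names such as BottomBranch and c_bottom are ours.

```python
import torch.nn as nn
import torch.nn.functional as F

class BottomBranch(nn.Module):
    """Aggregates P3, P4, P5 and produces the 8-channel F_bottom (a sketch)."""
    def __init__(self, fpn_channels=256, mid_channels=128, c_bottom=8):
        super().__init__()
        # Four 3x3 convolutions with 128 channels, as described in the text.
        layers, in_ch = [], fpn_channels
        for _ in range(4):
            layers += [nn.Conv2d(in_ch, mid_channels, 3, padding=1), nn.ReLU(inplace=True)]
            in_ch = mid_channels
        self.tower = nn.Sequential(*layers)
        # One more convolution reduces the channels from 128 to C_bottom = 8.
        self.reduce = nn.Conv2d(mid_channels, c_bottom, 1)

    def forward(self, p3, p4, p5):
        # Upsample P4 and P5 to the resolution of P3 (1/8 of the image) and sum.
        p4_up = F.interpolate(p4, size=p3.shape[-2:], mode="bilinear", align_corners=False)
        p5_up = F.interpolate(p5, size=p3.shape[-2:], mode="bilinear", align_corners=False)
        x = p3 + p4_up + p5_up
        return self.reduce(self.tower(x))   # F_bottom: (N, 8, H/8, W/8)
```

Keeping Cbottom small matters mainly because it determines the input width of the dynamic mask head and hence the number of parameters the controller must generate.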
As mentioned before, the generated filters can also encode the shape and position of the target instance. Since the CNN feature maps do not generally convey the position information, a map of the coordinates needs to be appended to Fbottom such that the generated filters are aware of positions. As the filters are generated with location-agnostic convolutions, they can only (implicitly) encode the shape and position with the coordinates relative to the location where the filters are generated (i.e., using the coordinate system with that location as the origin). Thus, as shown in Fig. 3, Fbottom is combined with a map of relative coordinates, which are obtained by transforming all the locations on Fbottom to the coordinate system with the filter-generating location as the origin. Then, the combination is sent to the mask head to predict the instance mask in the fully convolutional fashion. The relative coordinates provide a strong cue for predicting the instance mask, as shown in our experiments. It is also interesting to note that even if the generated mask heads only take as input the map of relative coordinates, a modest performance can be obtained, as shown in the experiments. This empirically proves that the generated filters indeed encode the shape and position of the target instance. Finally, sigmoid is used as the last layer of the mask head and obtains the mask scores. The mask head only classifies the pixels as foreground or background. The class of the instance is predicted by the classification head of the detector, as shown in Fig. 3.
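A possible way to form the relative-coordinate map is sketched below. It is our illustration only; the normalization constant and the function name are assumptions and are not taken from the paper.

```python
import torch

def rel_coord_map(h, w, cx, cy, stride=8, scale=128.0, device="cpu"):
    """Coordinates of every F_bottom location relative to (cx, cy).

    (cx, cy) is the image-plane location that generated the filters; the map is
    expressed in the same units and divided by `scale` for normalization
    (the normalization constant is an assumption of this sketch).
    """
    ys = torch.arange(h, device=device, dtype=torch.float32) * stride
    xs = torch.arange(w, device=device, dtype=torch.float32) * stride
    grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
    return torch.stack((grid_x - cx, grid_y - cy), dim=0) / scale   # (2, H, W)

# F_tilde for one instance: concatenate the 2-channel map with the 8-channel F_bottom:
# f_tilde = torch.cat([rel_coord_map(H, W, cx, cy), f_bottom], dim=0)   # (10, H, W)
```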
The resolution of the original mask prediction is the same as the resolution of Fbottom, which is 1/8 of the input image resolution. In order to improve the resolution of instance masks, we use bilinear interpolation to upsample the mask prediction by a factor of 2, resulting in 200 × 256 instance masks (if the input image size is 800 × 1024). This resolution is much higher than that of Mask R-CNN's masks (only 28 × 28, as mentioned before).

3.2 Network Outputs and Training Targets

Similar to FCOS, each location on the FPN's feature maps Pi is either associated with an instance, thus being a positive sample, or considered a negative sample. The associated instance and label for each location are determined as follows.

Let us consider the feature map Pi ∈ R^(H×W×C) and let s be its down-sampling ratio. As shown in previous works [11], [44], [45], a location (x, y) on the feature maps can be mapped back onto the input image as $(\lfloor s/2 \rfloor + xs,\ \lfloor s/2 \rfloor + ys)$. If the mapped location falls in the center region of an instance, the location is considered to be responsible for the instance. Any locations outside the center regions are labeled as negative samples. The center region is defined as the box (cx − rs, cy − rs, cx + rs, cy + rs), where (cx, cy) denotes the mass center of the instance mask, s is the down-sampling ratio of Pi and r is a constant scalar being 1.5, as in FCOS [11]. As shown in Fig. 3, at a location (x, y) on Pi, CondInst has the following output heads.
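The mapping and the center-region test can be written directly from the formulas above; the following small helpers are our illustration (the function names are ours).

```python
def location_to_image(x, y, s):
    """Map a location (x, y) on an FPN level with stride s back onto the image."""
    return s // 2 + x * s, s // 2 + y * s

def is_positive(x, y, s, mask_center, r=1.5):
    """Check whether the mapped location falls inside the instance's center region.

    mask_center = (cx, cy) is the mass center of the ground-truth mask and r = 1.5
    as in FCOS; locations outside the region are treated as negative samples.
    """
    px, py = location_to_image(x, y, s)
    cx, cy = mask_center
    return (cx - r * s <= px <= cx + r * s) and (cy - r * s <= py <= cy + r * s)
```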
Classification Head. The classification head predicts the class of the instance associated with the location. The ground-truth target is the instance's class ci or 0 (i.e., background). As in FCOS, the network predicts a C-D vector px,y for the classification, and each dimension of px,y corresponds to a binary classifier, where C is the number of categories.

Controller Head. The controller head, which has the same architecture as the classification head, is used to predict the parameters of the conv. filters of the mask head for the instance at the location. The mask head predicts the mask for this particular instance. This is the core contribution of our work. To predict the parameters, we concatenate all the parameters of the filters (i.e., weights and biases) together as an N-D vector θx,y, where N is the total number of the parameters. Accordingly, the controller head has N output channels. The mask head is a very compact FCN architecture, which has three 1×1 convolutions, each having 8 channels and using ReLU as the activation function except for the last one. No normalization layer such as batch normalization is used here. The last layer has 1 output channel and uses sigmoid to predict the probability of being foreground. The mask head has 169 parameters in total (#weights = (8+2)×8 (conv1) + 8×8 (conv2) + 8×1 (conv3) and #biases = 8 (conv1) + 8 (conv2) + 1 (conv3)). The masks predicted by the mask heads are supervised with the ground-truth instance masks, which pushes the controller to generate the correct filters.
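To make the parameter layout concrete, the sketch below splits a 169-dimensional vector θx,y into the weights and biases of the three 1×1 convolutions and applies them with F.conv2d. It is a minimal single-instance illustration of the idea; the actual implementation batches all instances (e.g., with grouped convolutions), which this sketch does not attempt.

```python
import torch
import torch.nn.functional as F

# Channel layout of the dynamic mask head: (8+2) -> 8 -> 8 -> 1 with 1x1 convs.
CHANNELS = [10, 8, 8, 1]   # 169 parameters: 80 + 64 + 8 weights and 8 + 8 + 1 biases.

def dynamic_mask_head(f_tilde, theta):
    """Apply one instance's dynamically-generated filters to F_tilde.

    f_tilde: (1, 10, H, W) bottom features with relative coordinates appended.
    theta:   (169,) parameters predicted by the controller for this instance.
    """
    x, offset = f_tilde, 0
    for i in range(len(CHANNELS) - 1):
        c_in, c_out = CHANNELS[i], CHANNELS[i + 1]
        w = theta[offset:offset + c_in * c_out].reshape(c_out, c_in, 1, 1)
        offset += c_in * c_out
        b = theta[offset:offset + c_out]
        offset += c_out
        x = F.conv2d(x, w, b)
        if i < len(CHANNELS) - 2:          # ReLU after all but the last conv.
            x = F.relu(x)
    return torch.sigmoid(x)                # (1, 1, H, W) foreground probability
```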
Box Head. The box head is the same as that in FCOS, which predicts a 4-D vector encoding the four distances from the location to the four boundaries of the bounding-box of the target instance. Conceptually, CondInst can eliminate the box head since CondInst needs no ROIs. However, we note that if we make use of box-based NMS, the inference time will be much reduced, since we only need to compute the masks for the instances kept after box NMS. Thus, we still predict boxes in CondInst. We would like to highlight that the predicted boxes are only used in NMS and do not involve any ROI operations. Moreover, as shown in Table 6, the box prediction can be removed if other kinds of NMS are used (e.g., mask NMS [34]). This is fundamentally different from previous ROI-based methods, in which the box prediction is mandatory.

Center-ness Head. Like FCOS [11], at each location we also predict a center-ness score. The center-ness score depicts how much the location deviates from the center of the target instance. In inference, it is used to down-weight the boxes predicted by the locations far from the center, as these boxes might be unreliable. The ground-truth center-ness score is defined as

$$\mathrm{centerness}^{*} = \sqrt{\frac{\min(l^{*}, r^{*})}{\max(l^{*}, r^{*})} \cdot \frac{\min(t^{*}, b^{*})}{\max(t^{*}, b^{*})}}, \qquad (1)$$

where l*, r*, t* and b* denote the distances from the location to the four boundaries of the ground-truth bounding-box. We use the binary cross-entropy (BCE) loss to supervise the center-ness score, as in FCOS.
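For reference, Eq. (1) translates into the following small helper (ours; the eps guard against division by zero is an addition for numerical safety).

```python
import math

def centerness_target(l, t, r, b, eps=1e-6):
    """Ground-truth center-ness from the distances to the four box boundaries."""
    return math.sqrt(
        (min(l, r) / max(max(l, r), eps)) * (min(t, b) / max(max(t, b), eps))
    )
```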
Fig. 4 – Illustration of CondInst for panoptic segmentation by attaching a semantic segmentation branch. The semantic segmentation branch follows [38]. Results from the instance segmentation and semantic segmentation branches are combined using the same post-processing as in [18].

Semantic Branch for Panoptic Segmentation. As mentioned before, we can extend CondInst to panoptic segmentation by adding a new semantic segmentation branch. For the semantic segmentation branch, we use the structure from Panoptic-FPN [38]. To be specific, as shown in Fig. 4, the semantic segmentation branch takes as inputs the feature maps {P2, P3, P4, P5} of the FPN. {P3, P4, P5} are up-sampled to the same resolution as P2 and the four feature maps are concatenated together. The resolution of P2 is 1/4 of the input image, which is also the same as that of the instance masks predicted by CondInst. The concatenation is then followed by a 1 × 1 convolution and softmax to obtain the semantic segmentation classification scores. The classification scores are trained with the cross-entropy loss. In inference, the semantic segmentation results are merged with the above instance masks to generate the final panoptic segmentation results. The details can be found in Sec. 3.4.

3.3 Loss Functions

Formally, the overall loss function of CondInst can be formulated as

$$L_{\mathrm{overall}} = L_{\mathrm{fcos}} + \lambda L_{\mathrm{mask}} + \mu L_{\mathrm{pano}}, \qquad (2)$$

where L_fcos and L_mask denote the original loss of FCOS and the loss for instance masks, respectively. L_pano (only used in panoptic segmentation) is the loss for the semantic branch of panoptic segmentation. λ and µ, being 1 and 0.5, respectively, are used to balance these losses. L_fcos is the same as in FCOS. Specifically, L_fcos includes the classification, box regression and center-ness heads, which are trained with the focal loss [46], the GIoU loss, and the binary cross-entropy (BCE) loss, respectively. L_mask is defined as

$$L_{\mathrm{mask}}(\{\theta_{x,y}\}) = \frac{1}{N_{\mathrm{pos}}} \sum_{x,y} \mathbb{1}_{\{c^{*}_{x,y} > 0\}} \, L_{\mathrm{dice}}\big(M_{x,y}, M^{*}_{x,y}\big), \qquad (3)$$

where c*_{x,y} is the classification label of location (x, y), which is the class of the instance associated with the location, or 0 (i.e., background) if the location is not associated with any instance. N_pos is the number of locations where c*_{x,y} > 0, and the indicator function is 1 if c*_{x,y} > 0 and 0 otherwise. M*_{x,y} ∈ {0, 1}^(H×W) is the ground-truth mask of the instance associated with location (x, y), and M_{x,y} is the mask predicted by the dynamic mask head of location (x, y). Formally,

$$M_{x,y} = \mathrm{MaskHead}\big(\tilde{F}_{x,y};\ \theta_{x,y}\big), \qquad (4)$$

where θ_{x,y} are the generated filters' parameters at location (x, y). F̃_{x,y} ∈ R^(H_bottom × W_bottom × (C_bottom + 2)) is the combination of Fbottom and a map of coordinates O_{x,y} ∈ R^(H_bottom × W_bottom × 2). As described before, O_{x,y} contains the relative coordinates from all the locations on Fbottom to (x, y) (i.e., the location where the filters are generated). MaskHead consists of a stack of convolutions with the dynamic parameters θ_{x,y}.

Moreover, L_dice is the Dice loss as in [47], which is used to overcome the foreground-background sample imbalance. We do not employ focal loss here, as it requires initializing the biases with a prior probability [46], which is not trivial if the parameters are dynamically generated. Formally, L_dice is defined as

$$L_{\mathrm{dice}}(M, M^{*}) = 1 - \frac{2 \sum_{i,j} M_{i,j} M^{*}_{i,j}}{\sum_{i,j} (M_{i,j})^{2} + \sum_{i,j} (M^{*}_{i,j})^{2}}, \qquad (5)$$

where M_{i,j} and M*_{i,j} denote the elements of M_{x,y} and M*_{x,y}, and the subscript (x, y) is omitted for clarity. Note that, in order to compute the loss between the predicted mask M_{x,y} and the ground-truth mask M*_{x,y}, they need to have the same size. As mentioned before, the resolution of the predicted mask M_{x,y} is 1/4 of that of the ground-truth mask M*_{x,y}. Thus, we down-sample M*_{x,y} by 4 to make the sizes equal. This operation is omitted in Eq. (5) for clarity.
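Eq. (5) maps directly to code; the sketch below (ours, with a small smoothing constant added for numerical stability) computes the Dice loss for one predicted/ground-truth mask pair.

```python
def dice_loss(pred, target, eps=1e-5):
    """Dice loss of Eq. (5); pred and target are (H, W) tensors of the same size."""
    pred, target = pred.flatten(), target.float().flatten()
    numerator = 2.0 * (pred * target).sum()
    denominator = (pred ** 2).sum() + (target ** 2).sum()
    return 1.0 - numerator / (denominator + eps)
```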
By design, all the positive locations on the feature maps should be used to compute the mask loss. For images having hundreds of positive locations, the model would consume a large amount of memory. Therefore, in our preliminary version [48], the positive locations used in computing the mask loss are limited to at most 500 per GPU (i.e., 250 per image, as we have two images on one GPU). If there are more than 500 positive locations, 500 locations are randomly chosen. In this version, instead of randomly choosing the 500 locations, we first rank the locations by the scores predicted by the FCOS detector, and then choose the locations with the top scores for each instance. As a result, the number of locations per image can be reduced to 64. This strategy works equally well and further reduces the memory footprint. For instance, using this strategy, the ResNet-50 based CondInst can be trained with four 1080Ti GPUs.

Moreover, as shown in YOLACT [1] and BlendMask [32], during training, the instance segmentation task can benefit from a joint semantic segmentation task (i.e., using instance masks as semantic labels). Thus, we also conduct experiments with the joint semantic segmentation task, showing improved performance. However, unless explicitly specified, all the experiments in the paper are without the semantic segmentation task. If used, the semantic segmentation loss is added to L_overall.

3.4 Inference

Instance Segmentation. Given an input image, we forward it through the network to obtain the outputs, including the classification confidence px,y, the center-ness scores, the box prediction tx,y and the generated parameters θx,y. We first follow the steps in FCOS to obtain the box detections. Afterwards, box-based NMS with a threshold of 0.6 is used to remove duplicated detections, and then the top 100 boxes are used to compute masks. Note that each box is also associated with a group of filters generated by the controller. Let us assume that K boxes remain after the NMS; thus we have K groups of generated filters. The K groups of filters are used to produce K instance-specific mask heads. These instance-specific mask heads are applied, in the fashion of FCNs, to F̃x,y (i.e., the combination of Fbottom and Ox,y) to predict the masks of the instances. Since the mask head is a very compact network (having three 1 × 1 convolutions with 8 channels and 169 parameters in total), the overhead of computing masks is extremely small. For example, even with 100 detections (i.e., the maximum number of detections per image on MS-COCO), less than 5 milliseconds in total are spent on the mask heads, which only adds ∼10% computational time to the base detector FCOS. In contrast, the mask head of Mask R-CNN has four 3 × 3 convolutions with 256 channels, thus having more than 2.3M parameters and taking longer computational time.

Panoptic Segmentation. For panoptic segmentation, we follow [38] to combine the instance and semantic results into the panoptic results. We first rank the instance results from CondInst by their confidence scores generated by FCOS. The results with scores less than 0.45 are discarded. When overlaps occur between the instance masks, the overlap areas are attributed to the instance with the higher score. Moreover, an instance that loses more than 40% of its total area due to the overlap with other higher-score instances is discarded. Finally, the semantic results are filled into the areas that are not occupied by any instance.
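The merging procedure described above can be summarized by the following sketch. It is illustrative only: the thresholds follow the text, while the tensor layout, label conventions and function name are our assumptions.

```python
import torch

def merge_panoptic(inst_masks, inst_scores, sem_seg,
                   score_thresh=0.45, overlap_thresh=0.4):
    """Combine instance masks and a semantic map into a panoptic map (sketch).

    inst_masks:  (K, H, W) boolean instance masks.
    inst_scores: (K,) detection confidences from FCOS.
    sem_seg:     (H, W) per-pixel stuff labels.
    Returns an (H, W) map where 0 is unassigned, 1..K index kept instances and
    stuff labels are written with an offset of 1000 (our convention).
    """
    panoptic = torch.zeros_like(sem_seg, dtype=torch.long)
    # Process instances in descending order of confidence.
    for idx in torch.argsort(inst_scores, descending=True):
        if inst_scores[idx] < score_thresh:
            continue
        mask = inst_masks[idx] & (panoptic == 0)   # overlaps go to the higher score
        area = inst_masks[idx].sum()
        if area == 0 or mask.sum().float() / area < (1.0 - overlap_thresh):
            continue                               # lost > 40% of its area: discard
        panoptic[mask] = int(idx) + 1
    # Fill the remaining pixels with the stuff predictions.
    unassigned = panoptic == 0
    panoptic[unassigned] = sem_seg[unassigned].long() + 1000
    return panoptic
```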
4 EXPERIMENTS

We evaluate CondInst on the large-scale benchmark MS-COCO [43]. Following common practice [2], [11], [46], our models are trained with the split train2017 (115K images) and all the ablation experiments are evaluated on the split val2017 (5K images). Our main results are reported on the test-dev split (20K images).

4.1 Implementation Details

Unless specified otherwise, we make use of the following implementation details. Following FCOS [11], ResNet-50 is used as our backbone network and the weights pre-trained on ImageNet [49] are used to initialize it. The newly added layers are initialized as in [11]. Our models are trained with stochastic gradient descent (SGD) over 8 V100 GPUs for 90K iterations with an initial learning rate of 0.01 and a mini-batch of 16 images. The learning rate is reduced by a factor of 10 at iterations 60K and 80K, respectively. Weight decay and momentum are set to 0.0001 and 0.9, respectively. Following Detectron2 [3], the input images are resized to have their shorter sides in [640, 800] and their longer sides less than or equal to 1333 during training. Left-right flipping data augmentation is also used during training. When testing, we do not use any data augmentation and only the scale with the shorter side being 800 is used. The inference time in this work is measured on a single V100 GPU with 1 image per batch.

4.2 Architectures of the Mask Head

In this section, we discuss the design choices of the mask head in CondInst. We show that the performance is not sensitive to the architecture of the mask head. Our baseline is the mask head of three 1 × 1 convolutions with 8 channels (i.e., width = 8). As shown in Table 1 (3rd row), it achieves 35.6% in mask AP. Next, we conduct experiments by varying the depth of the mask head. As shown in Table 1a, apart from the mask head with depth 1, all other mask heads (i.e., depth = 2, 3 and 4) attain similar performance. The mask head with depth 1 achieves inferior performance, as in this case the mask head is actually a linear mapping, which has overly weak capacity and cannot encode the complex shapes of the instances. Moreover, as shown in Table 1b, varying the width (i.e., the number of channels) does not result in a remarkable performance change either, as long as the width is in a reasonable range. We also note that our mask head is extremely light-weight, as the filters in our mask head are dynamically generated. As shown in Table 1, our baseline mask head only takes 4.5 ms per 100 instances (the maximum number of instances on MS-COCO), which suggests that our mask head only adds small computational overhead to the base detector. Moreover, our baseline mask head only has 169 parameters in total. In sharp contrast, the mask head of Mask R-CNN [2] has more than 2.3M parameters and takes ∼2.5× the computational time (11.4 ms per 100 instances).

depth time AP AP50 AP75 APS APM APL
1 2.2 30.5 52.7 30.7 13.7 32.8 44.9
2 3.3 35.5 56.2 37.9 17.1 38.8 51.2
3 4.5 35.6 56.4 37.9 18.0 38.9 50.8
4 5.6 35.6 56.3 37.8 17.3 38.9 51.0
(a) Varying the depth (width = 8).

width time AP AP50 AP75 APS APM APL
2 2.5 33.9 55.3 35.8 15.8 37.0 48.6
4 2.6 35.4 56.3 37.4 16.9 38.7 51.2
8 4.5 35.6 56.4 37.9 18.0 39.1 50.8
16 4.7 35.7 56.1 38.1 16.9 39.0 50.8
(b) Varying the width (depth = 3).

TABLE 1 – Instance segmentation results with different architectures of the mask head on the MS-COCO val2017 split. "depth": the number of layers in the mask head. "width": the number of channels of these layers. "time": the milliseconds that the mask head takes to process 100 instances.
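As a worked example of how the number of dynamically-generated parameters scales with the mask head's depth and width (the quantities varied in Table 1), the small helper below reproduces the counting; it is our addition for illustration.

```python
def num_dynamic_params(depth=3, width=8, in_channels=10):
    """Parameters of a mask head with `depth` 1x1 convs of `width` channels.

    in_channels = C_bottom + 2 (8 bottom-feature channels plus 2 relative
    coordinates); the last conv always has a single output channel.
    """
    total, c_in = 0, in_channels
    for i in range(depth):
        c_out = 1 if i == depth - 1 else width
        total += c_in * c_out + c_out   # weights + biases of one 1x1 conv
        c_in = c_out
    return total

assert num_dynamic_params(3, 8) == 169   # the baseline head used in the paper
```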
4.3 Design Choices of the Bottom Module

We further investigate the impact of the bottom module. We first change Cbottom, which is the number of channels of the bottom branch's output feature maps (i.e., Fbottom). As shown in Table 2, as long as Cbottom is in a reasonable range (i.e., from 2 to 16), the performance keeps almost the same. Cbottom = 8 is optimal, and thus we use Cbottom = 8 in all other experiments by default.

Cbottom AP AP50 AP75 APS APM APL
1 34.7 56.0 36.8 16.5 37.9 50.1
2 34.9 55.7 37.2 16.5 38.3 50.6
4 35.5 56.3 37.5 17.8 38.7 50.7
8 35.6 56.4 37.9 18.0 39.1 50.8
16 35.4 56.0 37.5 16.9 38.7 50.9

TABLE 2 – Instance segmentation results by varying the number of channels of the bottom branch's output (i.e., Cbottom) on the MS-COCO val2017 split. The performance keeps almost the same if Cbottom is in a reasonable range, which suggests that CondInst is robust to this design choice.

We also conduct experiments by varying the input FPN features of the bottom module. Specifically, we change the FPN feature level from P3 (stride 8) to P2 (stride 4) for the bottom module. As shown in Table 4, this can improve the mask AP from 35.6% to 36.0% with 20% more inference time. Moreover, as mentioned before, before being taken as the input of the mask heads, the bottom module's output Fbottom is concatenated with a map of relative coordinates, which provides a strong cue for the mask prediction. As shown in Table 3, the performance drops significantly if the relative coordinates are removed (35.6% vs. 31.5%). We also experiment with absolute coordinates, but they cannot largely boost the performance, as shown in Table 3 (32.0%). This is understandable because an instance segmentation model should be translation-equivariant. Besides, as shown in Table 3 (2nd row), only using the relative coordinates can also obtain decent performance (31.3% in mask AP). The qualitative results are shown in Fig. 5.

Fig. 5 – Qualitative results without relative coordinates or bottom features as inputs to the dynamic mask heads. From top to bottom: only with relative coordinates, only with bottom features, and with both. We can see that the bottom features are crucial to the details of the instance masks, and the relative coordinates can help the model distinguish between different instances.

4.4 What Do the Generated Filters Encode?

It is not straightforward to see what the generated filters encode. However, this can be analyzed by varying the inputs of the dynamic filters and visualizing the changes of the results. As shown in Fig. 5, it can be noted that if the mask heads only take the relative coordinates as inputs, our model is able to obtain the coarse contour of the instance. This suggests that the generated dynamic filters can attend to the target instance according to the relative coordinates, and that they encode the contour of the target instance. The generated dynamic filters can thus also be viewed as a representation of a contour. This is different from Mask R-CNN, which attends to a target instance by an axis-aligned ROI produced by Faster R-CNN; CondInst instead encodes the instance's contour into the generated filters. Thus, CondInst can easily represent any shape, including irregular ones, being much more flexible. Moreover, if the bottom features are added, the dynamic filters can produce the details of the instance masks. This suggests that the generated filters look at the bottom features to obtain the details of the instance masks.

4.5 How Important Is It to Upsample Mask Predictions?

As mentioned before, the original mask prediction is upsampled, and the upsampling is of great importance to the final performance. We confirm this in the experiments. As shown in Table 5, without the upsampling (1st row in the table), CondInst can only produce the mask prediction at 1/8 of the input image resolution, which merely achieves 34.6% in mask AP because most of the details (e.g., the boundaries) are lost. If the mask prediction is upsampled by factor = 2, the performance can be significantly improved by 1% in mask AP (from 34.6% to 35.6%). In particular, the improvement on small objects is large (from 15.6% to 18.0%), which suggests that the upsampling can greatly retain the details of objects. Increasing the upsampling factor to 4 slightly worsens the performance in some metrics, probably due to the relatively low-quality annotations of MS-COCO. Therefore, we use factor = 2 in all other models.
w/ abs. coord. w/ rel. coord. w/ Fbottom AP AP50 AP75 APS APM APL AR1 AR10 AR100
X 31.5 53.5 32.0 14.8 34.6 44.8 28.0 43.6 45.6
X 31.3 55.0 31.9 15.6 34.1 44.3 27.1 43.3 45.6
X X 32.0 53.4 32.7 14.6 34.1 47.0 28.7 44.6 46.6
X X 35.6 56.4 37.9 18.0 39.1 50.8 30.3 48.7 51.3

TABLE 3 – Ablation study of the input to the mask head on MS-COCO val2017 split. As shown in the table, without the
relative coordinates, the performance drops significantly from 35.6% to 31.5% in mask AP. Using the absolute coordinates
cannot improve the performance remarkably. In addition, it is worth noting that if the mask head only takes as inputs the
relative coordinates (i.e., no appearance features in this case), CondInst also achieves modest performance.

FPN level AP AP50 AP75 APS APM APL
P3 35.6 56.4 37.9 18.0 39.1 50.8
P2 36.0 56.6 38.4 17.6 38.9 51.7

TABLE 4 – Instance segmentation results on the MS-COCO val2017 split by varying the FPN feature level used by the bottom module. Using P2 has better performance, but it increases the inference latency by about 20%.

NMS AP AP50 AP75 APS APM APL
box 35.6 56.4 37.9 18.0 39.1 50.8
mask 35.6 56.5 37.7 18.0 39.1 50.7

TABLE 6 – Instance segmentation results with different NMS algorithms. Mask-based NMS obtains the same overall performance as box-based NMS, which suggests that CondInst can eliminate the box detection.

factor resolution AP AP50 AP75 APS APM APL
1 1/8 34.6 55.6 36.4 15.6 38.7 51.7
2 1/4 35.6 56.4 37.9 18.0 39.1 50.8
4 1/2 35.6 56.2 37.7 16.9 38.8 50.8

TABLE 5 – Instance segmentation results on the MS-COCO val2017 split by changing the factor used to upsample the mask predictions. "resolution" denotes the resolution ratio of the mask prediction to the input image. Without the upsampling (i.e., factor = 1), the performance drops significantly. Similar results are obtained with factor 2 or 4.

4.6 CondInst without Bounding-box Detection

Although we still keep the bounding-box detection branch in CondInst, it is conceptually feasible to eliminate it if we make use of an NMS that uses no bounding-boxes. In this case, all the foreground samples (predicted by the classification head) will be used to compute instance masks, and the duplicated masks will be removed by mask-based NMS. This is confirmed in Table 6. As shown in the table, by removing the box branch in inference and using the mask-based NMS, performance similar to box-based NMS can be obtained (35.6% vs. 35.6% in mask AP). The similar performance of mask and box NMS is probably due to the fact that the instances of MS-COCO are often not very dense. Also, although mask and box NMS can have similar latency with highly-optimized GPU implementations, it is worth noting that we need to compute the masks of all the foreground instances before the mask NMS can be applied. The underlying detector FCOS often predicts thousands of foreground instances, and thus it takes significantly longer to obtain the masks of all the foreground instances. This makes the model with mask NMS significantly slower than the one with box NMS (often more than 2 times slower).
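Mask-based NMS simply replaces box IoU with mask IoU; a simple (unoptimized) version is sketched below. This is our illustration of the idea, not the implementation benchmarked in Table 6.

```python
import torch

def mask_nms(masks, scores, iou_thresh=0.5):
    """Greedy NMS on binary masks; returns indices of the kept instances.

    masks: (K, H, W) boolean masks, scores: (K,) confidences.
    """
    order = torch.argsort(scores, descending=True)
    keep, areas = [], masks.flatten(1).float().sum(dim=1)
    for idx in order.tolist():
        suppressed = False
        for kept in keep:
            inter = (masks[idx] & masks[kept]).sum().float()
            union = areas[idx] + areas[kept] - inter
            if union > 0 and inter / union > iou_thresh:
                suppressed = True
                break
        if not suppressed:
            keep.append(idx)
    return keep
```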
4.7 Comparisons with State-of-the-art Methods

We compare CondInst against previous state-of-the-art methods on the MS-COCO test-dev split. As shown in Table 7, with the 1× learning rate schedule (i.e., 90K iterations), CondInst outperforms the original Mask R-CNN by 0.7% (35.3% vs. 34.6%). CondInst also achieves a much faster speed than the original Mask R-CNN (49ms vs. 65ms per image on a single V100 GPU). To our knowledge, it is the first time that a new and simpler instance segmentation method, without any bells and whistles, outperforms Mask R-CNN both in accuracy and speed. CondInst also obtains better performance (35.9% vs. 35.5%) and on-par speed (49ms vs. 49ms) compared with the well-engineered Mask R-CNN in Detectron2 (i.e., Mask R-CNN∗ in Table 7). Furthermore, with a longer training schedule (e.g., 3×) or a stronger backbone (e.g., ResNet-101), a consistent improvement is achieved as well (37.7% vs. 37.5% with ResNet-50 3× and 39.1% vs. 38.8% with ResNet-101 3×). Moreover, as shown in Table 7, with the auxiliary semantic segmentation task, the performance can be boosted from 37.7% to 38.6% (ResNet-50) or from 39.1% to 40.0% (ResNet-101), without increasing the inference time. For fair comparisons, all the inference times here are measured by ourselves on the same hardware with the official code.

We also compare CondInst with the recently-proposed instance segmentation methods. With only half the training iterations, CondInst surpasses TensorMask [50] by a large margin (37.7% vs. 35.4% for ResNet-50 and 39.1% vs. 37.1% for ResNet-101). CondInst is also ∼8× faster than TensorMask (49ms vs. 380ms per image on the same GPU) with similar performance (37.7% vs. 37.1%). Moreover, CondInst outperforms YOLACT-700 [1] by a large margin with the same backbone ResNet-101 (40.0% vs. 31.2%, both with the auxiliary semantic segmentation task). Furthermore, as shown in Fig. 2, compared with YOLACT-700 and Mask R-CNN, CondInst can preserve more details and produce higher-quality instance segmentation results.
method backbone aug. sched. AP (%) AP50 AP75 APS APM APL
Mask R-CNN [2] R-50-FPN 1× 34.6 56.5 36.6 15.4 36.3 49.7
CondInst R-50-FPN 1× 35.3 56.4 37.4 18.2 37.8 46.7
Mask R-CNN∗ R-50-FPN X 1× 35.5 57.0 37.8 19.5 37.6 46.0
Mask R-CNN∗ R-50-FPN X 3× 37.5 59.3 40.2 21.1 39.6 48.3
TensorMask [50] R-50-FPN X 6× 35.4 57.2 37.3 16.3 36.8 49.3
BlendMask w/ sem. [32] R-50-FPN X 3× 37.0 58.9 39.7 17.3 39.4 52.5
CondInst R-50-FPN X 1× 35.9 57.0 38.2 19.0 38.6 46.7
CondInst R-50-FPN X 3× 37.7 58.9 40.3 20.4 40.2 48.9
CondInst w/ sem. R-50-FPN X 3× 38.6 60.2 41.4 20.6 41.0 51.1
Mask R-CNN R-101-FPN X 6× 38.3 61.2 40.8 18.2 40.6 54.1
Mask R-CNN∗ R-101-FPN X 3× 38.8 60.9 41.9 21.8 41.4 50.5
YOLACT-700 [1] R-101-FPN X 4.5× 31.2 50.6 32.8 12.1 33.3 47.1
PolarMask [35] R-101-FPN X 2× 32.1 53.7 33.1 14.7 33.8 45.3
TensorMask R-101-FPN X 6× 37.1 59.3 39.4 17.4 39.1 51.6
SOLO [33] R-101-FPN X 6× 37.8 59.5 40.4 16.4 40.6 54.2
BlendMask∗ w/ sem. R-101-FPN X 3× 39.6 61.6 42.6 22.4 42.2 51.4
SOLOv2 [34] R-101-FPN X 6× 39.7 60.7 42.9 17.3 42.9 57.4
CondInst R-101-FPN X 3× 39.1 60.8 41.9 21.0 41.9 50.9
CondInst w/ sem. R-101-FPN X 3× 40.0 62.0 42.9 21.4 42.6 53.0
CondInst w/ sem. R-101-BiFPN X 3× 40.5 62.4 43.4 21.8 43.3 53.3
CondInst w/ sem. DCN-101-BiFPN X 3× 41.3 63.3 44.4 22.5 43.9 55.2

TABLE 7 – Instance segmentation comparisons with state-of-the-art methods on MS-COCO test-dev. “Mask R-CNN” is the
original Mask R-CNN [2]. “Mask R-CNN∗ ” and “BlendMask∗ ” mean that the models are improved by Detectron2 [3]. “aug.”:
using multi-scale data augmentation during training. “sched.”: the learning rate schedule. 1× is 90K iterations, 2× is 180K
iterations and so on. The learning rate is changed as in [51]. “w/ sem”: using the auxiliary semantic segmentation task.

method backbone sched. FPS AP AP50 AP75
YOLACT-550++ [52] R-50 4.5× 44 34.1 53.3 36.2
YOLACT-550++ R-101 4.5× 36 34.6 53.8 36.9
CondInst-RT shtw. R-50 4× 43 36.0 57.0 38.0
CondInst-RT shtw. DLA-34 4× 47 35.8 56.5 38.0
CondInst-RT DLA-34 4× 41 36.3 57.3 38.5

TABLE 8 – The mask AP and inference speed of the real-time CondInst models on the COCO test-dev data. “shtw.”: sharing the conv. towers between the classification and box regression branches in FCOS. Both YOLACT++ and CondInst use the auxiliary semantic segmentation loss here. With the same backbone R-50, CondInst-RT outperforms YOLACT++ by 1.9% AP with almost the same speed. All inference time is measured with a single V100 GPU.

The performance and inference speed of these real-time models are shown in Table 8. As shown in the table, the R-50 based CondInst-RT outperforms the R-50 based YOLACT++ [52] by about 2% AP (36.0% vs. 34.1%) and has almost the same speed (43 FPS vs. 44 FPS). By further using a strong backbone, DLA-34 [54], CondInst-RT can achieve 47 FPS with similar performance. Furthermore, if we do not share the classification and box regression towers in FCOS, the performance can be improved to 36.3% AP with slightly longer inference time (41 FPS).

4.9 Instance Segmentation on Cityscapes

We also conduct instance segmentation experiments on Cityscapes [55]. The Cityscapes dataset is designed for the understanding of urban street scenes. For instance segmentation, it has 8 categories: person, rider, car, truck, bus, train, motorcycle, and bicycle. It includes 2975, 500 and 1525 images with fine annotations for training, validation and testing, respectively. It also has 20K training images with coarse annotations. Following Mask R-CNN [2], we only use the images with fine annotations to train our models. All images in Cityscapes have the same resolution of 2048×1024. The performance on Cityscapes is also measured with the COCO-style mask AP, which is the mask AP averaged over ten IoU thresholds from 0.5 to 0.95.
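To spell out the COCO-style averaging mentioned above, the sketch below averages per-threshold APs over the ten IoU thresholds; the example values are placeholders for illustration only, not numbers from our experiments.

import numpy as np

# Mask AP is averaged over ten IoU thresholds: 0.50, 0.55, ..., 0.95.
IOU_THRESHOLDS = np.arange(0.50, 1.00, 0.05)

def coco_style_mask_ap(ap_per_iou: np.ndarray) -> float:
    """Average the per-threshold APs into the single reported mask AP."""
    assert ap_per_iou.shape == IOU_THRESHOLDS.shape  # one AP value per IoU threshold
    return float(ap_per_iou.mean())

# Placeholder values: AP typically decreases as the IoU threshold becomes stricter.
print(coco_style_mask_ap(np.linspace(0.60, 0.20, num=10)))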
We follow the training details in Detectron2 [3] to train CondInst on Cityscapes. Specifically, the models are trained for 24K iterations with batch size 8 (1 image per GPU). The initial learning rate is 0.01, and it is reduced by a factor of 10 at step 18K. Since Cityscapes has relatively few images, following Mask R-CNN, we may initialize the models with the weights pre-trained on the COCO dataset if specified. Moreover, we use multi-scale data augmentation during training, and the shorter side of the images is sampled in the range from 800 to 1024 with step 32. In inference, we only use the original image scale of 2048×1024. Additionally, in order to preserve more details on Cityscapes, we increase the mask output resolution of CondInst from 1/4 to 1/2 of the input image resolution.
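Below is a minimal sketch of the schedule just described (initial learning rate 0.01, divided by 10 at iteration 18K, 24K iterations in total); the helper name is ours and the function only mirrors the textual description.

def cityscapes_step_lr(iteration: int, base_lr: float = 0.01, decay_at: int = 18_000) -> float:
    """Step learning-rate schedule described in the text: constant at `base_lr`,
    then divided by 10 at iteration `decay_at`; training runs for 24K iterations."""
    return base_lr if iteration < decay_at else base_lr / 10.0

assert cityscapes_step_lr(0) == 0.01 and cityscapes_step_lr(20_000) == 0.001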
The results are reported in Table 9. As shown in the table, with the same settings, CondInst generally outperforms the previous strong baseline Mask R-CNN by more than 1% mask AP in all the experiments. On Cityscapes, the auxiliary semantic segmentation loss can also improve the instance segmentation performance; the results with the loss are denoted by “w/ sem.” in Table 9. By further using complementary techniques such as deformable convolutions and BiFPN, the performance can be further boosted, as expected.

4.10 Experiments on Panoptic Segmentation

As mentioned before, CondInst can be easily extended to panoptic segmentation [18] by attaching a new semantic segmentation branch, as depicted in Fig. 4. Here, we conduct the panoptic segmentation experiments on the COCO 2018 dataset. Unless specified, the training and testing details (e.g., image sizes, the number of iterations, etc.) are the same as in the instance segmentation task on COCO.
method backbone training data AP [val] AP AP50 person rider car truck bus train mcycle bicycle
Mask R-CNN ResNet-50-FPN train 31.5 26.2 49.9 30.5 23.7 46.9 22.8 32.2 18.6 19.1 16.0
CondInst ResNet-50-FPN train 33.3 28.6 53.5 31.3 23.4 51.7 23.4 36.0 27.3 19.1 16.6
CondInst w/ sem. ResNet-50-FPN train 33.9 28.6 53.1 31.3 24.2 51.9 21.2 35.9 26.5 20.9 17.0
Mask R-CNN ResNet-50-FPN train+COCO 36.4 32.0 58.1 34.8 27.0 49.1 30.1 40.9 30.9 24.1 18.7
CondInst ResNet-50-FPN train+COCO 37.5 33.2 57.2 35.1 27.7 54.5 29.5 42.3 33.8 23.9 18.9
CondInst w/sem. ResNet-50-FPN train+COCO 37.7 33.7 57.7 35.7 28.0 54.8 29.6 41.4 36.3 24.8 18.9
CondInst w/sem. DCN-101-BiFPN train+COCO 39.3 33.9 58.2 35.6 28.1 55.0 32.1 44.2 33.6 24.5 18.6
CondInst w/sem. ResNet-50-FPN train+val+COCO - 34.4 59.6 36.4 28.4 55.3 32.6 43.3 33.9 24.8 20.1
CondInst w/sem. DCN-101-BiFPN train+val+COCO - 35.1 59.0 35.9 28.7 55.4 34.4 45.7 35.5 25.5 19.6

TABLE 9 – Instance segmentation results on Cityscapes val (“AP [val]” column) and test (remaining columns) splits.
“DCN”: using deformable convolutions in the backbones. “+COCO”: fine-tuning from the models pre-trained on COCO.
“train+val+COCO”: using both train and val splits to train the models evaluated on the test split. “w/ sem.”: using the
auxiliary semantic segmentation loss during training as in COCO.

Fig. 6 – Panoptic segmentation results on the COCO dataset (better viewed on screen). Color encodes categories and instances.

method backbone sched. PQ PQth PQst
CondInst R-50-FPN 1× 42.1 50.4 29.7
Unifying [56] R-50-FPN - 43.6 48.9 35.6
CondInst R-50-FPN 3× 44.6 53.0 31.8
DeeperLab [36] Xception-71 [57] - 34.3 37.5 29.6
Panoptic-DeepLab [58] Xception-71 - 39.7 43.9 33.2
Panoptic-FPN [38] R-101-FPN 3× 40.9 48.3 29.7
AdaptIS [40] ResNeXt-101 1.7× 42.8 50.1 31.8
Axial-DeepLab [39] Axial-ResNet-L - 43.6 48.9 35.6
Panoptic-FCN [59] R-101-FPN 3× 45.5 51.4 36.4
CondInst R-101-FPN 3× 46.1 54.7 33.2
UPSNet [60] DCN-101-FPN 3× 46.6 53.2 36.7
Panoptic-FCN DCN-101-FPN 3× 47.1 53.2 37.8
Unifying DCN-101-FPN - 47.2 53.5 37.7
CondInst DCN-101-FPN 3× 47.8 55.8 35.8

TABLE 10 – Panoptic segmentation on the COCO test-dev data. All results are with single-model and single-scale testing. Here we report comparisons with state-of-the-art methods using various backbones and training schedules (1× means 90K iterations). CondInst achieves the best results among the compared methods.

method backbone PQ PQth PQst
Li et al. [37] - 53.8 42.5 62.1
DeeperLab [36] Xception-71 56.5 - -
Panoptic-FPN [38] R-101-FPN 58.1 52.0 62.5
AdaptIS [40] R-50 59.0 55.8 61.3
UPSNet [60] R-50-FPN 59.3 54.6 62.7
Panoptic-DeepLab [58] R-50 59.7 - -
Unifying [56] R-50-FPN 61.4 54.7 66.3
Panoptic-FCN [59] R-50-FPN 61.4 54.8 66.6
CondInst R-50-FPN 61.7 59.0 63.7

TABLE 11 – Panoptic segmentation on the Cityscapes val set. All results are with single-model and single-scale testing, with no flipping. Here we report comparisons with state-of-the-art methods.
framework for panoptic segmentation, the training targets
of the instance segmentation need to be changed to the
the panoptic segmentation experiments on the COCO 2018 instance annotations in panoptic segmentation accordingly.
dataset. Unless specified, the training and testing details We compare our method with a few state-of-the-art
(e.g., image sizes, the number of iterations and etc.) are the panoptic segmentation methods in Table 10. On the chal-
same as in the instance segmentation task on COCO. lenging COCO test-dev benchmark, we outperform the
Although panoptic segmentation can be viewed as a previous strong baseline Panoptic-FPN [38] by a large mar-
combination of instance segmentation and semantic seg- gin with the same backbone and training schedule (i.e.,
mentation, there is a discrepancy between the ground- from 40.9% to 46.1% in PQ with ResNet-101). Moreover,
truth annotations of the original instance segmentation and compared to AdaptIS [40], which shares some similarity
the instance segmentation task in panoptic segmentation. with us, the ResNet-101 based CondInst achieves dramat-
Panoptic segmentation requires that a pixel in the resulting ically better performance than ResNeXt-101 based AdaptIS
mask has only one label. Therefore if two instances overlap, (46.1% vs. 42.8% PQ). This suggests that using the dynamic
the pixels in the overlapped region will only be assigned to filters here might be more effective than using FiLM [41].
the front instance. However, in the original instance segmen- In addition, compared to the recent methods such as [56]
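The sketch below makes this difference concrete by deriving per-instance binary masks from a panoptic label map in which every pixel carries exactly one segment id, so overlaps are already resolved in favour of the front instance; the array layout and helper name are assumptions for illustration rather than the exact data format used in our code.

import numpy as np

def masks_from_panoptic(panoptic_ids: np.ndarray, thing_ids):
    """Convert an H x W panoptic label map into non-overlapping binary instance masks.
    Unlike standard instance-segmentation ground truth, these masks cannot overlap,
    because each pixel belongs to exactly one segment in the panoptic annotation."""
    return {seg_id: panoptic_ids == seg_id for seg_id in thing_ids}

# Toy example: instance 2 occludes part of instance 1, so the overlapping pixels
# are assigned only to instance 2 in the panoptic ground truth.
panoptic = np.array([[1, 1, 2],
                     [1, 2, 2]])
masks = masks_from_panoptic(panoptic, thing_ids=[1, 2])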

We compare our method with a few state-of-the-art panoptic segmentation methods in Table 10. On the challenging COCO test-dev benchmark, we outperform the previous strong baseline Panoptic-FPN [38] by a large margin with the same backbone and training schedule (i.e., from 40.9% to 46.1% in PQ with ResNet-101). Moreover, compared to AdaptIS [40], which shares some similarity with our method, the ResNet-101 based CondInst achieves dramatically better performance than the ResNeXt-101 based AdaptIS (46.1% vs. 42.8% PQ). This suggests that using dynamic filters here might be more effective than using FiLM [41]. In addition, compared to the recent methods [56] and Panoptic-FCN [59], CondInst also outperforms them considerably. Some qualitative results are shown in Fig. 6.

We also conduct experiments on the panoptic segmentation task of Cityscapes [55], and we follow the training strategy of Panoptic-FPN [38] on this benchmark. Similar to previous works [38], [56], [59], we report the results on the Cityscapes val set. As shown in Table 11, we outperform previous methods on this benchmark as well.

5 CONCLUSION

We have proposed a new and simple instance segmentation framework, termed CondInst. Unlike previous methods such as Mask R-CNN, which employ a mask head with fixed weights, CondInst conditions the mask head on the instances and dynamically generates the filters of the mask head. This not only reduces the parameters and computational complexity of the mask head, but also eliminates the ROI operations, resulting in a faster and simpler instance segmentation framework. To our knowledge, CondInst is the first framework that can outperform Mask R-CNN in both accuracy and speed, without longer training schedules needed. With simple modifications, CondInst can be extended to solve panoptic segmentation and achieves state-of-the-art performance on the challenging COCO dataset. We believe that CondInst can be a strong alternative for both instance and panoptic segmentation.

REFERENCES

[1] D. Bolya, C. Zhou, F. Xiao, and Y. J. Lee, “YOLACT: real-time instance segmentation,” in Proc. IEEE Int. Conf. Comp. Vis., pp. 9157–9166, 2019.
[2] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” in Proc. IEEE Int. Conf. Comp. Vis., pp. 2961–2969, 2017.
[3] Y. Wu, A. Kirillov, F. Massa, W.-Y. Lo, and R. Girshick, “Detectron2.” https://github.com/facebookresearch/detectron2, 2019.
[4] A. Paszke et al., “PyTorch: An imperative style, high-performance deep learning library,” in Proc. Advances in Neural Inf. Process. Syst., pp. 8024–8035, 2019.
[5] M. Abadi et al., “TensorFlow: A system for large-scale machine learning,” in USENIX Symp. Operating Systems Design & Implementation, pp. 265–283, 2016.
[6] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 3431–3440, 2015.
[7] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 4, pp. 834–848, 2017.
[8] Z. Tian, T. He, C. Shen, and Y. Yan, “Decoders matter for semantic segmentation: Data-dependent decoding enables flexible feature aggregation,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 3126–3135, 2019.
[9] T. He, C. Shen, Z. Tian, D. Gong, C. Sun, and Y. Yan, “Knowledge adaptation for efficient semantic segmentation,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 578–587, 2019.
[10] Y. Liu, C. Shu, J. Wang, and C. Shen, “Structured knowledge distillation for dense prediction,” IEEE Trans. Pattern Anal. Mach. Intell., 2020.
[11] Z. Tian, C. Shen, H. Chen, and T. He, “FCOS: Fully convolutional one-stage object detection,” in Proc. IEEE Int. Conf. Comp. Vis., pp. 9627–9636, 2019.
[12] F. Liu, C. Shen, G. Lin, and I. Reid, “Learning depth from single monocular images using deep convolutional neural fields,” IEEE Trans. Pattern Anal. Mach. Intell., 2016.
[13] W. Yin, Y. Liu, C. Shen, and Y. Yan, “Enforcing geometric constraints of virtual normal for depth prediction,” in Proc. IEEE Int. Conf. Comp. Vis., 2019.
[14] J. Bian, Z. Li, N. Wang, H. Zhan, C. Shen, M.-M. Cheng, and I. Reid, “Unsupervised scale-consistent depth and ego-motion learning from monocular video,” in Proc. Advances in Neural Inf. Process. Syst., pp. 35–45, 2019.
[15] L. Boominathan, S. Kruthiventi, and R. V. Babu, “Crowdnet: A deep convolutional network for dense crowd counting,” in Proc. ACM Int. Conf. Multimedia, pp. 640–644, ACM, 2016.
[16] X. Jia, B. De Brabandere, T. Tuytelaars, and L. V. Gool, “Dynamic filter networks,” in Proc. Advances in Neural Inf. Process. Syst., pp. 667–675, 2016.
[17] B. Yang, G. Bender, Q. V. Le, and J. Ngiam, “Condconv: Conditionally parameterized convolutions for efficient inference,” in Proc. Advances in Neural Inf. Process. Syst., pp. 1305–1316, 2019.
[18] A. Kirillov, K. He, R. Girshick, C. Rother, and P. Dollár, “Panoptic segmentation,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 9404–9413, 2019.
[19] L. Zhang, D. Xu, A. Arnab, and P. H. Torr, “Dynamic graph message passing networks,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., June 2020.
[20] K. Chen, J. Pang, J. Wang, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Shi, W. Ouyang, et al., “Hybrid task cascade for instance segmentation,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 4974–4983, 2019.
[21] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, “Path aggregation network for instance segmentation,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 8759–8768, 2018.
[22] Z. Huang, L. Huang, Y. Gong, C. Huang, and X. Wang, “Mask scoring R-CNN,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 6409–6418, 2019.
[23] J. Dai, K. He, Y. Li, S. Ren, and J. Sun, “Instance-sensitive fully convolutional networks,” in Proc. Eur. Conf. Comp. Vis., pp. 534–549, Springer, 2016.
[24] D. Neven, B. D. Brabandere, M. Proesmans, and L. V. Gool, “Instance segmentation by jointly optimizing spatial embeddings and clustering bandwidth,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 8837–8845, 2019.
[25] A. Newell, Z. Huang, and J. Deng, “Associative embedding: End-to-end learning for joint detection and grouping,” in Proc. Advances in Neural Inf. Process. Syst., pp. 2277–2287, 2017.
[26] A. Fathi, Z. Wojna, V. Rathod, P. Wang, H. O. Song, S. Guadarrama, and K. P. Murphy, “Semantic instance segmentation via deep metric learning,” arXiv: Comp. Res. Repository, 2017.
[27] M. Bai and R. Urtasun, “Deep watershed transform for instance segmentation,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 5221–5229, 2017.
[28] S. Liu, J. Jia, S. Fidler, and R. Urtasun, “Sgn: Sequential grouping networks for instance segmentation,” in Proc. IEEE Int. Conf. Comp. Vis., pp. 3496–3504, 2017.
[29] J. Uhrig, E. Rehder, B. Fröhlich, U. Franke, and T. Brox, “Box2pix: Single-shot instance segmentation by assigning pixels to object boxes,” in Proc. IEEE Intelligent Vehicles Symp., pp. 292–299, IEEE, 2018.
[30] D. Novotny, S. Albanie, D. Larlus, and A. Vedaldi, “Semi-convolutional operators for instance segmentation,” in Proc. Eur. Conf. Comp. Vis., pp. 86–102, 2018.
[31] A. Arnab and P. Torr, “Pixelwise instance segmentation with a dynamically instantiated network,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 441–450, 2017.
[32] H. Chen, K. Sun, Z. Tian, C. Shen, Y. Huang, and Y. Yan, “BlendMask: Top-down meets bottom-up for instance segmentation,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2020.
[33] X. Wang, T. Kong, C. Shen, Y. Jiang, and L. Li, “SOLO: Segmenting objects by locations,” in Proc. Eur. Conf. Comp. Vis., 2020.
[34] X. Wang, R. Zhang, T. Kong, L. Li, and C. Shen, “SOLOv2: Dynamic and fast instance segmentation,” in Proc. Advances in Neural Inf. Process. Syst., 2020.
[35] E. Xie, P. Sun, X. Song, W. Wang, D. Liang, C. Shen, and P. Luo, “PolarMask: Single shot instance segmentation with polar representation,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2020.
[36] T.-J. Yang, M. D. Collins, Y. Zhu, J.-J. Hwang, T. Liu, X. Zhang, V. Sze, G. Papandreou, and L.-C. Chen, “Deeperlab: Single-shot image parser,” arXiv preprint arXiv:1902.05093, 2019.
[37] Q. Li, A. Arnab, and P. H. Torr, “Weakly- and semi-supervised panoptic segmentation,” in Proc. Eur. Conf. Comp. Vis., pp. 102–118, 2018.

[38] A. Kirillov, R. Girshick, K. He, and P. Dollár, “Panoptic feature pyramid networks,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 6399–6408, 2019.
[39] H. Wang, Y. Zhu, B. Green, H. Adam, A. Yuille, and L.-C. Chen,
“Axial-DeepLab: Stand-alone axial-attention for panoptic segmen-
tation,” in Proc. Eur. Conf. Comp. Vis., 2020.
[40] K. Sofiiuk, O. Barinova, and A. Konushin, “Adaptis: Adaptive
instance selection network,” in Proc. IEEE Int. Conf. Comp. Vis.,
pp. 7355–7363, 2019.
[41] E. Perez, F. Strub, H. De Vries, V. Dumoulin, and A. Courville,
“FiLM: Visual reasoning with a general conditioning layer,” in
Proc. AAAI Conf. Artificial Intell., 2018.
[42] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Be-
longie, “Feature pyramid networks for object detection,” in Proc.
IEEE Conf. Comp. Vis. Patt. Recogn., pp. 2117–2125, 2017.
[43] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan,
P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in
context,” in Proc. Eur. Conf. Comp. Vis., pp. 740–755, Springer, 2014.
[44] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards
real-time object detection with region proposal networks,” in Proc.
Advances in Neural Inf. Process. Syst., pp. 91–99, 2015.
[45] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in
deep convolutional networks for visual recognition,” IEEE Trans.
Pattern Anal. Mach. Intell., vol. 37, no. 9, pp. 1904–1916, 2015.
[46] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for
dense object detection,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn.,
pp. 2980–2988, 2017.
[47] F. Milletari, N. Navab, and S.-A. Ahmadi, “V-net: Fully convolu-
tional neural networks for volumetric medical image segmenta-
tion,” in Proc. Int. Conf. 3D Vision, pp. 565–571, IEEE, 2016.
[48] Z. Tian, C. Shen, and H. Chen, “Conditional convolutions for
instance segmentation,” in Proc. Eur. Conf. Comp. Vis., 2020.
[49] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Im-
agenet: A large-scale hierarchical image database,” in Proc. IEEE
Conf. Comp. Vis. Patt. Recogn., pp. 248–255, Ieee, 2009.
[50] X. Chen, R. Girshick, K. He, and P. Dollár, “Tensormask: A founda-
tion for dense object segmentation,” in Proc. IEEE Int. Conf. Comp.
Vis., pp. 2061–2069, 2019.
[51] K. He, R. Girshick, and P. Dollár, “Rethinking imagenet pre-
training,” in Proc. IEEE Int. Conf. Comp. Vis., pp. 4918–4927, 2019.
[52] D. Bolya, C. Zhou, F. Xiao, and Y. J. Lee, “YOLACT++: Better real-
time instance segmentation,” IEEE Trans. Pattern Anal. Mach. Intell.,
2019.
[53] Z. Tian, C. Shen, H. Chen, and T. He, “FCOS: A simple and strong
anchor-free object detector,” IEEE Trans. Pattern Anal. Mach. Intell.,
2021.
[54] F. Yu, D. Wang, E. Shelhamer, and T. Darrell, “Deep layer aggre-
gation,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 2403–2412,
2018.
[55] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Be-
nenson, U. Franke, S. Roth, and B. Schiele, “The Cityscapes dataset
for semantic urban scene understanding,” in Proc. IEEE Conf.
Comp. Vis. Patt. Recogn., pp. 3213–3223, 2016.
[56] Q. Li, X. Qi, and P. H. S. Torr, “Unifying training and inference
for panoptic segmentation,” in Proc. IEEE Conf. Comp. Vis. Patt.
Recogn., pp. 13320–13328, 2020.
[57] F. Chollet, “Xception: Deep learning with depthwise separa-
ble convolutions,” in Proc. IEEE Conf. Comp. Vis. Patt. Recogn.,
pp. 1251–1258, 2017.
[58] B. Cheng, M. D. Collins, Y. Zhu, T. Liu, T. S. Huang, H. Adam,
and L.-C. Chen, “Panoptic-DeepLab: A simple, strong, and fast
baseline for bottom-up panoptic segmentation,” in Proc. IEEE Conf.
Comp. Vis. Patt. Recogn., 2020.
[59] Y. Li, H. Zhao, X. Qi, L. Wang, Z. Li, J. Sun, and J. Jia, “Fully
convolutional networks for panoptic segmentation,” in Proc. IEEE
Conf. Comp. Vis. Patt. Recogn., 2021.
[60] Y. Xiong, R. Liao, H. Zhao, R. Hu, M. Bai, E. Yumer, and R. Urta-
sun, “Upsnet: A unified panoptic segmentation network,” in Proc.
IEEE Conf. Comp. Vis. Patt. Recogn., pp. 8818–8826, 2019.

Authors’ photograph and biography not available at the time of publication.

APPENDIX A
VISUALIZATION OF RESULTS
Here we provide some visualization results of our model. Fig. A7 and Fig. A8 show segmentation results of our model on COCO for instance segmentation and panoptic segmentation, respectively.
Fig. A9 shows some cases where our model does not work very well for instance segmentation. In some cases, the COCO annotation is noisy, which may have caused confusion for our model. For example, for the third example in Fig. A9, the sailboat is incorrectly annotated. Occlusion in the last example also causes challenges.
Fig. A10 shows some panoptic segmentation results on which our model does not perform well.

Fig. A7 – More visualization of instance segmentation results on the COCO dataset (better viewed on screen). Color encodes
categories and instances. Here the model is ResNet-101-DCN with BiFPN.

Fig. A8 – More visualization of panoptic segmentation results on the COCO dataset (better viewed on screen). Here the model
is ResNet-101-DCN with standard FPN.

Fig. A9 – Some instance segmentation results that our model does not work very well, on the COCO dataset (better viewed
on screen). Left to right: input image, ground-truth labels, model’s predictions. In some cases (e.g., the last two examples), the
ground-truth annotation is incorrect or noisy.

Fig. A10 – Some panoptic segmentation results that our model does not work very well, on the COCO dataset (better viewed
on screen). Left to right: input image, ground-truth labels, model’s predictions. On those challenging cases, our model makes
plausible mistakes.
