CondInst: Dynamic Convolutions for Segmentation
Abstract—We propose a simple yet effective framework for instance and panoptic segmentation, termed CondInst (conditional
convolutions for instance and panoptic segmentation). In the literature, top-performing instance segmentation methods typically follow
the paradigm of Mask R-CNN and rely on ROI operations (typically ROIAlign) to attend to each instance. In contrast, we propose to
attend to the instances with dynamic conditional convolutions. Instead of using instance-wise ROIs as inputs to the instance mask head
of fixed weights, we design dynamic instance-aware mask heads, conditioned on the instances to be predicted. CondInst enjoys three
advantages: 1) Instance and panoptic segmentation are unified into a fully convolutional network, eliminating the need for ROI cropping
and feature alignment. 2) The elimination of the ROI cropping also significantly improves the output instance mask resolution. 3) Due to
the much improved capacity of dynamically-generated conditional convolutions, the mask head can be very compact (e.g., 3 conv.
layers, each having only 8 channels), leading to significantly faster inference time per instance and making the overall inference time
nearly independent of the number of instances. We demonstrate a simpler method that achieves improved accuracy and inference speed
on both instance and panoptic segmentation tasks. On the COCO dataset, we outperform a few state-of-the-art methods. We hope that
CondInst can be a strong baseline for instance and panoptic segmentation. Code is available at: https://git.io/AdelaiDet
Index Terms—Fully convolutional networks, conditional convolutions, instance segmentation, panoptic segmentation
Fig. 1 – The core idea of CondInst: for each instance to be predicted, a dedicated mask head (a small stack of conv layers) is dynamically generated, so an image with K instances yields K instance-specific mask heads.

Fig. 2 – Qualitative comparisons with other methods. We compare the proposed CondInst against YOLACT [1] and Mask R-CNN [2]. Our masks are generally of higher quality (e.g., preserving finer details). Best viewed on screen.
1 INTRODUCTION

Instance segmentation requires predicting a pixel-level mask together with a category label for each instance of interest in an image. The dominant method tackling this challenge is still the two-stage approach exemplified by Mask R-CNN [2], which casts instance segmentation into a detection-and-segmentation task: an object detector first localizes each instance, and ROI operations then crop the instance's features and resize the cropped regions into patches of the same size. For instance, Mask R-CNN resizes all the cropped regions to 14×14 (upsampled to 28×28 using a deconvolution), which restricts the output resolution of instance segmentation, as large instances would require higher resolutions to retain the details at their boundaries.

In computer vision, the closest task to instance segmentation is semantic segmentation, for which fully convolutional networks (FCNs) have shown dramatic success [6]–[10]. FCNs have also shown excellent performance on many other per-pixel prediction tasks, ranging from low-level image processing such as denoising and super-resolution, to mid-level tasks such as optical flow estimation and contour detection, and to high-level tasks including recent single-shot object detection [11], monocular depth estimation [12]–[14] and counting [15]. However, almost all the instance segmentation methods based on FCNs1 lag behind state-of-the-art ROI-based methods. Why do the versatile FCNs perform unsatisfactorily on instance segmentation? The major difficulty of applying FCNs to instance segmentation is that similar image appearance may require different predictions, and FCNs struggle at achieving this: they tend to yield similar predictions for similar image appearance, so the vanilla FCNs are incapable of distinguishing individual instances. For example, if two persons A and B with similar appearance are in an input image, then when predicting the instance mask of A, the FCN needs to predict B as background w.r.t. A, which can be difficult as the two look similar in appearance. Therefore, an ROI operation is used to crop the person of interest, e.g., A, and to filter out B; essentially, this is the core operation making the model attend to an instance. Put differently, instance segmentation needs two types of information: 1) appearance information to categorize objects; and 2) location information to distinguish multiple objects belonging to the same category. Almost all methods rely on ROI cropping, which explicitly encodes the location information of instances. In contrast, CondInst exploits the location information by using location/instance-sensitive convolution filters as well as relative coordinates that are appended to the feature maps.

1. By FCNs, we mean the vanilla FCNs in [6] that only involve convolutions and pooling.

In this work, we advocate a new solution that uses instance-aware FCNs for instance mask prediction, termed CondInst. Instead of using a standard ConvNet with a fixed set of convolutional filters as the mask head for predicting all instances, the network parameters in our mask head are adapted according to the instance to be predicted. Inspired by dynamic filter networks [16] and CondConv [17], for each instance, a controller sub-network (see Fig. 3) dynamically generates the mask head's filters, conditioned on the center area of the instance; the resulting mask head is then used to predict the mask of this instance. It is expected that the generated network parameters can encode the characteristics (e.g., relative position, shape and appearance) of this instance, so that the mask head only fires on the pixels of this instance, which thus bypasses the difficulty of the standard FCNs. These conditional mask heads are applied to the whole high-resolution feature maps, thus eliminating the need for ROI operations. At first glance, the idea may not work well, as instance-wise mask heads may incur a large number of network parameters, given that some images contain as many as dozens of instances. However, as the mask head's filters are only asked to predict the mask of a single instance, the learning requirement is largely eased, which reduces the load on the filters. As a result, the mask head can be extremely light-weight: a very compact FCN mask head with dynamically-generated filters can already outperform the previous ROI-based Mask R-CNN, and this compact mask head also results in much lower computational complexity per instance than the mask head in Mask R-CNN.

We summarize our main contributions as follows.

• We attempt to solve instance segmentation from a new perspective that uses dynamic mask heads. This novel solution achieves improved instance segmentation performance over existing methods such as Mask R-CNN while being faster. To our knowledge, this is the first time that a new instance segmentation framework outperforms recent state-of-the-art methods both in accuracy and speed.

• Unlike previous methods, in which the filters in the mask head are fixed for all the instances once trained, the filters in our mask head are dynamically generated and conditioned on instances. As each group of filters only predicts the mask of one instance, the mask head can be extremely light-weight, significantly reducing the inference time per instance.

• CondInst is fully convolutional and avoids the aforementioned resizing operation used in many existing
methods, as CondInst does not rely on ROI operations. Not having to resize the feature maps leads to high-resolution instance masks with more accurate edges, as shown in Fig. 2.

• Since the mask head in CondInst is very compact and light-weight, compared with the box detector FCOS, CondInst needs only ∼10% more computational time (less than 5 milliseconds) to obtain the mask results of all the instances, even when processing the maximum number of instances per image (i.e., 100 instances). As a result, the overall inference time is stable, as it almost does not depend on the number of instances in the image.

• With an extra semantic segmentation branch, CondInst can be easily extended to panoptic segmentation [18], resulting in a unified fully convolutional network for both instance and panoptic segmentation tasks.

• CondInst achieves state-of-the-art performance on both instance and panoptic segmentation tasks while being fast and simple. We hope that CondInst can be a new strong alternative for instance and panoptic segmentation tasks, as well as other instance-level recognition tasks such as keypoint detection.

2 RELATED WORK

Here we review some work that is most relevant to ours.

Conditional Convolutions/Dynamic filters. Unlike traditional convolutional layers, which have fixed filters once trained, the filters of conditional convolutions are conditioned on the input and are dynamically generated by another network (i.e., a controller). This idea has been explored previously in dynamic filter networks [16] and CondConv [17], mainly for the purpose of increasing the capacity of a classification network. DGMN [19] also employs dynamic filters to generate the node-specific filters for message calculation, which improves the capacity of the networks and thus results in better performance. In this work, we extend this idea to generate the mask head's filters conditioned on each instance, and present a high-performance instance segmentation method without the need for ROIs.

Instance Segmentation. To date, the dominant framework for instance segmentation is still Mask R-CNN. Mask R-CNN first employs an object detector to detect the bounding-boxes of instances (i.e., ROIs). With these bounding-boxes, an ROI operation is used to crop the features of the instance from the feature maps. Finally, an FCN head is used to obtain the desired instance masks. Many works [20]–[22] with top performance are built on Mask R-CNN. Moreover, some works have explored applying the standard FCNs [6] to instance segmentation. InstanceFCN [23] may be the first fully convolutional instance segmentation method. InstanceFCN proposes to predict position-sensitive score maps with vanilla FCNs; afterwards, these score maps are assembled to obtain the desired instance masks. Note that InstanceFCN does not work well with overlapping instances. Others [24]–[26] attempt to first perform image segmentation, and the desired instance masks are then formed by assembling the pixels of the same instance. Deep Watershed [27] models instance segmentation with the classical watershed transform, where object instances can be viewed as the energy basins in the energy map of the watershed transform of an image. SGN [28] uses a sequence of networks to gradually group the raw pixels into line segments, connected components, and finally object instances, achieving impressive performance. The single-shot Box2Pix [29] solves instance segmentation in the bottom-up fashion. Novotny et al. [30] propose semi-convolutional operators to make FCNs applicable to instance segmentation. Arnab et al. [31] propose a dynamically instantiated CRF (Conditional Random Field) for instance segmentation, which is able to produce a variable number of instances per image. To our knowledge, thus far none of these methods can outperform Mask R-CNN both in accuracy and speed on the COCO benchmark dataset. The recent YOLACT [1] and BlendMask [32] may be viewed as reformulations of Mask R-CNN, which decouple ROI detection from the feature maps used for mask prediction. Wang et al. developed a simple FCN-based instance segmentation method, which segments the instances by their locations, showing competitive performance [33], [34]. PolarMask [35] developed a new simple mask representation for instance segmentation, which extends the bounding-box detector FCOS [11].

Panoptic segmentation. There are two main approaches to solving this task. The first one is the bottom-up approach, which tackles the task as semantic segmentation at first and then uses clustering/grouping methods to assemble the pixels into individual instances or stuff [36], [37]. The authors of [37] also explore weakly- or semi-supervised panoptic segmentation. The second approach is the top-down approach, which is often built on top-down instance segmentation methods. For example, Panoptic-FPN [38] extends Mask R-CNN with an additional semantic segmentation branch and combines its results with the instance segmentation results generated by Mask R-CNN [18]. Moreover, attention-based methods have recently gained much popularity in many computer vision tasks and provide a new approach to panoptic segmentation. Axial-DeepLab [39] used a carefully designed module to enable attention to be applied to large-size images for panoptic segmentation. CondInst can easily be applied to panoptic segmentation following the top-down approaches. We empirically observe that the quality of the instance segmentation results may be the dominant factor for the final performance. Thus, in CondInst, without bells and whistles, by simply applying the same method used by Panoptic-FPN, the panoptic segmentation performance of CondInst is already competitive with the state-of-the-art panoptic segmentation methods.

Additionally, AdaptIS [40] recently proposes to solve panoptic segmentation with FiLM [41]. The idea shares some similarity with CondInst in that information about an instance is encoded in the coefficients generated by FiLM. However, since only the batch normalization coefficients are dynamically generated, AdaptIS needs a large mask head to achieve good performance. In contrast, CondInst directly encodes the instance information into the conv. filters of the mask head, which is much more straightforward and efficient. Also, as shown in our experiments, CondInst achieves much better panoptic segmentation accuracy than AdaptIS, which suggests that CondInst is much more effective.
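To make the contrast with FiLM-based conditioning concrete, the following minimal PyTorch sketch compares the two mechanisms. Module names and shapes are ours, not code from the paper, AdaptIS, or AdelaiDet:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FiLMConv(nn.Module):
    """AdaptIS/FiLM-style: conv weights stay fixed; only per-channel
    scale and bias are generated from the instance embedding z."""
    def __init__(self, in_ch, out_ch, z_dim):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 1)
        self.film = nn.Linear(z_dim, 2 * out_ch)

    def forward(self, x, z):  # x: (1, in_ch, H, W); z: (z_dim,)
        gamma, beta = self.film(z).chunk(2, dim=-1)
        y = self.conv(x)
        return y * gamma[:, None, None] + beta[:, None, None]

class DynamicConv(nn.Module):
    """CondInst-style: the conv filters themselves are generated from z."""
    def __init__(self, in_ch, out_ch, z_dim):
        super().__init__()
        self.in_ch, self.out_ch = in_ch, out_ch
        self.controller = nn.Linear(z_dim, out_ch * in_ch + out_ch)

    def forward(self, x, z):  # x: (1, in_ch, H, W); z: (z_dim,)
        theta = self.controller(z)
        w = theta[:self.out_ch * self.in_ch].view(self.out_ch, self.in_ch, 1, 1)
        b = theta[self.out_ch * self.in_ch:]
        return F.conv2d(x, w, b)
```

In FiLMConv, only 2·out_ch of the numbers applied to the features depend on the instance, which is consistent with AdaptIS needing a large mask head; in DynamicConv, every weight does, so the head itself can stay tiny.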
Fig. 3 – The overall architecture of CondInst. C3, C4 and C5 are the feature maps of the backbone network (e.g., ResNet-50). P3 to P7 are the FPN feature maps as in [11], [42]. F_bottom is the bottom branch's output, whose resolution is the same as that of P3. Following [32], the bottom branch aggregates the feature maps P3, P4 and P5. F̃_bottom is obtained by concatenating the relative coordinates to F_bottom. The classification head predicts the class probability p_{x,y} of the target instance at location (x, y), same as in FCOS. The controller generates the filter parameters θ_{x,y} of the mask head for the instance. Similar to FCOS, there are also center-ness and box heads in parallel with the controller (not shown in the figure for simplicity). Note that the heads in the dashed box are repeatedly applied to P3, ..., P7. The mask head is instance-aware, and is applied to F̃_bottom as many times as the number of instances in the image (refer to Fig. 1).
3 OUR METHODS: INSTANCE AND PANOPTIC SEGMENTATION WITH CONDINST

We first present CondInst for instance segmentation, and then we show how the instance segmentation framework can be easily extended to panoptic segmentation with a new semantic branch.

3.1 Overall Architecture for Instance Segmentation

Given an input image I ∈ R^{H×W×3}, the goal of instance segmentation is to predict the pixel-level mask and the category of each instance of interest in the image. The ground-truths are defined as {(M_i, c_i)}, where M_i ∈ {0, 1}^{H×W} is the mask for the i-th instance and c_i ∈ {1, 2, ..., C} is the category. C is 80 on MS-COCO [43]. In semantic segmentation, the prediction target of each pixel is well-defined: it is the semantic category of the pixel. In addition, the number of categories is known and fixed. Thus, the outputs of semantic segmentation can be easily represented with the output feature maps of the FCNs, where each channel of the output feature maps corresponds to a class. However, in instance segmentation, the prediction target of each pixel is hard to define, because instance segmentation also requires distinguishing individual instances, while the number of instances changes across images. This poses a major challenge when applying traditional FCNs [6] to instance segmentation.

In this work, our core idea is that for an image with K instances, K different mask heads will be dynamically generated, and each mask head will contain the characteristics of its target instance in its filters. As a result, when a mask head is applied to the input, it will only fire on the pixels of its instance, thus producing the mask prediction of the instance and distinguishing individual instances. We illustrate the process in Fig. 1. The instance-aware filters are generated by modifying an object detector. Specifically, we add a new controller branch to generate the filters for the target instance of each box predicted by the detector, as shown in Fig. 3. Therefore, the number of the dynamic mask heads is the same as the number of the predicted boxes, which should be the number of the instances in the image if the detector works well. In this work, we build CondInst on the popular object detector FCOS [11] due to its simplicity and flexibility. Moreover, the elimination of anchor-boxes in FCOS also saves parameters and computation.

As shown in Fig. 3, following FCOS [11], we make use of the feature maps {P3, P4, P5, P6, P7} of the feature pyramid network (FPN) [42], whose down-sampling ratios are 8, 16, 32, 64 and 128, respectively. On each feature level of the FPN, some functional layers (in the dashed box) are applied to make instance-aware predictions, for example, the class of the target instance and the dynamically-generated filters for the instance. In this sense, CondInst can be viewed as the same as Mask R-CNN, in that both first attend to the instances in an image and then predict the pixel-level masks of the instances (i.e., instance-first).

Moreover, recall that Mask R-CNN employs an object detector to predict the bounding-boxes of the instances in the input image, and the bounding-boxes are actually the way that Mask R-CNN represents instances. Similarly, CondInst employs the instance-aware filters to represent the instances. In other words, instead of encoding the instance information with the bounding-boxes, CondInst implicitly encodes it with the parameters of the generated dynamic filters, which is much more flexible. For example, the dynamic filters can easily represent irregular shapes that are hard to tightly enclose with a bounding-box (elaborated in Sec. 4.4). This is one of CondInst's advantages over the previous ROI-based methods.

Besides the detector, as shown in Fig. 3, there is also a bottom branch, which provides the feature maps (denoted by F_bottom) that our generated mask heads take as inputs to predict the desired instance masks. The bottom branch aggregates the FPN feature maps P3, P4 and P5. To be specific, P4 and P5 are upsampled to the resolution of P3 with bilinear interpolation and added to P3. After that, four 3×3 convolutions with 128 channels are applied. The resolution of the resulting feature maps is the same as that of P3 (i.e., 1/8 of the input image resolution). Finally, another convolutional layer is used to reduce the number of output channels C_bottom from 128 to 8, resulting in the bottom feature F_bottom. The small number of output channels reduces the number of generated parameters. We empirically found that using C_bottom = 8 can already achieve good performance; as shown in our experiments, a larger C_bottom (e.g., 16) cannot improve the performance. Even more aggressively, using C_bottom = 1 only degrades the performance by ∼1% in mask AP. This is probably because our mask heads only predict relatively simple class-agnostic instance masks, and most of the information of an instance has already been encoded in the dynamically-generated filters.

As mentioned before, the generated filters can also encode the shape and position of the target instance. Since CNN feature maps do not generally convey position information, a map of coordinates needs to be appended to F_bottom such that the generated filters are aware of positions. As the filters are generated with location-agnostic convolutions, they can only (implicitly) encode the shape and position with coordinates relative to the location where the filters are generated (i.e., using the coordinate system with that location as the origin). Thus, as shown in Fig. 3, F_bottom is combined with a map of relative coordinates, which is obtained by transforming all the locations on F_bottom into the coordinate system whose origin is the location generating the filters. Then, the combination is sent to the mask head to predict the instance mask in the fully convolutional fashion. The relative coordinates provide a strong cue for predicting the instance mask, as shown in our experiments. It is also interesting to note that even if the generated mask heads take only the map of relative coordinates as input, a modest performance can be obtained, as shown in the experiments. This empirically proves that the generated filters indeed encode the shape and position of the target instance. Finally, sigmoid is used as the last layer of the mask head to obtain the mask scores. The mask head only classifies the pixels as foreground or background; the class of the instance is predicted by the classification head of the detector, as shown in Fig. 3.

The resolution of the original mask prediction is the same as the resolution of F_bottom, which is 1/8 of the input image resolution. In order to improve the resolution of the instance masks, we use bilinear interpolation to upsample the mask prediction by 2, resulting in 200×256 instance masks (if the input image size is 800×1024). The masks' resolution is much higher than that of Mask R-CNN (only 28×28, as mentioned before).

3.2 Network Outputs and Training Targets

Similar to FCOS, each location on the FPN's feature maps P_i is either associated with an instance, thus being a positive sample, or considered a negative sample. The associated instance and label for each location are determined as follows.

Let us consider the feature maps P_i ∈ R^{H×W×C} and let s be its down-sampling ratio. As shown in previous works [11], [44], [45], a location (x, y) on the feature maps can be mapped back onto the input image as (⌊s/2⌋ + xs, ⌊s/2⌋ + ys). If the mapped location falls in the center region of an instance, the location is considered to be responsible for the instance. Any locations outside the center regions are labeled as negative samples. The center region is defined as the box (c_x − rs, c_y − rs, c_x + rs, c_y + rs), where (c_x, c_y) denotes the mass center of the instance mask, s is the down-sampling ratio of P_i, and r is a constant scalar being 1.5, as in FCOS [11]. As shown in Fig. 3, at a location (x, y) on P_i, CondInst has the following output heads.

Classification Head. The classification head predicts the class of the instance associated with the location. The ground-truth target is the instance's class c_i or 0 (i.e., background). As in FCOS, the network predicts a C-dimensional vector p_{x,y} for the classification, where each dimension of p_{x,y} corresponds to a binary classifier and C is the number of categories.

Controller Head. The controller head, which has the same architecture as the classification head, is used to predict the parameters of the conv. filters of the mask head for the instance at the location; the mask head then predicts the mask for this particular instance. This is the core contribution of our work. To predict the parameters, we concatenate all the parameters of the filters (i.e., weights and biases) together as an N-dimensional vector θ_{x,y}, where N is the total number of parameters. Accordingly, the controller head has N output channels. The mask head is a very compact FCN architecture: it has three 1×1 convolutions, each having 8 channels and using ReLU as the activation function, except for the last one. No normalization layer such as batch normalization is used here. The last layer has 1 output channel and uses sigmoid to predict the probability of being foreground. The mask head has 169 parameters in total: #weights = (8+2)×8 (conv1) + 8×8 (conv2) + 8×1 (conv3), and #biases = 8 (conv1) + 8 (conv2) + 1 (conv3). The masks predicted by the mask heads are supervised with the ground-truth instance masks, which pushes the controller to generate the correct filters.

Box Head. The box head is the same as that in FCOS; it predicts a 4-D vector encoding the four distances from the location to the four boundaries of the bounding-box of the target instance. Conceptually, CondInst could eliminate the box head, since CondInst needs no ROIs. However, we note that if we make use of box-based NMS, the inference time will be much reduced, since we only need to compute the masks for the instances kept after box NMS. Thus, we still predict boxes in CondInst. We would like to highlight that the predicted boxes are only used in NMS and do not involve computing the mask loss.
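As a concrete reading of the controller/mask-head parameterization above, here is a minimal single-instance PyTorch sketch. The constants come from the text (C_bottom = 8, two relative-coordinate channels, width 8, depth 3, 169 parameters in total); the function names and the single-instance form are our own simplification, not the released implementation:

```python
import torch
import torch.nn.functional as F

C_BOTTOM, WIDTH = 8, 8              # bottom-feature channels, mask-head width
IN_CH = C_BOTTOM + 2                # +2 relative-coordinate channels

def split_params(theta):
    """Split the 169-D vector into (weight, bias) pairs of three 1x1 convs."""
    sizes = [IN_CH * WIDTH, WIDTH,  # conv1: 10*8 weights + 8 biases
             WIDTH * WIDTH, WIDTH,  # conv2:  8*8 weights + 8 biases
             WIDTH, 1]              # conv3:  8*1 weights + 1 bias
    w1, b1, w2, b2, w3, b3 = theta.split(sizes)
    return [(w1.view(WIDTH, IN_CH, 1, 1), b1),
            (w2.view(WIDTH, WIDTH, 1, 1), b2),
            (w3.view(1, WIDTH, 1, 1), b3)]

def apply_mask_head(feat, theta):
    """feat: (1, 10, H, W) = F_bottom plus rel. coords; theta: (169,)."""
    layers = split_params(theta)
    x = feat
    for i, (w, b) in enumerate(layers):
        x = F.conv2d(x, w, b)
        if i < len(layers) - 1:     # ReLU everywhere except the last layer
            x = F.relu(x)
    return torch.sigmoid(x)         # (1, 1, H, W) foreground probability

# Sanity check: 80 + 8 + 64 + 8 + 8 + 1 = 169 parameters, as stated above.
assert sum([IN_CH * WIDTH, WIDTH, WIDTH * WIDTH, WIDTH, WIDTH, 1]) == 169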
During training, the locations used for computing the mask loss are limited to up to 500 per GPU (i.e., 250 per image, as we have two images on one GPU). If there are more than 500 positive locations, 500 of them will be randomly chosen. In this version, instead of randomly choosing the 500 locations, we first rank the locations by the scores predicted by the FCOS detector, and then choose the locations with the top scores for each instance. As a result, the number of locations per image can be reduced to 64. This strategy works equally well and further reduces the memory footprint; for instance, using this strategy, the ResNet-50 based CondInst can be trained with four 1080Ti GPUs.

Moreover, as shown in YOLACT [1] and BlendMask [32], during training, the instance segmentation task can benefit from a joint semantic segmentation task (i.e., using the instance masks as semantic labels). Thus, we also conduct experiments with the joint semantic segmentation task, showing improved performance. However, unless explicitly specified, all the experiments in the paper are without the semantic segmentation task. If used, the semantic segmentation loss is added to L_overall.

3.4 Inference

Instance Segmentation. Given an input image, we forward it through the network to obtain the outputs, including the classification confidence p_{x,y}, the center-ness scores, the box prediction t_{x,y} and the generated parameters θ_{x,y}. We first follow the steps in FCOS to obtain the box detections. Afterwards, box-based NMS with the threshold being 0.6 is used to remove duplicated detections, and then the top 100 boxes are used to compute masks. Note that each box is also associated with a group of filters generated by the controller. Let us assume that K boxes remain after the NMS; thus we have K groups of generated filters. The K groups of filters are used to produce K instance-specific mask heads. These instance-specific mask heads are applied, in the fashion of FCNs, to F̃_{x,y} (i.e., the combination of F_bottom and the relative-coordinate map O_{x,y}) to predict the masks of the instances. Since the mask head is a very compact network (having three 1×1 convolutions with 8 channels and 169 parameters in total), the overhead of computing masks is extremely small. For example, even with 100 detections (i.e., the maximum number of detections per image on MS-COCO), less than 5 milliseconds in total are spent on the mask heads, which only adds ∼10% computational time to the base detector FCOS. In contrast, the mask head of Mask R-CNN has four 3×3 convolutions with 256 channels, thus having more than 2.3M parameters and taking longer computational time.
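In practice, the K mask heads need not be applied one by one. A common trick for dynamic filters, sketched below as one possible realization rather than necessarily the released code, is to stack the K generated heads and evaluate them in a single grouped convolution:

```python
import torch
import torch.nn.functional as F

def run_mask_heads(x, weights, biases):
    """x: (1, K*10, H, W), i.e., K copies of F_bottom, each with its own
    relative-coordinate channels. weights[i]: (K*out_ch_i, in_ch_i, 1, 1)
    and biases[i]: (K*out_ch_i,) hold the K generated heads per layer."""
    num_layers = len(weights)
    K = biases[-1].numel()              # last layer: one output channel per head
    for i, (w, b) in enumerate(zip(weights, biases)):
        x = F.conv2d(x, w, b, groups=K) # group g is instance g's private conv
        if i < num_layers - 1:
            x = F.relu(x)
    return torch.sigmoid(x)             # (1, K, H, W): one mask per instance
```

Because all K heads run in one kernel launch, the cost grows gracefully with the number of instances, which is consistent with the stable overall inference time reported below.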
Panoptic Segmentation. For panoptic segmentation, we follow [38] to combine the instance and semantic results into the panoptic results. We first rank the instance results from CondInst by their confidence scores generated by FCOS. The results with scores less than 0.45 are discarded. When overlaps occur between the instance masks, the overlapping areas are attributed to the instance with the higher score. Moreover, an instance that loses more than 40% of its total area due to the overlaps with other higher-scoring instances is discarded. Finally, the semantic results are filled into the areas that are not occupied by any instance.

4 EXPERIMENTS

We evaluate CondInst on the large-scale benchmark MS-COCO [43]. Following the common practice [2], [11], [46], our models are trained with the split train2017 (115K images) and all the ablation experiments are evaluated on the split val2017 (5K images). Our main results are reported on the test-dev split (20K images).

4.1 Implementation Details

Unless specified otherwise, we make use of the following implementation details. Following FCOS [11], ResNet-50 is used as our backbone network and the weights pre-trained on ImageNet [49] are used to initialize it. The newly added layers are initialized as in [11]. Our models are trained with stochastic gradient descent (SGD) over 8 V100 GPUs for 90K iterations with the initial learning rate being 0.01 and a mini-batch of 16 images. The learning rate is reduced by a factor of 10 at iterations 60K and 80K, respectively. Weight decay and momentum are set to 0.0001 and 0.9, respectively. Following Detectron2 [3], the input images are resized to have their shorter sides in [640, 800] and their longer sides less than or equal to 1333 during training. Left-right flipping data augmentation is also used during training. When testing, we do not use any data augmentation, and only the scale with the shorter side being 800 is used. The inference time in this work is measured on a single V100 GPU with 1 image per batch.

4.2 Architectures of the Mask Head

In this section, we discuss the design choices of the mask head in CondInst. We show that the performance is not sensitive to the architecture of the mask head. Our baseline is the mask head of three 1×1 convolutions with 8 channels (i.e., width = 8). As shown in Table 1 (3rd row), it achieves 35.6% in mask AP. We first conduct experiments by varying the depth of the mask head. As shown in Table 1a, apart from the mask head with depth being 1, all other mask heads (i.e., depth = 2, 3 and 4) attain similar performance. The mask head with depth being 1 achieves inferior performance, as in this case the mask head is actually a linear mapping, which has overly weak capacity and cannot encode the complex shapes of the instances. Moreover, as shown in Table 1b, varying the width (i.e., the number of channels) does not result in a remarkable performance change either, as long as the width is in a reasonable range. We also note that our mask head is extremely light-weight, as the filters in our mask head are dynamically generated. As shown in Table 1, our baseline mask head only takes 4.5 ms per 100 instances (the maximum number of instances on MS-COCO), which suggests that our mask head only adds small computational overhead to the base detector. Moreover, our baseline mask head has only 169 parameters in total. In sharp contrast, the mask head of Mask R-CNN [2] has more than 2.3M parameters and takes ∼2.5× the computational time (11.4 ms per 100 instances).

4.3 Design Choices of the Bottom Module

We further investigate the impact of the bottom module. We first change C_bottom, which is the number of channels of F_bottom.
(a) Varying the depth (width = 8):

depth  time  AP    AP50  AP75  APS   APM   APL
1      2.2   30.5  52.7  30.7  13.7  32.8  44.9
2      3.3   35.5  56.2  37.9  17.1  38.8  51.2
3      4.5   35.6  56.4  37.9  18.0  38.9  50.8
4      5.6   35.6  56.3  37.8  17.3  38.9  51.0

(b) Varying the width (depth = 3):

width  time  AP    AP50  AP75  APS   APM   APL
2      2.5   33.9  55.3  35.8  15.8  37.0  48.6
4      2.6   35.4  56.3  37.4  16.9  38.7  51.2
8      4.5   35.6  56.4  37.9  18.0  39.1  50.8
16     4.7   35.7  56.1  38.1  16.9  39.0  50.8
TABLE 1 – Instance segmentation results with different architectures of the mask head on the MS-COCO val2017 split. “depth”:
the number of layers in the mask head. “width”: the number of channels of these layers. “time”: the milliseconds that the mask
head takes for processing 100 instances.
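The timings above come with correspondingly tiny parameter counts. For a mask head of depth d ≥ 2 and width w on the 10-channel input (8 bottom channels plus 2 coordinate channels), the number of generated parameters works out to:

```latex
N(d, w) = \underbrace{10w + w}_{\text{conv}_1}
        + \underbrace{(d-2)\,(w^2 + w)}_{\text{middle convs}}
        + \underbrace{w + 1}_{\text{conv}_d},
\qquad N(3, 8) = 88 + 72 + 9 = 169.
```

This matches the 169 parameters of the baseline head, and even the widest variant in Table 1b (w = 16) stays negligibly small next to Mask R-CNN's 2.3M-parameter head.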
TABLE 3 – Ablation study of the input to the mask head on MS-COCO val2017 split. As shown in the table, without the
relative coordinates, the performance drops significantly from 35.6% to 31.5% in mask AP. Using the absolute coordinates
cannot improve the performance remarkably. In addition, it is worth noting that if the mask head only takes as inputs the
relative coordinates (i.e., no appearance features in this case), CondInst also achieves modest performance.
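The three inputs compared in this ablation differ only in what is concatenated to F_bottom. A minimal sketch, with our own naming, and omitting the normalization of the offsets that a real implementation would likely apply:

```python
import torch

def build_mask_head_input(f_bottom, cx, cy, mode="rel"):
    """f_bottom: (8, H, W); (cx, cy): the location on F_bottom that
    generated the filters. mode: 'rel', 'abs', or 'none' as in the ablation."""
    _, H, W = f_bottom.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    if mode == "rel":                  # instance-specific coordinate system
        coords = torch.stack([xs - cx, ys - cy]).float()
    elif mode == "abs":                # identical map for every instance
        coords = torch.stack([xs, ys]).float()
    else:                              # appearance features only
        return f_bottom
    return torch.cat([f_bottom, coords], dim=0)   # (10, H, W)
```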
level  AP    AP50  AP75  APS   APM   APL
P3     35.6  56.4  37.9  18.0  39.1  50.8
P2     36.0  56.6  38.4  17.6  38.9  51.7

TABLE 4 – Instance segmentation results on MS-COCO val2017 split by varying the FPN feature level for the bottom module. Using P2 has better performance but it increases the inference latency by about 20%.

NMS    AP    AP50  AP75  APS   APM   APL
box    35.6  56.4  37.9  18.0  39.1  50.8
mask   35.6  56.5  37.7  18.0  39.1  50.7

TABLE 6 – Instance segmentation results with different NMS algorithms. Mask-based NMS obtains the same overall performance as box-based NMS, which suggests that CondInst can eliminate the box detection.
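Mask-based NMS replaces box IoU with mask IoU but is otherwise the standard greedy procedure. Below is a direct, unoptimized sketch; the paper does not state the mask-NMS threshold, so reusing the 0.6 box-NMS threshold here is our assumption:

```python
import torch

def mask_iou(m1, m2):
    """m1, m2: (H, W) boolean masks."""
    union = (m1 | m2).sum().item()
    return (m1 & m2).sum().item() / union if union > 0 else 0.0

def mask_nms(masks, scores, iou_thresh=0.6):
    """masks: (K, H, W) bool; scores: (K,). Returns kept indices."""
    order = torch.argsort(scores, descending=True).tolist()
    keep = []
    for i in order:
        if all(mask_iou(masks[i], masks[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep
```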
TABLE 7 – Instance segmentation comparisons with state-of-the-art methods on MS-COCO test-dev. “Mask R-CNN” is the
original Mask R-CNN [2]. “Mask R-CNN∗ ” and “BlendMask∗ ” mean that the models are improved by Detectron2 [3]. “aug.”:
using multi-scale data augmentation during training. “sched.”: the learning rate schedule. 1× is 90K iterations, 2× is 180K
iterations and so on. The learning rate is changed as in [51]. “w/ sem”: using the auxiliary semantic segmentation task.
TABLE 9 – Instance segmentation results on Cityscapes val (“AP [val]” column) and test (remaining columns) splits.
“DCN”: using deformable convolutions in the backbones. “+COCO”: fine-tuning from the models pre-trained on COCO.
“train+val+COCO”: using both train and val splits to train the models evaluated on the test split. “w/ sem.”: using the
auxiliary semantic segmentation loss during training as in COCO.
Fig. 6 – Panoptic segmentation results on the COCO dataset (better viewed on screen). Color encodes categories and instances.
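For reference, the merging rules of Sec. 3.4 (rank by detector score, drop results below 0.45, resolve overlaps in favor of the higher score, discard instances that lose more than 40% of their area, then fill the remaining pixels with the semantic results) can be summarized in a short sketch; the tensor layout and the way stuff labels are offset are our own simplifications:

```python
import torch

def merge_panoptic(inst_masks, scores, sem_seg,
                   score_thresh=0.45, max_lost=0.40):
    """inst_masks: (K, H, W) bool; scores: (K,); sem_seg: (H, W) long stuff labels."""
    panoptic = torch.full_like(sem_seg, -1)      # -1 = not yet assigned
    seg_id = 0
    for i in torch.argsort(scores, descending=True).tolist():
        if scores[i] < score_thresh:
            break                                # low-confidence results dropped
        area = inst_masks[i].sum()
        free = inst_masks[i] & (panoptic == -1)  # overlaps kept by higher score
        if free.sum() < (1.0 - max_lost) * area:
            continue                             # lost > 40% of its area: discard
        panoptic[free] = seg_id
        seg_id += 1
    stuff = panoptic == -1
    panoptic[stuff] = sem_seg[stuff] + seg_id    # fill unoccupied areas with stuff
    return panoptic
```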
Compared with recent panoptic segmentation methods such as Panoptic-FCN [59], CondInst also outperforms them considerably. Some qualitative results are shown in Fig. 6. We also conduct experiments on the panoptic segmentation task of Cityscapes [55], where we follow the training strategy of Panoptic-FPN [38]. Similar to previous works [38], [56], [59], we report the results on the Cityscapes val set. As shown in Table 11, we outperform previous methods on this benchmark as well.

5 CONCLUSION

We have proposed a new and simple instance segmentation framework, termed CondInst. Unlike previous methods such as Mask R-CNN, which employ a mask head with fixed weights, CondInst conditions the mask head on instances and dynamically generates the filters of the mask head. This not only reduces the parameters and the computational complexity of the mask head, but also eliminates the ROI operations, resulting in a faster and simpler instance segmentation framework. To our knowledge, CondInst is the first framework that can outperform Mask R-CNN both in accuracy and speed without requiring longer training schedules. With simple modifications, CondInst can be extended to solve panoptic segmentation and achieves state-of-the-art performance on the challenging COCO dataset. We believe that CondInst can be a strong alternative for both instance and panoptic segmentation.

REFERENCES

[1] D. Bolya, C. Zhou, F. Xiao, and Y. J. Lee, "YOLACT: Real-time instance segmentation," in Proc. IEEE Int. Conf. Comp. Vis., pp. 9157–9166, 2019.
[2] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proc. IEEE Int. Conf. Comp. Vis., pp. 2961–2969, 2017.
[3] Y. Wu, A. Kirillov, F. Massa, W.-Y. Lo, and R. Girshick, "Detectron2," https://github.com/facebookresearch/detectron2, 2019.
[4] A. Paszke et al., "PyTorch: An imperative style, high-performance deep learning library," in Proc. Advances in Neural Inf. Process. Syst., pp. 8024–8035, 2019.
[5] M. Abadi et al., "TensorFlow: A system for large-scale machine learning," in USENIX Symp. Operating Systems Design & Implementation, pp. 265–283, 2016.
[6] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 3431–3440, 2015.
[7] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. Yuille, "DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs," IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 4, pp. 834–848, 2017.
[8] Z. Tian, T. He, C. Shen, and Y. Yan, "Decoders matter for semantic segmentation: Data-dependent decoding enables flexible feature aggregation," in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 3126–3135, 2019.
[9] T. He, C. Shen, Z. Tian, D. Gong, C. Sun, and Y. Yan, "Knowledge adaptation for efficient semantic segmentation," in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 578–587, 2019.
[10] Y. Liu, C. Shu, J. Wang, and C. Shen, "Structured knowledge distillation for dense prediction," IEEE Trans. Pattern Anal. Mach. Intell., 2020.
[11] Z. Tian, C. Shen, H. Chen, and T. He, "FCOS: Fully convolutional one-stage object detection," in Proc. IEEE Int. Conf. Comp. Vis., pp. 9627–9636, 2019.
[12] F. Liu, C. Shen, G. Lin, and I. Reid, "Learning depth from single monocular images using deep convolutional neural fields," IEEE Trans. Pattern Anal. Mach. Intell., 2016.
[13] W. Yin, Y. Liu, C. Shen, and Y. Yan, "Enforcing geometric constraints of virtual normal for depth prediction," in Proc. IEEE Int. Conf. Comp. Vis., 2019.
[14] J. Bian, Z. Li, N. Wang, H. Zhan, C. Shen, M.-M. Cheng, and I. Reid, "Unsupervised scale-consistent depth and ego-motion learning from monocular video," in Proc. Advances in Neural Inf. Process. Syst., pp. 35–45, 2019.
[15] L. Boominathan, S. Kruthiventi, and R. V. Babu, "CrowdNet: A deep convolutional network for dense crowd counting," in Proc. ACM Int. Conf. Multimedia, pp. 640–644, 2016.
[16] X. Jia, B. De Brabandere, T. Tuytelaars, and L. V. Gool, "Dynamic filter networks," in Proc. Advances in Neural Inf. Process. Syst., pp. 667–675, 2016.
[17] B. Yang, G. Bender, Q. V. Le, and J. Ngiam, "CondConv: Conditionally parameterized convolutions for efficient inference," in Proc. Advances in Neural Inf. Process. Syst., pp. 1305–1316, 2019.
[18] A. Kirillov, K. He, R. Girshick, C. Rother, and P. Dollár, "Panoptic segmentation," in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 9404–9413, 2019.
[19] L. Zhang, D. Xu, A. Arnab, and P. H. Torr, "Dynamic graph message passing networks," in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2020.
[20] K. Chen, J. Pang, J. Wang, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Shi, W. Ouyang, et al., "Hybrid task cascade for instance segmentation," in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 4974–4983, 2019.
[21] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, "Path aggregation network for instance segmentation," in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 8759–8768, 2018.
[22] Z. Huang, L. Huang, Y. Gong, C. Huang, and X. Wang, "Mask scoring R-CNN," in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 6409–6418, 2019.
[23] J. Dai, K. He, Y. Li, S. Ren, and J. Sun, "Instance-sensitive fully convolutional networks," in Proc. Eur. Conf. Comp. Vis., pp. 534–549, 2016.
[24] D. Neven, B. D. Brabandere, M. Proesmans, and L. V. Gool, "Instance segmentation by jointly optimizing spatial embeddings and clustering bandwidth," in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 8837–8845, 2019.
[25] A. Newell, Z. Huang, and J. Deng, "Associative embedding: End-to-end learning for joint detection and grouping," in Proc. Advances in Neural Inf. Process. Syst., pp. 2277–2287, 2017.
[26] A. Fathi, Z. Wojna, V. Rathod, P. Wang, H. O. Song, S. Guadarrama, and K. P. Murphy, "Semantic instance segmentation via deep metric learning," arXiv: Comp. Res. Repository, 2017.
[27] M. Bai and R. Urtasun, "Deep watershed transform for instance segmentation," in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 5221–5229, 2017.
[28] S. Liu, J. Jia, S. Fidler, and R. Urtasun, "SGN: Sequential grouping networks for instance segmentation," in Proc. IEEE Int. Conf. Comp. Vis., pp. 3496–3504, 2017.
[29] J. Uhrig, E. Rehder, B. Fröhlich, U. Franke, and T. Brox, "Box2Pix: Single-shot instance segmentation by assigning pixels to object boxes," in Proc. IEEE Intelligent Vehicles Symp., pp. 292–299, 2018.
[30] D. Novotny, S. Albanie, D. Larlus, and A. Vedaldi, "Semi-convolutional operators for instance segmentation," in Proc. Eur. Conf. Comp. Vis., pp. 86–102, 2018.
[31] A. Arnab and P. Torr, "Pixelwise instance segmentation with a dynamically instantiated network," in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 441–450, 2017.
[32] H. Chen, K. Sun, Z. Tian, C. Shen, Y. Huang, and Y. Yan, "BlendMask: Top-down meets bottom-up for instance segmentation," in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2020.
[33] X. Wang, T. Kong, C. Shen, Y. Jiang, and L. Li, "SOLO: Segmenting objects by locations," in Proc. Eur. Conf. Comp. Vis., 2020.
[34] X. Wang, R. Zhang, T. Kong, L. Li, and C. Shen, "SOLOv2: Dynamic and fast instance segmentation," in Proc. Advances in Neural Inf. Process. Syst., 2020.
[35] E. Xie, P. Sun, X. Song, W. Wang, D. Liang, C. Shen, and P. Luo, "PolarMask: Single shot instance segmentation with polar representation," in Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2020.
[36] T.-J. Yang, M. D. Collins, Y. Zhu, J.-J. Hwang, T. Liu, X. Zhang, V. Sze, G. Papandreou, and L.-C. Chen, "DeeperLab: Single-shot image parser," arXiv preprint arXiv:1902.05093, 2019.
[37] Q. Li, A. Arnab, and P. H. Torr, "Weakly- and semi-supervised panoptic segmentation," in Proc. Eur. Conf. Comp. Vis., pp. 102–118, 2018.
APPENDIX A
VISUALIZATION OF RESULTS
Here we provide some visualization results of our model.
Fig. A7 and Fig. A8 show some segmentation results of our
model on COCO for instance segmentation and panoptic
segmentation, respectively.
Fig. A9 shows some results where our model does not work very well for instance segmentation. In some cases, the COCO annotation itself is noisy, which may have confused our model; for example, in the third example in Fig. A9, the sailboat is incorrectly annotated. Occlusion in the last example also causes challenges.
Fig. A10 shows some panoptic segmentation results on which our model does not perform well.
Fig. A7 – More visualization of instance segmentation results on the COCO dataset (better viewed on screen). Color encodes
categories and instances. Here the model is ResNet-101-DCN with BiFPN.
Fig. A8 – More visualization of panoptic segmentation results on the COCO dataset (better viewed on screen). Here the model
is ResNet-101-DCN with standard FPN.
Fig. A9 – Some instance segmentation results where our model does not work very well, on the COCO dataset (better viewed
on screen). Left to right: input image, ground-truth labels, model’s predictions. In some cases (e.g., the last two examples), the
ground-truth annotation is incorrect or noisy.
Fig. A10 – Some panoptic segmentation results where our model does not work very well, on the COCO dataset (better viewed
on screen). Left to right: input image, ground-truth labels, model’s predictions. On those challenging cases, our model makes
plausible mistakes.