The One Hundred Layers Tiramisu: Fully Convolutional DenseNets for Semantic Segmentation
Simon Jégou 1, Michal Drozdzal 2,3, David Vazquez 1,4, Adriana Romero 1, Yoshua Bengio 1
1 Montreal Institute for Learning Algorithms, 2 École Polytechnique de Montréal,
3 Imagia Inc., Montréal, 4 Computer Vision Center, Barcelona
simon.jegou@gmail.com, michal@imagia.com, dvazquez@cvc.uab.es,
adriana.romero.soriano@umontreal.ca, yoshua.umontreal@gmail.com
1. Introduction
CNNs progressively reduce the spatial resolution of their input through pooling layers and, thus, are well suited for applications where a single prediction per input image is expected (e.g. the image classification task).

Fully Convolutional Networks (FCNs) [20, 27] were introduced in the literature as a natural extension of CNNs to tackle per-pixel prediction problems such as semantic image segmentation. FCNs add upsampling layers to standard CNNs to recover the spatial resolution of the input at the output layer. As a consequence, FCNs can process images of arbitrary size. In order to compensate for the resolution loss induced by pooling layers, FCNs introduce skip connections between their downsampling and upsampling paths. Skip connections help the upsampling path recover fine-grained information from the downsampling layers.

Among CNN architectures extended as FCNs for semantic segmentation purposes, Residual Networks (ResNets) [11] make an interesting case. ResNets are designed to ease the training of very deep networks (of hundreds of layers) by introducing a residual block that sums two signals: a non-linear transformation of the input and its identity mapping. The identity mapping is implemented by means of a shortcut connection. ResNets have been extended to work as FCNs [4, 8], yielding very good results on different segmentation benchmarks. ResNets incorporate additional paths to FCNs (shortcut paths) and, thus, increase the number of connections within a segmentation network. These additional shortcut paths have been shown not only to improve the segmentation accuracy but also to help the network optimization process, resulting in faster convergence of the training [8].

Recently, a new CNN architecture, called DenseNet, was introduced in [13]. DenseNets are built from dense blocks and pooling operations, where each dense block is an iterative concatenation of previous feature maps. This architecture can be seen as an extension of ResNets [11], which perform an iterative summation of previous feature maps. However, this small modification has some interesting implications: (1) parameter efficiency, DenseNets are more efficient in their parameter usage; (2) implicit deep supervision, DenseNets perform deep supervision thanks to short paths to all feature maps in the architecture (similar to Deeply Supervised Networks [18]); and (3) feature reuse, all layers can easily access their preceding layers, making it easy to reuse the information from previously computed feature maps. These characteristics make DenseNets a very good fit for semantic segmentation, as they naturally induce skip connections and multi-scale supervision.

In this paper, we extend DenseNets to work as FCNs by adding an upsampling path to recover the full input resolution. Naively building an upsampling path would result in a computationally intractable number of feature maps with very high resolution prior to the softmax layer. This is because one would multiply the high resolution feature maps with a large number of input filters (from all the layers below), resulting in both a very large amount of computation and a very large number of parameters. In order to mitigate this effect, we only upsample the feature maps created by the preceding dense block. Doing so keeps the number of dense blocks at each resolution of the upsampling path independent of the number of pooling layers. Moreover, given the network architecture, the upsampled dense block combines the information contained in the other dense blocks of the same resolution. The higher resolution information is passed by means of a standard skip connection between the downsampling and the upsampling paths. The details of the proposed architecture are shown in Figure 1. We evaluate our model on two challenging benchmarks for urban scene understanding, CamVid [2] and Gatech [22], and confirm the potential of DenseNets for semantic segmentation by improving the state of the art.

Thus, the contributions of the paper can be summarized as follows:

• We carefully extend the DenseNet architecture [13] to fully convolutional networks for semantic segmentation, while mitigating the feature map explosion.

• We highlight that the proposed upsampling path, built from dense blocks, performs better than an upsampling path built from more standard operations, such as the ones in [27].

• We show that such a network can outperform current state-of-the-art results on standard benchmarks for urban scene understanding without using pretrained parameters or any further post-processing.

2. Related Work

Recent advances in semantic segmentation have been devoted to improving architectural designs by (1) improving the upsampling path and increasing the connectivity within FCNs [27, 1, 21, 8]; (2) introducing modules to account for broader context understanding [36, 5, 37]; and/or (3) endowing FCN architectures with the ability to provide structured outputs [16, 5, 38].

First, different alternatives have been proposed in the literature to address the resolution recovery in the FCN upsampling path, from simple bilinear interpolation [10, 20, 1] to more sophisticated operators such as unpooling [1, 21] or transposed convolutions [20]. Skip connections from the downsampling to the upsampling path have also been adopted to allow for a finer information recovery [27]. More recently, [8] presented a thorough analysis of the combination of identity mappings [11] and long skip connections [27] for semantic segmentation.

Second, approaches that introduce larger context to semantic segmentation networks include [10, 36, 5, 37]. In [10], an unsupervised global image descriptor is computed and added to the feature maps for each pixel. In [36], Recurrent Neural Networks (RNNs) are used to retrieve contextual information by sweeping the image horizontally and vertically in both directions. In [5], dilated convolutions are introduced as an alternative to late CNN pooling layers to capture larger context without reducing the image resolution. Following the same spirit, [37] propose to provide FCNs with a context module built as a stack of dilated convolutional layers to enlarge the field of view of the network.

Third, Conditional Random Fields (CRFs) have long been a popular choice to enforce structure consistency in segmentation outputs. More recently, fully connected CRFs [16] have been used to include structural properties of the output of FCNs [5]. Interestingly, in [38], RNNs have been introduced to approximate mean-field iterations of CRF optimization, allowing for end-to-end training of both the FCN and the RNN.

Finally, it is worth noting that current state-of-the-art FCN architectures for semantic segmentation often rely on pre-trained models (e.g. VGG [29] or ResNet101 [11]) to improve their segmentation results [20, 1, 4].
3. Fully Convolutional DenseNets

As mentioned in Section 1, FCNs are built from a downsampling path, an upsampling path and skip connections. Skip connections help the upsampling path recover spatially detailed information from the downsampling path by reusing feature maps. The goal of our model is to further exploit feature reuse by extending the more sophisticated DenseNet architecture, while avoiding the feature explosion at the upsampling path of the network.
In this section, we detail the proposed model for semantic segmentation. First, we review the recently proposed DenseNet architecture. Second, we introduce the construction of the novel upsampling path and discuss its advantages w.r.t. a naive DenseNet extension and more classical architectures. Finally, we wrap up with the details of the main architecture used in Section 4.

3.1. Review of DenseNets

Let x_ℓ be the output of the ℓ-th layer. In a standard CNN, x_ℓ is computed by applying a non-linear transformation H_ℓ to the output of the previous layer x_{ℓ−1}:

x_ℓ = H_ℓ(x_{ℓ−1}),    (1)

where H is commonly defined as a convolution followed by a rectifier non-linearity (ReLU) and often dropout [30].
In order to ease the training of very deep networks, ResNets [11] introduce a residual block that sums the identity mapping of the input to the output of a layer. The resulting output x_ℓ becomes

x_ℓ = H_ℓ(x_{ℓ−1}) + x_{ℓ−1},    (2)

allowing for the reuse of features and permitting the gradient to flow directly to earlier layers. In this case, H is defined as the repetition (2 or 3 times) of a block composed of Batch Normalization (BN) [14], followed by ReLU and a convolution.
Pushing this idea further, DenseNets [13] design a more sophisticated connectivity pattern that iteratively concatenates all feature outputs in a feedforward fashion. Thus, the output of the ℓ-th layer is defined as

x_ℓ = H_ℓ([x_{ℓ−1}, x_{ℓ−2}, ..., x_0]),    (3)

where [ ... ] represents the concatenation operation. In this case, H is defined as BN, followed by ReLU, a convolution and dropout. Such a connectivity pattern strongly encourages the reuse of features and makes all layers in the architecture receive a direct supervision signal. The output of each layer ℓ has k feature maps, where k, hereafter referred to as the growth rate parameter, is typically set to a small value (e.g. k = 12). Thus, the number of feature maps in DenseNets grows linearly with the depth (e.g. after ℓ layers, the input [x_{ℓ−1}, x_{ℓ−2}, ..., x_0] will have ℓ × k feature maps).
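To make the composite function H_ℓ and the iterative concatenation of Eq. (3) concrete, here is a minimal PyTorch sketch; it is not the authors' implementation, and the class name, toy input and number of layers are illustrative assumptions.

```python
# Minimal sketch of a DenseNet layer H_l (BN -> ReLU -> 3x3 conv -> dropout)
# and of the connectivity of Eq. (3). Illustrative only; the growth rate k
# and dropout p follow the values quoted in the text.
import torch
import torch.nn as nn


class DenseLayer(nn.Module):
    """H_l: Batch Normalization -> ReLU -> 3x3 same convolution -> dropout."""

    def __init__(self, in_channels: int, growth_rate: int, p: float = 0.2):
        super().__init__()
        self.bn = nn.BatchNorm2d(in_channels)
        self.conv = nn.Conv2d(in_channels, growth_rate, kernel_size=3, padding=1)
        self.drop = nn.Dropout2d(p)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.drop(self.conv(torch.relu(self.bn(x))))


# Eq. (3): every layer sees the concatenation of all previous outputs,
# so the stack grows by k feature maps per layer.
k = 12
layers = [DenseLayer(3 + i * k, k) for i in range(4)]
x = torch.randn(1, 3, 64, 64)                # x_0 with 3 feature maps
for layer in layers:
    x = torch.cat([layer(x), x], dim=1)      # [x_l, x_{l-1}, ..., x_0]
print(x.shape)                               # torch.Size([1, 51, 64, 64])
```

With k = 12 and four layers, the concatenated stack grows by 4 × k = 48 maps on top of the input, which is exactly the linear growth described above.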
A transition down is introduced to reduce the spatial dimensionality of the feature maps. Such a transformation is composed of a 1 × 1 convolution (which conserves the number of feature maps) followed by a 2 × 2 pooling operation.
In the remainder of the article, we will call dense block the concatenation of the new feature maps created at a given resolution. Figure 2 shows an example of dense block construction. Starting from an input x_0 (input image or output of a transition down) with m feature maps, the first layer of the block generates an output x_1 of dimension k by applying H_1(x_0). These k feature maps are then stacked to the previous m feature maps by concatenation ([x_1, x_0]) and used as input to the second layer. The same operation is repeated n times, leading to a new dense block with n × k feature maps.

Figure 2. Diagram of a dense block of 4 layers. A first layer is applied to the input to create k feature maps, which are concatenated to the input. A second layer is then applied to create another k feature maps, which are again concatenated to the previous feature maps. The operation is repeated 4 times. The output of the block is the concatenation of the outputs of the 4 layers, and thus contains 4 × k feature maps.
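Building on the same conventions, a dense block and a transition down could be sketched as follows. This is an assumed PyTorch rendering, not the paper's released code; in particular, the block returns only the n × k feature maps it creates, matching the definition of a dense block given above.

```python
# Sketch of a dense block (returns the n * k newly created feature maps) and
# of a transition down (BN -> ReLU -> 1x1 conv -> dropout -> 2x2 max pooling).
# Names and default hyper-parameters are assumptions consistent with the text.
import torch
import torch.nn as nn


def dense_layer(in_ch: int, growth_rate: int, p: float = 0.2) -> nn.Sequential:
    # H_l: BN -> ReLU -> 3x3 same convolution -> dropout.
    return nn.Sequential(
        nn.BatchNorm2d(in_ch),
        nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, growth_rate, kernel_size=3, padding=1),
        nn.Dropout2d(p),
    )


class DenseBlock(nn.Module):
    def __init__(self, in_ch: int, growth_rate: int, n_layers: int):
        super().__init__()
        self.layers = nn.ModuleList(
            [dense_layer(in_ch + i * growth_rate, growth_rate) for i in range(n_layers)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        new_maps = []
        for layer in self.layers:
            out = layer(x)                   # k new feature maps
            new_maps.append(out)
            x = torch.cat([out, x], dim=1)   # input of the next layer
        return torch.cat(new_maps, dim=1)    # the dense block: n * k maps


def transition_down(channels: int, p: float = 0.2) -> nn.Sequential:
    # 1x1 convolution conserves the number of maps; pooling halves the resolution.
    return nn.Sequential(
        nn.BatchNorm2d(channels),
        nn.ReLU(inplace=True),
        nn.Conv2d(channels, channels, kernel_size=1),
        nn.Dropout2d(p),
        nn.MaxPool2d(kernel_size=2),
    )
```

In the downsampling path, the m input maps and the n × k block output are concatenated before the transition down; in the upsampling path, as described next, only the block output is upsampled.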
3.2. From DenseNets to Fully Convolutional DenseNets

The DenseNet architecture described in Subsection 3.1 constitutes the downsampling path of our Fully Convolutional DenseNet (FC-DenseNet). Note that, in the downsampling path, the linear growth in the number of features is compensated by the reduction in spatial resolution of each feature map after the pooling operation. The last layer of the downsampling path is referred to as the bottleneck.
In order to recover the input spatial resolution, FCNs introduce an upsampling path composed of convolutions, upsampling operations (transposed convolutions or unpooling operations) and skip connections. In FC-DenseNets, we substitute the convolution operation with a dense block and an upsampling operation referred to as transition up. Transition up modules consist of a transposed convolution that upsamples the previous feature maps. The upsampled feature maps are then concatenated with the ones coming from the skip connection to form the input of a new dense block. Since the upsampling path increases the spatial resolution of the feature maps, the linear growth in the number of features would be too memory demanding, especially for the full resolution features in the pre-softmax layer.
In order to overcome this limitation, the input of a dense block is not concatenated with its output. Thus, the transposed convolution is applied only to the feature maps obtained by the last dense block and not to all the feature maps concatenated so far. The last dense block summarizes the information contained in all the previous dense blocks at the same resolution. Note that some information from earlier dense blocks is lost in the transition down due to the pooling operation. Nevertheless, this information is available in the downsampling path of the network and can be passed via skip connections. Hence, the dense blocks of the upsampling path are computed using all the available feature maps at a given resolution. Figure 1 illustrates this idea in detail.
Therefore, our upsampling path approach allows us to build very deep FC-DenseNets without a feature map explosion. An alternative way of implementing the upsampling path would be to perform consecutive transposed convolutions and complement them with skip connections from the downsampling path in a U-Net [27] or FCN-like [20] fashion. This will be further discussed in Section 4.
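One upsampling step could then be sketched as below; padding choices and channel sizes are assumptions for illustration rather than values taken from the paper.

```python
# Sketch of a transition up followed by concatenation with the skip connection.
# Only the feature maps of the preceding dense block are upsampled; the skip
# connection re-injects the higher-resolution maps from the downsampling path.
import torch
import torch.nn as nn


def transition_up(channels: int) -> nn.ConvTranspose2d:
    # 3x3 transposed convolution, stride 2, to undo one 2x2 pooling.
    # output_padding=1 is an assumed choice that keeps even spatial sizes.
    return nn.ConvTranspose2d(channels, channels, kernel_size=3, stride=2,
                              padding=1, output_padding=1)


block_output = torch.randn(1, 192, 28, 28)   # n * k maps from the last dense block
skip = torch.randn(1, 256, 56, 56)           # maps stored in the downsampling path

upsampled = transition_up(192)(block_output)        # (1, 192, 56, 56)
next_block_input = torch.cat([upsampled, skip], 1)  # (1, 448, 56, 56)
```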
3.3. Semantic Segmentation Architecture

In this subsection, we detail the main architecture, FC-DenseNet103, used in Section 4.
First, in Table 1, we define the dense block layer, transition down and transition up of the architecture. Dense block layers are composed of BN, followed by ReLU, a 3 × 3 same convolution (no resolution loss) and dropout with probability p = 0.2. The growth rate of the layer is set to k = 16. Transition down is composed of BN, followed by ReLU, a 1 × 1 convolution, dropout with p = 0.2 and a non-overlapping max pooling of size 2 × 2. Transition up is composed of a 3 × 3 transposed convolution with stride 2 to compensate for the pooling operation.
Second, in Table 2, we summarize all the FC-DenseNet103 layers. This architecture is built from 103 convolutional layers: a first one on the input, 38 in the downsampling path, 15 in the bottleneck and 38 in the upsampling path. We use 5 Transition Down (TD) modules, each one containing one extra convolution, and 5 Transition Up (TU) modules, each one containing a transposed convolution. The final layer in the network is a 1 × 1 convolution followed by a softmax non-linearity to provide the per-class distribution at each pixel.
It is worth noting that, as discussed in Subsection 3.2, the proposed upsampling path properly mitigates the DenseNet feature map explosion, leading to a reasonable number of 256 feature maps before the softmax.
Finally, the model is trained by minimizing the pixel-wise cross-entropy loss.
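As a quick sanity check, the counts above do add up to the 103 convolutions in the model's name; the tally below is one consistent way to break them down, and the cross-entropy call is an illustrative API choice rather than the authors' training code.

```python
# Tally of the 103 convolutional layers quoted in the text, plus an example of
# the pixel-wise cross-entropy loss on dummy logits for the 11 CamVid classes.
import torch
import torch.nn as nn

n_convs = (
    1       # first convolution applied to the input
    + 38    # dense-block layers in the downsampling path
    + 15    # dense-block layers in the bottleneck
    + 38    # dense-block layers in the upsampling path
    + 5     # one 1x1 convolution per Transition Down
    + 5     # one transposed convolution per Transition Up
    + 1     # final 1x1 convolution before the softmax
)
assert n_convs == 103

logits = torch.randn(2, 11, 64, 64)          # (batch, classes, height, width)
targets = torch.randint(0, 11, (2, 64, 64))  # per-pixel class indices
loss = nn.CrossEntropyLoss()(logits, targets)
```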
4. Experiments

We evaluate our method on two urban scene understanding datasets: CamVid [2] and Gatech [22]. We trained our models from scratch, without using any extra data or post-processing module. We report the results using the Intersection over Union (IoU) metric and the global accuracy (pixel-wise accuracy on the dataset). For a given class c, predictions (o_i) and targets (y_i), the IoU is defined by

IoU(c) = Σ_i (o_i == c ∧ y_i == c) / Σ_i (o_i == c ∨ y_i == c),    (4)

where ∧ is a logical and operation, while ∨ is a logical or operation. We compute IoU by summing over all the pixels i of the dataset.
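Equation (4) could be computed as in the sketch below (a minimal NumPy rendering; the single arrays stand in for the predictions and targets accumulated over the whole dataset).

```python
# Per-class IoU as in Eq. (4): logical AND for the intersection, logical OR
# for the union, accumulated over all pixels passed in.
import numpy as np


def iou_per_class(predictions: np.ndarray, targets: np.ndarray, n_classes: int):
    """predictions and targets are integer class maps of identical shape."""
    ious = []
    for c in range(n_classes):
        intersection = np.logical_and(predictions == c, targets == c).sum()
        union = np.logical_or(predictions == c, targets == c).sum()
        ious.append(intersection / union if union > 0 else float("nan"))
    return ious


# Example on random label maps for the 11 CamVid classes.
pred = np.random.randint(0, 11, size=(4, 64, 64))
gt = np.random.randint(0, 11, size=(4, 64, 64))
print(iou_per_class(pred, gt, n_classes=11))
```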
Layer               | Transition Down (TD) | Transition Up (TU)
--------------------|----------------------|------------------------------------------
Batch Normalization | Batch Normalization  | 3 × 3 Transposed Convolution, stride = 2
ReLU                | ReLU                 |
3 × 3 Convolution   | 1 × 1 Convolution    |
Dropout p = 0.2     | Dropout p = 0.2      |
                    | 2 × 2 Max Pooling    |

Table 1. Building blocks of fully convolutional DenseNets. From left to right: layer used in the model, Transition Down (TD) and Transition Up (TU). See text for details.
Model                  | Pretrained | #Params (M) | Building | Tree | Sky  | Car  | Sign | Road | Pedestrian | Fence | Pole | Sidewalk | Cyclist | Mean IoU | Global accuracy
-----------------------|------------|-------------|----------|------|------|------|------|------|------------|-------|------|----------|---------|----------|----------------
SegNet [1]             | Yes        | 29.5        | 68.7     | 52.0 | 87.0 | 58.5 | 13.4 | 86.2 | 25.3       | 17.9  | 16.0 | 60.5     | 24.8    | 46.4     | 62.5
Bayesian SegNet [15]   | Yes        | 29.5        | n/a      | n/a  | n/a  | n/a  | n/a  | n/a  | n/a        | n/a   | n/a  | n/a      | n/a     | 63.1     | 86.9
DeconvNet [21]         | Yes        | 252         | n/a      | n/a  | n/a  | n/a  | n/a  | n/a  | n/a        | n/a   | n/a  | n/a      | n/a     | 48.9     | 85.9
Visin et al. [36]      | Yes        | 32.3        | n/a      | n/a  | n/a  | n/a  | n/a  | n/a  | n/a        | n/a   | n/a  | n/a      | n/a     | 58.8     | 88.7
FCN8 [20]              | Yes        | 134.5       | 77.8     | 71.0 | 88.7 | 76.1 | 32.7 | 91.2 | 41.7       | 24.4  | 19.9 | 72.7     | 31.0    | 57.0     | 88.0
DeepLab-LFOV [5]       | Yes        | 37.3        | 81.5     | 74.6 | 89.0 | 82.2 | 42.3 | 92.2 | 48.4       | 27.2  | 14.3 | 75.4     | 50.1    | 61.6     | −
Dilation8 [37]         | Yes        | 140.8       | 82.6     | 76.2 | 89.0 | 84.0 | 46.9 | 92.2 | 56.3       | 35.8  | 23.4 | 75.3     | 55.5    | 65.3     | 79.0
Dilation8 + FSO [17]   | Yes        | 140.8       | 84.0     | 77.2 | 91.3 | 85.6 | 49.9 | 92.5 | 59.1       | 37.6  | 16.9 | 76.0     | 57.2    | 66.1     | 88.3
Classic Upsampling     | No         | 20          | 73.5     | 72.2 | 92.4 | 66.2 | 26.9 | 90.0 | 37.7       | 22.7  | 30.8 | 69.6     | 25.1    | 55.2     | 86.8
FC-DenseNet56 (k=12)   | No         | 1.5         | 77.6     | 72.0 | 92.4 | 73.2 | 31.8 | 92.8 | 37.9       | 26.2  | 32.6 | 79.9     | 31.1    | 58.9     | 88.9
FC-DenseNet67 (k=16)   | No         | 3.5         | 80.2     | 75.4 | 93.0 | 78.2 | 40.9 | 94.7 | 58.4       | 30.7  | 38.4 | 81.9     | 52.1    | 65.8     | 90.8
FC-DenseNet103 (k=16)  | No         | 9.4         | 83.0     | 77.3 | 93.0 | 77.3 | 43.9 | 94.5 | 59.6       | 37.1  | 37.8 | 82.2     | 50.5    | 66.9     | 91.5

Table 3. Results on the CamVid dataset. Note that we trained our own pretrained FCN8 model.