Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 14 November 2022
doi:10.20944/preprints202211.0226.v1
Disclaimer/Publisher’s Note: The statements, opinions, and data contained in all publications are solely those of the individual author(s) and
contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting
from any ideas, methods, instructions, or products referred to in the content.
A Study of Deep Learning Model Performance
for Remote Sensing Image Segmentation
Teodora Selea, Gabriel Iuhasz and Marian Neagul
Abstract—Deep Learning is an extremely important research
topic in Earth Observation. Current use-cases range from semantic image segmentation and object detection to more common
problems found in computer vision such as object identification.
Earth Observation is an excellent source for different types
of problems and data for Machine Learning in general and
Deep Learning in particular. It can be argued that both Earth
Observation and Deep Learning as fields of research will benefit
greatly from this recent trend of research.
In this paper we take several state of the art Deep Learning
network topologies and provide a detailed analysis of their performance for semantic image segmentation for building footprint
detection. The dataset used is comprised of high resolution images
depicting urban scenes. We focused on single model performance
on simple RGB images.
In most situations several methods have been applied to increase
the accuracy of prediction when using deep learning such as
ensembling, alternating between optimisers during training and
using pretrained weights to bootstrap new models. These methods,
although effective, are not indicative of single model performance.
Instead, in this paper, we present different variations of
these state of the art topologies and study how these variations
affect both training convergence and out of sample, single model,
performance.
Index Terms—Deep learning; convolutional neural networks;
remote sensing
I. INTRODUCTION
Ever since the breakthrough of AlexNet [1] at the ImageNet
classification challenge, convolutional neural networks (CNNs)
have dominated the field of computer vision. Since then, several
new CNN architectures have appeared, such as VGG [2], ResNet [3]
and Inception [4], that serve as base models for the further
development of CNN topologies. Current computer vision
tasks include image classification, semantic segmentation and
instance segmentation. In image classification, one class label
is attached to every image. As opposed to this, in semantic
segmentation we aim to classify each individual pixel from an
image and therefore identify pixels that belong to the same
object.
The continuously increasing volume of acquired remote
sensing data has generated the need for automatic methods
of information extraction. This is also true in the case
of semantic image segmentation. Valuable insight may be
obtained, regarding land cover, urban planning, geographic
mapping, change detection, etc. In this paper, we focus on
the problem of semantic segmentation applied on satellite or
aerial imagery. We analyze and compare different semantic
segmentation architectures and their behaviour on satellite
image data.
T. Selea, G. Iuhasz and M. Neagul are with Institute e-Austria, Timisoara,
Romania and with Faculty of Mathematics and Informatics, West University
of Timisoara, Romania
In this article we focus on single model predictive performance. We compare 4 state-of-the-art deep learning topologies
used for semantic image segmentation. Our goal is to modify
each individual topology so that we maximize the overall predictive performance. We also provide an implementation for
our best performing models, along with their trained weights.
In section II we present an overview of the current state of
the art for the semantic image segmentation domain, focusing
on the aforementioned topologies. In section III we start by
describing the datasets and our experimental methodology
related to data ingestion and training parameters. Section IV
contains details of the topological modification of the deep
learning models as well as the results of each experiment.
Finally, section V contains our conclusions and plans for future
work.
II. EARTH OBSERVATION AND DEEP LEARNING
The increasing interest in deep learning techniques applied
to remote sensing data may be seen in the number of publications on the topic from the past few years [5].
The semantic segmentation task has also been the topic of several Earth Observation challenges on Kaggle¹, CrowdAI²,
SpaceNet³,⁴ [6], DeepGlobe [7] and ISPRS [8]. The labeled datasets
released by these competitions vary, from the objects
to be identified to the types of data provided (RGB,
IRRG, RGBIR, DSM). In this paper, we focus on the building
detection problem, using only RGB data, provided by the
Urban 3D [9] challenge, due to the considerable size of the
training and testing sets. For the testing phase our results are
also validated against the ISPRS Potsdam dataset.
Semantic segmentation in Earth Observation
Fully Convolutional Networks (FCN) [10] were designed
to provide a fast and accurate solution to the problem of
semantic segmentation. They are based on state of the art
classification networks that provide a dense pixel prediction.
They build on the idea of replacing all dense, fully connected
layers with convolutional layers, adding a 1 × 1 convolution
with the channel dimension equal to the number of classes to
predict. Finally, a deconvolutional layer (backwards convolution) is appended to upsample the result to its initial size. The
1 https://www.kaggle.com/c/planet-understanding-the-amazon-from-space
2 https://www.crowdai.org/challenges/mapping-challenge
3 https://spacenetchallenge.github.io/Competitions/Competition2.html
4 https://spacenetchallenge.github.io/datasets/datasetHomePage.html
© 2022 by the author(s). Distributed under a Creative Commons CC BY license.
authors also introduce skip connections, as a way to refine the
prediction. Skip connections enable feature fusion, combining
information extracted by the lower layers using finer strides
with the features extracted by the final coarse layers.
U-Net [11] is a semantic segmentation network based on
FCN and upon the principle of skip connections and feature
fusion. Another distinction when compared with FCN, is
the increased number of feature channels in the upsampling
part. The network is composed of a contractive part and an
expansive part, having a similar number of feature channels
in both parts, therefore resulting in a "U" shaped network.
The feature extraction part downsamples the data and doubles the number of channels at each step. The upsampling
divides the number of channels by a factor of two. Skip
connections are made between each convolution block that
extracted the features and the corresponding upsampling layer,
fusing features at different resolutions, just like in FCN. The
network uses concatenation to merge together the corresponding features. Although initially designed for bio-medical image
segmentation, U-Net was successfully used on remote sensing
images.
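The doubling and halving of feature channels described above can be traced with a small Python sketch. The base width of 64 channels and the depth of 4 levels are illustrative assumptions, and the function name is hypothetical; the sketch only shows how encoder and decoder widths pair up for the skip connections.

```python
def unet_channel_plan(base_channels=64, depth=4):
    """Channel counts of a U-shaped network (base_channels and depth are assumptions)."""
    # Contractive part: the channel count doubles at every downsampling step.
    encoder = [base_channels * 2 ** level for level in range(depth)]
    # Widest point of the "U", reached after the last downsampling step.
    bottleneck = base_channels * 2 ** depth
    # Expansive part: the channel count is halved at every upsampling step.
    decoder = [bottleneck // 2 ** (level + 1) for level in range(depth)]
    return encoder, bottleneck, decoder


encoder, bottleneck, decoder = unet_channel_plan()
# Each decoder level is fused with the encoder level of the same width.
skip_pairs = list(zip(reversed(encoder), decoder))
print(encoder, bottleneck, decoder)
print(skip_pairs)
```

Because both halves use the same widths, every skip connection joins feature maps with matching channel counts, which is what gives the network its symmetric shape.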
In [12] the authors separately trained a U-Net model for
identifying each class from their dataset, in combination
with augmentation techniques and reflectance indices. The U-Net
architecture was also used by the winner of the Kaggle DSTL⁵
competition.
W-Net [13] is a medical image segmentation network, built
from two U-Nets, put together side by side. The authors
investigate the use of activation functions and different methods
of bridging the two U-Nets. One particularity of W-Net is the
mixing of two different activation functions, ReLU and
ELU, which provides better results than using only
one of the activation functions. ELU activation is mainly
used on the edges of the "U" shaped network and, as it gets
saturated when the network goes deeper, it is replaced by
ReLU in the middle part of the network. The authors also
analyze concatenation versus addition as bridging methods and
conclude in favor of concatenation.
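The difference between the two bridging methods can be shown on toy feature maps. This is a minimal sketch in plain Python, assuming a feature map is a list of channels and each channel is a list of rows; the function names are hypothetical and not taken from [13].

```python
def fuse_add(a, b):
    """Element-wise addition; both inputs must have the same number of channels."""
    if len(a) != len(b):
        raise ValueError("addition requires equal channel counts")
    return [
        [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(ch_a, ch_b)]
        for ch_a, ch_b in zip(a, b)
    ]


def fuse_concat(a, b):
    """Concatenation along the channel axis; the channel counts add up."""
    return a + b


encoder_features = [[[1, 2], [3, 4]]]      # one 2x2 channel
decoder_features = [[[10, 20], [30, 40]]]  # one 2x2 channel

added = fuse_add(encoder_features, decoder_features)
concatenated = fuse_concat(encoder_features, decoder_features)
print(len(added), len(concatenated))
```

Addition keeps the channel count unchanged and mixes the two sources immediately, while concatenation keeps both sources intact and leaves the mixing to the following convolution, at the cost of more channels.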
Segnet [14] is another popular semantic segmentation network, originally designed for road scene understanding. Just
like U-Net, Segnet has an encoder-decoder structure, one for
extracting features and the other for re-sampling the image
to its original size. The encoding part of Segnet is based
on the VGG [2] network, by removing the fully connected
layers from the end of the network. Segnet's main contribution
lies in the upsampling part of the network. The authors use
knowledge from the encoding layers to refine the upsampling,
but instead of transferring the whole feature maps as in U-Net, they propose to only use the pooling indices. By doing
so, Segnet reduces memory usage. Just like U-Net,
Segnet also includes convolutional layers in the decoder part,
in order to make the feature maps obtained by the upsampling
layers denser.
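Upsampling with pooling indices can be illustrated in plain Python. This is a minimal sketch, assuming non-overlapping 2 × 2 pooling on a single channel whose sides are divisible by the pool size; the function names are hypothetical and the code is not the Segnet implementation.

```python
def max_pool_with_indices(fmap, size=2):
    """Non-overlapping max pooling that also records where each maximum was found."""
    pooled, indices = [], []
    for r in range(0, len(fmap), size):
        pooled_row, index_row = [], []
        for c in range(0, len(fmap[0]), size):
            window = [
                (fmap[r + dr][c + dc], (r + dr, c + dc))
                for dr in range(size)
                for dc in range(size)
            ]
            value, position = max(window, key=lambda item: item[0])
            pooled_row.append(value)
            index_row.append(position)
        pooled.append(pooled_row)
        indices.append(index_row)
    return pooled, indices


def unpool(pooled, indices, height, width):
    """Put each pooled value back at its recorded position; all other cells stay zero."""
    out = [[0] * width for _ in range(height)]
    for pooled_row, index_row in zip(pooled, indices):
        for value, (r, c) in zip(pooled_row, index_row):
            out[r][c] = value
    return out


fmap = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [6, 0, 3, 1],
]
pooled, indices = max_pool_with_indices(fmap)
restored = unpool(pooled, indices, height=4, width=4)
print(pooled)
print(restored)
```

Only one position per pooling window has to be stored, instead of a full-resolution feature map. The restored map is sparse, which is why the decoder follows the upsampling with convolutions that densify it.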
Bayesian Segnet [15] is an enhanced version of the original
Segnet, where the authors improved the network by adding
dropout [16] as a regularization technique. The authors investigated different configurations of the dropout layer: after every convolutional layer, only after the encoder/decoder blocks,
or combined, settling for a configuration using a dropout
probability of 0.5 inserted after the max-pooling layer after
the three innermost encoder blocks.
5 https://www.kaggle.com/c/dstl-satellite-imagery-feature-detection
Segnet was used on the ISPRS dataset [17], in combination
with an edge detection network (HED) [18]. Two networks are
trained individually, one with color channels and the other with
DEM data and their results are fused by concatenation at the
end. The authors introduce HED at the beginning of the
training process, to extract boundary information of the objects
from the input images. Then, the edge information is appended
as an additional channel to both semantic segmentation models.
An improvement is also obtained by using multi-scale training
and ensemble models. Apart from Segnet, an FCN
network [19] is used in the process, also trained with boundary
information. In the end, the model results are averaged before
the final prediction.
HSN [20] is a lightweight network, designed on the encoder-decoder principle, with few trainable parameters. The network
was developed for semantic segmentation on satellite images,
for recognizing objects like buildings, trees, low vegetation
and cars. HSN combines simple convolutional layers with
residual blocks [3], deconvolution and inception modules [4].
The Inception modules allow multi-scale inference and are
used both in the encoder and decoder part of the network. Skip
connections are also used to transfer knowledge towards the
upsampling layers, but (as opposed to the previously mentioned
networks) the feature maps are first passed through residual
blocks and then fused in the decoding step. Deconvolutional
layers are used to gradually upsample the feature maps, followed by simple convolutional layers, similar to other encoder-decoder networks.
An accuracy gain may be obtained not only by a careful
choice of activation functions, but also from using regularization techniques like batch normalization [21] or Dropout [16].
In [22] the authors analyze the interaction between batch normalization layers and activation layers for classification networks trained on the ImageNet dataset. The authors recommend
using ELU without batch normalization or ReLU with batch
normalization. The position of batch normalization
before or after activation is also discussed, however, the results
are inconsistent and therefore we aim to study the behavior of
semantic segmentation networks in both cases.
Earth Observation segmentation data sets
Data sets for semantic image segmentation are gaining
popularity and are almost on par, in terms of quantity, with
those for image classification. This trend has also been
observed for semantic image segmentation data sets built
from satellite imagery.
The Dstl⁶ satellite imagery feature detection data set (segmentation) provides 1km×1km scenes in both 3 bands (RGB)
and 16 bands (multispectral and shortwave infrared). The RGB
resolution is 0.31m while the multispectral and shortwave
infrared is 1.24m and 7.5m respectively. The ground truth
6 https://www.kaggle.com/c/dstl-satellite-imagery-feature-detection/data
contains 10 classes, ranging from man-made objects (from buildings to small and large vehicles) to naturally occurring objects
such as waterways and trees. The total data set size is 20GB
(lossless GeoTIFF format) and the evaluation for the Kaggle
competition used the average Jaccard index.
The SpaceNet⁷ data sets are semantic segmentation data
sets for both buildings and roads. The areas of interest
cover various urban areas (Rio, Vegas, Paris, Shanghai and
Khartoum) and are comprised of 3-band RGB and 8-band
multispectral images provided by DigitalGlobe. The spatial
resolution ranges from 30cm to 50cm. The scene size also
varies with the type of segmentation, being approximately 400m×400m for road segmentation and 200m×200m
for building segmentation. The total size of the available data
sets combining all areas is over 110GB.
Another interesting dataset from SpaceNet is the so-called
SpaceNet Off-Nadir⁸ data set. It contains 120,000 building
footprints in 27 scenes of 450m × 450m. These scenes are split
into 3 categories based on their off-nadir angle: nadir
(0 to 25 degrees), off-nadir (26 to 40 degrees) and very off-nadir
(40 to 55 degrees). The goal of this data set was to see how well
off-nadir images can be processed using current state of the
art methods in Earth Observation. The total size of the data set is
approximately 186GB.
The Inria aerial image labeling data set⁹ contains 180 color
images, each covering 1500m × 1500m at 30cm resolution.
There are 36 tiles for each of five regions (Austin, Chicago,
Kitsap County, Western Tyrol, Vienna).
For land and crop cover segmentation
there are also several valuable data sets. The Land Cover data set
of Slovenia¹⁰, created by Sinergise, is comprised of a temporal
stack of Sentinel-2 multispectral images at 10m resolution.
The ground truth contains 10 classes and the total size of the
data set is approximately 187GB. The agricultural crop cover
challenge¹¹ uses Landsat 8 images at a resolution of 30m to
classify corn and soybean coverage. There is a temporal stack
of images covering a period of 6 years for a total of 32 scenes
(25% used for validation).
III. EXPERIMENTS
As discussed in the previous section, there is a
wide range of annotated Earth Observation datasets geared towards the semantic segmentation problem. Most of the current
work aims to maximize the
accuracy score by any means necessary, mostly by using several
models and/or transfer learning methods. For our experiments
we have chosen the dataset from the Urban3D challenge. In
this section we discuss the details of our experiments.
Urban 3D Dataset: The USSOCOM Urban3D Challenge
[9] (sample in Figure 1) released a large-scale remote sensing
dataset, containing orthorectified 2D data and 3D Digital
7 https://spacenetchallenge.github.io/datasets/datasetHomePage.html
8 https://spacenetchallenge.github.io/datasets/spacenet-OffNadirsummary.html
9 https://project.inria.fr/aerialimagelabeling/contest/
10 http://eo-learn.sentinel-hub.com/
11 https://www.crowdanalytix.com/contests/agricultural-crop-coverclassification-challenge
Surface Model images, with annotated building footprints. The
goal of this competition was to improve automated building
footprint extraction from satellite imagery, reducing the currently significant manual effort needed to obtain high geospatial
accuracy.
The dataset is composed of 174 training scenes and 62 testing scenes, each of 2048 × 2048 pixel resolution, obtained at
approximately 0.5 meters ground sample distance. The dataset
may be downloaded from the Spacenet repository mentioned in
section II which contains images from Jacksonville, Florida,
USA; Tampa, Florida, USA; and Richmond, Virginia, USA
totaling approximately 157,000 building footprints.
During the initial competition the question of the ground
truth accuracy was raised. Some of the provided scene ground
truths had missing or superfluous building footprints. Initial
estimates put the corruption of the ground truth at as high as
5-10%. However, it has been found that the error is closer to
1-2% at worst in the training set while the testing set ground
truth has been rectified.
ISPRS Potsdam Dataset: The ISPRS Potsdam dataset [8]
is a benchmark dataset released by ISPRS, consisting of
orthorectified 2D data and a 3D Digital Surface Model. The
2D data is composed of 4 different channels: Red (R), Green (G),
Blue (B) and Infrared (IR).
The dataset is composed of 38 patches, based on a
ground sampling distance of 5cm.
Scoring: The scores presented in this paper are computed
using the F1 score, as seen in equation 2, which is based on
the true positive (TP), false positive (FP) and
false negative (FN) counts; the true negative (TN) count additionally enters the accuracy in equation 4. Our experiments are based solely on the
provided RGB imagery.
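$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN} \tag{1}$$
$$F1\ \mathrm{score} = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{2}$$
$$\kappa \equiv \frac{p_o - p_e}{1 - p_e} = 1 - \frac{1 - p_o}{1 - p_e} \tag{3}$$
$$p_o = \frac{TP + TN}{N} \tag{4}$$
$$p_e = \frac{1}{N^2} \sum_{k=1}^{C} n_{k1}\, n_{k2} \tag{5}$$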
The Cohen Kappa coefficient is a statistical score which
measures inter-rater agreement. It also takes into account the
possibility of the agreement occurring by chance. In
equation 3, $p_o$ represents the observed agreement
(accuracy, equation 4), while $p_e$ (equation 5) represents the
hypothetical probability of chance agreement, where $N$ is
the total number of samples, $C$ is the number of classes and $n_{ki}$
is the number of times rater $i$ predicted class $k$ (here the two raters are the ground truth and the model prediction). Although we
calculate this coefficient, we use the F1 score as the deciding
factor for the best performing models.
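For the binary building versus background case, the scores can be computed directly from the four confusion counts. The sketch below is illustrative plain Python, not the evaluation code used for the experiments; the function name and the example counts are hypothetical, and the chance agreement is computed as in the standard two-rater Cohen Kappa, with the ground truth and the prediction acting as the two raters.

```python
def binary_scores(tp, tn, fp, fn):
    """Precision, recall, F1, overall accuracy and Cohen Kappa from confusion counts."""
    n = tp + tn + fp + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    # Observed agreement, i.e. overall accuracy.
    p_o = (tp + tn) / n
    # Chance agreement: for each class, multiply how often the ground truth
    # and the prediction chose it, then normalise by N squared.
    p_e = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n ** 2
    kappa = (p_o - p_e) / (1 - p_e)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": p_o,
        "kappa": kappa,
    }


# Hypothetical pixel counts, chosen only to make the arithmetic easy to follow.
scores = binary_scores(tp=40, tn=45, fp=5, fn=10)
print(scores)
```

With these counts the accuracy is 0.85 while Kappa is 0.7, because half of the agreement would be expected by chance. On imagery where background pixels dominate, this is why overall accuracy tends to look much higher than the F1 and Kappa scores.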
Training settings: Like many other such challenges, the
Urban3D competition imposed no limitation on model size
or the number of models that contribute to a prediction. This
is not in itself a bad thing; however, in our view it
is not representative of the performance of individual models.
Furthermore, it can lead to a pay-to-win type scenario where
the team that has access to the more powerful infrastructure
can train a variety of predictive models and combine them into an
ensemble prediction.
Other solutions used a variety of training methodologies.
For example, a model is trained with a certain learning rate for a number
of epochs, after which the resulting model is retrained with
a different learning rate. Another variation on this training
method is that, instead of using the more classic gradual
decay of the learning rate, it is made to fluctuate from a
high value to a low value and then back for a fixed number
of epochs.
In our experiments we wanted to highlight the performance
of individual models as is, without relying on stacking,
ensembling, transfer learning or retraining of models. The
experiments have been implemented using the Keras¹² deep
learning library, using the TensorFlow¹³ backend running on
NVidia Tesla V100-SXM2 (16GB RAM) hosted on IBM
Power System AC922.
The training method chosen is Adam [23]. It is a method
that uses an adaptive learning rate. It stores an exponentially
decaying average of past squared gradients $v_t$ as well as an
exponentially decaying average of past gradients $m_t$, which is
similar to momentum (i.e. estimates of the second and first moments of the gradient $g_t$).
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t \tag{6}$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \tag{7}$$
We can see from equations 6 and 7 that both $m_t$ and $v_t$ can
be biased towards 0 when the decay rates $\beta_1$ and $\beta_2$ are close
to 1. The bias-corrected first and second moment estimates are:
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t} \tag{8}$$
$$\hat{v}_t = \frac{v_t}{1 - \beta_2^t} \tag{9}$$
The bias-corrected estimates from equations 8 and 9 are
then used to update the parameters $\Theta$ as follows:
$$\Theta_{t+1} = \Theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t \tag{10}$$
In [23] the authors propose the default values of 0.9 for $\beta_1$,
0.999 for $\beta_2$ and $10^{-8}$ for $\epsilon$. We used a starting learning rate
$\eta$ of 0.0001, leaving all of the other parameters unchanged.
In order to facilitate the convergence of the models we also
utilized a method that reduces the learning rate. It checks after
each epoch if the validation loss has decreased. If it finds
that after 3 epochs the loss stagnates or is increasing, it will
decrease the learning rate by a factor of 0.1 from the current
value.
To prevent overfitting we used early stopping based on
the validation loss. Training was stopped when the model
validation loss had not decreased for 10 consecutive epochs.
Considering the size of the input images (2048 × 2048) and
the memory limitations of the accelerator hardware, we are
using a sliding window mechanism (as depicted in Figure 2)
where each of the input scenes is divided into multiple, partially
overlapping, tiles of reduced size (256 × 256). We chose to
overlap the tiles (using a fixed stride size) in an attempt to
compensate for the dimensionality reduction induced by splitting
the input.
12 https://keras.io/
13 https://www.tensorflow.org
Fig. 1. Urban3D Sample of RGB image and its associated mask
Fig. 2. Scene Tiling Example (diagram labels: scene width, scene height, window width, window height, stride)
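The sliding window tiling can be sketched in a few lines of Python. The window size of 256 comes from the text; the stride of 128 and the function names are illustrative assumptions, as is the choice to add a final window flush with the scene border when the stride does not land on it.

```python
def tile_origins(scene_size, window, stride):
    """Start offsets of the windows along one axis, covering the whole axis."""
    if window > scene_size:
        raise ValueError("window must not be larger than the scene")
    origins = list(range(0, scene_size - window + 1, stride))
    # Add one last window aligned with the border if the stride skipped it.
    if origins[-1] != scene_size - window:
        origins.append(scene_size - window)
    return origins


def tile_scene(height, width, window=256, stride=128):
    """Bounding boxes (top, left, bottom, right) of partially overlapping tiles."""
    rows = tile_origins(height, window, stride)
    cols = tile_origins(width, window, stride)
    return [(r, c, r + window, c + window) for r in rows for c in cols]


tiles = tile_scene(2048, 2048)
print(len(tiles), tiles[0], tiles[-1])
```

With the assumed stride of half the window size, neighbouring tiles overlap by 50% and a 2048 × 2048 scene yields 15 window positions per axis. A smaller stride increases the overlap and the number of tiles.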
For validation purposes the best performing models are
trained again on the ISPRS Potsdam dataset, presented in
Section III, using the same training methodology; in particular,
we use the same window size and stride. This measure was
taken as a further test of topology performance on an
additional benchmark dataset.
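The training methodology referred to here (Adam, learning rate reduction when the validation loss stops improving, and early stopping) can be illustrated with a scalar sketch in plain Python. It is not the Keras code used for the experiments: the function and class names are hypothetical, and the exact bookkeeping of the two patience counters is an assumption.

```python
import math


def adam_step(theta, grad, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad        # decaying average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # decaying average of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction, first moment
    v_hat = v / (1 - beta2 ** t)              # bias correction, second moment
    theta = theta - lr / (math.sqrt(v_hat) + eps) * m_hat
    return theta, m, v


class PlateauMonitor:
    """Tracks the validation loss, reduces the learning rate and signals early stopping."""

    def __init__(self, lr=1e-4, factor=0.1, lr_patience=3, stop_patience=10):
        self.lr = lr
        self.factor = factor
        self.lr_patience = lr_patience
        self.stop_patience = stop_patience
        self.best = float("inf")
        self.stale_lr = 0    # epochs without improvement since the last reduction
        self.stale_stop = 0  # epochs without improvement since the best loss

    def update(self, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.stale_lr = 0
            self.stale_stop = 0
            return False
        self.stale_lr += 1
        self.stale_stop += 1
        if self.stale_lr >= self.lr_patience:
            self.lr *= self.factor
            self.stale_lr = 0
        return self.stale_stop >= self.stop_patience


# First Adam step on a parameter of 1.0 with gradient 2.0.
theta, m, v = adam_step(theta=1.0, grad=2.0, m=0.0, v=0.0, t=1, lr=1e-4)
print(theta)
```

On the first step the bias correction cancels the decay factors, so the parameter moves by almost exactly the learning rate regardless of the gradient magnitude. Calling `PlateauMonitor.update` once per epoch with the validation loss gives the current learning rate in `monitor.lr` and a stop signal after ten epochs without improvement.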
It is important to mention that in our setup we disregard the
fact that the two datasets have different spatial resolutions, a
situation that could be partially mitigated by data augmentation
(random re-sampling at training time).
For our experiments we split the training data into training
and validation sets. The validation dataset represents 20% of
the total available number of tiles. As a holdout set we use
the entirety of the original scoring dataset from the Urban3D
challenge, totaling 62 scenes. This was done to gauge the out
of sample performance of the trained models.
As stated before, our aim is to test single model predictive
performance on RGB images only. For this we have selected a
representative set of the deep learning network topologies currently in use. We investigated different skip connections
for HSN (addition and concatenation) and the placement of activation functions relative to convolution and batch normalization
layers in the network topologies. We also tested different
activation functions inside the same topology, combining the
ReLU (eq 11) and ELU (eq 12) activations.
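$$f(x) = \begin{cases} 0 & \text{for } x < 0 \\ x & \text{for } x \ge 0 \end{cases} \tag{11}$$
$$f(x) = \begin{cases} \alpha (e^{x} - 1) & \text{for } x < 0 \\ x & \text{for } x \ge 0 \end{cases} \tag{12}$$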
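The two activation functions are easy to compare numerically. The sketch below is plain Python for scalar inputs; the default of α = 1 is an assumption, since the text does not state the value used.

```python
import math


def relu(x):
    """ReLU: zero for negative inputs, identity otherwise."""
    return x if x >= 0 else 0.0


def elu(x, alpha=1.0):
    """ELU: smooth negative branch that saturates at -alpha, identity otherwise."""
    return x if x >= 0 else alpha * (math.exp(x) - 1.0)


for value in (-3.0, -1.0, 0.0, 2.0):
    print(value, relu(value), elu(value))
```

Both functions agree for non-negative inputs. For negative inputs ReLU outputs zero, while ELU keeps a small negative response that approaches −α, so its outputs can average closer to zero.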
The final stage of our experiments was the cross validation
of the best performing models obtained. This was done on both
the Urban3D and ISPRS Potsdam datasets.
We selected them based on their validation F1 score. We
used 5 random folds for each model, aiming to check the
predictive performance on the different training and validation
splits. In the following paragraphs we detail each of the model
topologies used.
Validation
In order to ensure that the best performing models selected
thus far have good out of sample performance we ran 5
random sub-sampling validation experiments (denoted s1, s2,
s3, s4, s5), in each of which we randomly selected 20% of the training
dataset as validation data. This allows us to check for bias in the
original training/validation split. We have chosen this type of
validation because, unlike in K-Fold cross validation, the
proportion of the training/validation split does not depend on
the number of iterations (folds).
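Random sub-sampling validation can be sketched as follows. This is an illustrative plain Python version operating on tile indices; the function name, the seed and the tile count in the example are assumptions, while the 5 repetitions and the 20% validation share come from the text.

```python
import random


def random_subsampling_splits(n_tiles, n_repeats=5, val_fraction=0.2, seed=0):
    """Draw independent train/validation splits with a fixed validation share."""
    rng = random.Random(seed)
    n_val = int(round(n_tiles * val_fraction))
    splits = []
    for _ in range(n_repeats):
        # Each repetition samples its validation tiles afresh, so validation
        # sets of different repetitions may overlap.
        val = set(rng.sample(range(n_tiles), n_val))
        train = [i for i in range(n_tiles) if i not in val]
        splits.append((train, sorted(val)))
    return splits


splits = random_subsampling_splits(n_tiles=100)
for name, (train, val) in zip(("s1", "s2", "s3", "s4", "s5"), splits):
    print(name, len(train), len(val))
```

The validation share stays at 20% for any number of repetitions. In K-Fold cross validation the share is tied to the number of folds (1/K), so changing the number of runs would also change the size of each validation set.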
IV. RESULTS
The following section contains an in-depth analysis of the
topological modifications made to the selected models as well
as our experimental results.
A. Concatenation and addition in HSN
U-Net and W-Net are designed to fuse features extracted by
the encoder blocks into the decoder parts of
the networks. Both in [13] and in [24] the authors analyze
the usage of skip connections with concatenation for feature
fusion, as opposed to the common addition skip connection, and
they notice an improvement in their model performances. In
the originally proposed topology, the authors of [20] use the common (addition) skip connection to
transfer information from lower layers to upper layers. In this
paper, we investigated this approach on the HSN network,
changing the fusion method from addition to concatenation.
The HSN model with concatenation (hsnv1) has proven to
be more stable during the training process compared to the
original HSN model (hsnv2), with addition as feature fusion
method, as depicted in Figure 3a (where the blue crosses
represent the outliers, the orange line the median value and the
box itself 95% of the values). In Figure 3b we plot the best
epoch of both HSN models, using concatenation or addition
as fusion method, considering the validation F1 score.
For the majority of the samples, the model using concatenation (hsnv1 Concat) as feature fusion method reaches the best
F1 score earlier, potentially starting to overfit the training data.
Interestingly, hsnv1 Concat also obtains higher F1, Kappa
and Overall Accuracy scores on the holdout set, as seen in
Table I. From Figure 3c we can see that hsnv1 Concat obtains
a higher maximum and minimum score and obtains more
consistent results than hsnv2 Add.
For the rest of the presented experiments, we selected the
HSN model with concatenation skip connections, as it exhibits
higher scores on the holdout set than the originally proposed
topology.

TABLE I
HSN MODEL F1-SCORE RESULTS
                   F-1                       Kappa Score               Overall accuracy
Model              Min     Max     Avg       Min     Max     Avg       Min     Max     Avg
hsnv1 Concat       0.8191  0.8350  0.8265    0.7929  0.8102  0.8009    0.9544  0.9573  0.9558
hsnv2 Addition     0.8090  0.8289  0.8222    0.7819  0.8037  0.7963    0.9525  0.9564  0.9550

Fig. 3. HSN models with different fusion methods: (a) Training (validation F1 score per model and split, s1 to s5); (b) Best epoch during training; (c) F1 score on holdout set. Legend: hsnv1 Concat (best F1 score 0.9578), hsnv2 Add (best F1 score 0.9563).
B. Activation and Normalization
The authors in [22] investigate the impact of Batch Normalization layers inserted before or after the activation layer on
two different architectures. Since their results are inconclusive,
we compare the impact of Batch Normalization positioning
relative to the activation layer and adapt our models to the
results.
1) U-Net: The original U-Net topology does not include a
Batch Normalization (BN) layer after the convolutional layers;
still, BN is widely used in the available U-Net implementations.
Fig. 4. U-Net models with different positioning of the BN layer: (a) Training (validation F1 score per model and split, s1 to s5); (b) Best epoch during training; (c) F1 score on holdout set. Legend: unetv1 Conv+ReLU+BN+Conv (best F1 score 0.9520), unetv2 Conv+BN+ReLU+Conv (best F1 score 0.9543).
In Figure 4 we present results of our experiments with two
different positions for the BN layer, before or after the ReLU
activation function (non-linearity). In the first configuration,
we apply the BN layer after the non-linearity, and in the
second we apply it before.
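The effect of the two orderings on the values passed to the next convolution can be seen on a toy example. This is a minimal plain Python sketch, assuming batch normalization without learned scale and shift applied to a flat list of values; the function names and the input values are hypothetical stand-ins for convolution outputs.

```python
import math


def batch_norm(values, eps=1e-5):
    """Normalise a batch of values to zero mean and (almost) unit variance."""
    mean = sum(values) / len(values)
    var = sum((x - mean) ** 2 for x in values) / len(values)
    return [(x - mean) / math.sqrt(var + eps) for x in values]


def relu(values):
    """Apply ReLU to every value in the batch."""
    return [max(0.0, x) for x in values]


conv_out = [-2.0, -1.0, 0.0, 1.0, 2.0]  # stand-in for convolution outputs

# First configuration: BN after the non-linearity.
bn_after_activation = batch_norm(relu(conv_out))
# Second configuration: BN before the non-linearity.
bn_before_activation = relu(batch_norm(conv_out))

print(bn_after_activation)
print(bn_before_activation)
```

With BN after the activation, the next layer receives zero-mean values that can be negative. With BN before the activation, the next layer receives non-negative values whose mean is above zero. The sketch only illustrates this difference in statistics, not its effect on accuracy.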
In Figure 4a, the U-Net which positions the BN layer before
the activation, unetv2, has a better score during training and a
more consistent score on almost all training/validation splits
used.
When applying our models on our holdout set (see Figure
4c), we can see that BN before activation obtained a higher
maximum and minimum score while the other configuration
obtained a more consistent score. Table II shows the F1, Kappa
and Overall Accuracy scores on the holdout set.
In Figure 4b we can see that on most of the samples used
for training, the models which place BN after the activation
layer reach their best score on a much earlier training epoch,
essentially being faster when it comes to obtaining their
maximum training score. It can be argued that this can mean
that they also overfit much earlier, however the difference
between the best performing models appears to be small. This
difference in convergence rates requires further research.
TABLE II
U-NET MODEL F1-SCORE RESULTS
                               F-1                       Kappa Score               Overall accuracy
Model                          Min     Max     Avg       Min     Max     Avg       Min     Max     Avg
unetv1 Conv+ReLU+BN+Conv       0.8126  0.8202  0.8178    0.7853  0.7942  0.7914    0.9526  0.9548  0.9541
unetv2 Conv+BN+ReLU+Conv       0.8145  0.8211  0.8176    0.7877  0.7952  0.7912    0.9533  0.9549  0.9541
unetv3 Conv+ELU+Conv           0.8103  0.8262  0.8220    0.7823  0.8004  0.7955    0.9513  0.9554  0.9542
unetv4 Conv+ELU+BN+Conv        0.8016  0.8211  0.8134    0.7741  0.7946  0.7866    0.9515  0.9544  0.9533
In the original U-Net, the authors propose using ReLU as
the non-linearity function for the network. We have investigated
the placement of the BN layer for the U-Net topology and chose
BN before the non-linearity for our next experiments (as seen in
unetv2) as it provides an increase in the predictive performance
of the model and is more stable during training. In Figure 5 we
plot the results obtained with U-Net with ELU activation functions. Authors of [22] suggest not using a BN layer together
with an ELU activation and we investigate this approach on
U-Net topology. As seen in Figures 5a and 5c, unetv3 is more
consistent during the training process on almost all training
validation splits and during the testing phase. In addition to
this, in Figure 5b, unetv3 reaches a better performance in
an earlier epoch than the U-net version with BN and ELU
(unetv4). Compared to the results obtained when using ReLU
activation function, we notice an improvement in the F1,
Kappa and Overall Accuracy scores on unetv3, as depicted
in Table II.
2) W-Net: W-Net is composed of two bridged U-Nets,
having a combination of ReLU and ELU activation functions,
and proposes a BN layer before the non-linearity (wnetv2). It
follows the same downsampling and upsampling factors as
U-Net. As seen in the validation F1 score statistics from Figure
6a and the results on the holdout set, depicted in Figure 6c,
using the BN layer after the non-linearity
(wnetv1) leads to an improvement in both the validation and
testing F1 scores, resulting also in a much more stable training
process.
Fig. 5. U-Net models with ELU activation function: (a) Training (validation F1 score per model and split, s1 to s5); (b) Best epoch during training; (c) F1 score on holdout set. Legend: unetv3 Conv+ELU+Conv (best F1 score 0.9543), unetv4 with ELU and BN (best F1 score 0.9515).
Fig. 6. W-Net models with different positioning of the BN layer: (a) Training (validation F1 score per model and split, s1 to s5); (b) Best epoch during training; (c) F1 score on holdout set. Models: wnetv1 Conv+ELU/ReLU+BN+Conv, wnetv2 Conv+BN+ELU/ReLU+Conv; the legend reports best F1 scores of 0.9531 and 0.9561.
From Table III, we can also see higher F1, Kappa and
Overall Accuracy scores obtained by wnetv1 on the holdout set.
This is in contrast with the results obtained with U-Net,
where the experiments suggest using BN before the activation
layer. However, this might be due to the combination of the
ELU activation function and BN, a problem identified by [22].
Interestingly, we also observe in Figure 6b that the relative
positioning of the BN layer does not significantly impact
training convergence (the number of training epochs needed to reach
the maximum validation F1 score), in contrast to the other
analyzed architectures.
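The convergence measure used here, the number of training epochs needed to reach the maximum validation F1 score, can be sketched as follows. The function name `best_epoch` and the score history are hypothetical, and ties are resolved in favour of the earliest epoch, which is an assumption.

```python
def best_epoch(val_f1_per_epoch):
    """Return (epoch, score) for the first epoch reaching the maximum validation F1.

    Epochs are numbered from 1.
    """
    best_idx = max(range(len(val_f1_per_epoch)), key=lambda i: val_f1_per_epoch[i])
    return best_idx + 1, val_f1_per_epoch[best_idx]

# Hypothetical validation F1 history for one training/validation split.
history = [0.80, 0.91, 0.94, 0.95, 0.95, 0.93]
epoch, score = best_epoch(history)
```

Computing this value once per split gives the kind of per-split comparison shown in the "Best epoch during training" panels.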
TABLE III
W-NET MODEL F1-SCORE RESULTS
Model                       F-1                       Kappa Score               Overall accuracy
                            Min     Max     Avg       Min     Max     Avg       Min     Max     Avg
wnetv1 Cnv+ELU/ReLU+BN+Cnv  0.8217  0.8304  0.8261    0.7959  0.8055  0.8007    0.9553  0.9570  0.9560
wnetv2 Cnv+BN+ELU/ReLU+Cnv  0.8188  0.8272  0.8227    0.7928  0.8020  0.7969    0.9546  0.9564  0.9552
wnetv3 Cnv+ELU+Cnv          0.8252  0.8361  0.8324    0.7990  0.8115  0.8073    0.9547  0.9576  0.9567
wnetv4 Cnv+ReLU+BN+ELU      0.8264  0.8349  0.8295    0.8007  0.8099  0.8041    0.9554  0.9570  0.9561
wnetv5 Cnv+BN+ReLU+ELU      0.8235  0.8359  0.8311    0.7975  0.8112  0.8059    0.9549  0.9575  0.9565
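The Min, Max and Avg columns can be read as summary statistics over several trained models. A minimal sketch, assuming they aggregate the holdout scores of the models trained on the five splits (s1 to s5); the function name `summarize` and the score values are hypothetical and are not taken from the table.

```python
def summarize(scores):
    """Return (min, max, mean) of a list of per-split scores."""
    return min(scores), max(scores), sum(scores) / len(scores)

# Hypothetical holdout F1 scores for the five splits s1..s5 of one model.
f1_per_split = [0.82, 0.83, 0.81, 0.84, 0.80]
lowest, highest, average = summarize(f1_per_split)
```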
Figure 7 presents the results obtained by removing the
BN near the ELU activation layers in W-Net. Compared to
the originally proposed topology, we notice a performance
improvement when using ELU only, with no BN layer, as
in wnetv3 (Table III). In wnetv4 and wnetv5 we experiment
with topologies using a combination of ELU and ReLU, as
originally proposed. However, we do not use a BN layer
near ELU, but keep the BN near ReLU: after ReLU
(wnetv4) and before ReLU (wnetv5). As depicted in Figure 7b,
wnetv5 reaches its maximum validation
F1 score in fewer epochs. During testing, we notice an improvement from placing
BN before ReLU (Figure 7c), where wnetv5 is more stable
than wnetv4 and also has higher maximum and average F1,
Kappa and Overall Accuracy scores (Table III). In accord with
U-Net and the original topology, the better results are obtained by
placing BN before ReLU. Eliminating the BN layer near ELU also
improves the holdout scores of both wnetv4 and wnetv5 compared to wnetv1
and wnetv2.
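For reference, the three reported scores can be computed from binary pixel counts as sketched below. This assumes the standard definitions of F1, Cohen's kappa and overall accuracy with "building" as the positive class; the function name `segmentation_scores` and the counts are hypothetical.

```python
def segmentation_scores(tp, fp, fn, tn):
    """Compute F1, Cohen's kappa and overall accuracy from binary pixel counts."""
    total = tp + fp + fn + tn
    f1 = 2 * tp / (2 * tp + fp + fn)
    overall_accuracy = (tp + tn) / total
    # Agreement expected by chance, from the marginal totals of both classes.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    kappa = (overall_accuracy - expected) / (1 - expected)
    return f1, kappa, overall_accuracy

# Hypothetical pixel counts, with "building" as the positive class.
f1, kappa, oa = segmentation_scores(tp=40, fp=10, fn=10, tn=140)
```

When background pixels dominate, overall accuracy is driven largely by true negatives, which is one reason it can sit well above F1 and kappa for the same model.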
3) HSN: We analyze the impact of the position of the
BN layer in the best performing HSN model resulting from
the analysis in Section IV-A. Namely, we use the HSN with
concatenation for feature fusion, which may be seen as a U-Net model with inception and residual blocks. The computed
F1 score on the holdout set is shown in Table IV.
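The difference between the two feature fusion options, concatenation and addition, can be sketched with plain lists. Feature maps are represented here as lists of channels, each flattened to a short list of values; the helper names and the data are hypothetical, and a real implementation would operate on tensors.

```python
def fuse_add(a, b):
    """Element-wise addition: the channel count is unchanged and the inputs are mixed."""
    return [[x + y for x, y in zip(ch_a, ch_b)] for ch_a, ch_b in zip(a, b)]

def fuse_concat(a, b):
    """Channel-wise concatenation: both inputs are kept intact and the channels add up."""
    return a + b

# Hypothetical feature maps: two channels each, every channel flattened to two values.
encoder = [[1.0, 2.0], [3.0, 4.0]]
decoder = [[0.5, 0.5], [1.0, 1.0]]

added = fuse_add(encoder, decoder)
concatenated = fuse_concat(encoder, decoder)
```

After concatenation the original encoder channels can still be read back unchanged, at the cost of doubling the channel count seen by the next layer.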
Although positioning the BN before the non-linearity (hsnv3),
as in the originally proposed HSN topology, yields a higher
minimum F1 value, the model with BN after the activation
(hsnv1) obtains higher maximum
and mean values. This is not in accord with the results obtained
in our U-Net experiments.
Figure 8a shows that hsnv1 has a more consistent validation
score during training, as well as a better overall score during
testing (Figure 8c).
An interesting difference arises when we compare the
HSN best epochs in Figure 8b to those obtained by the
U-Net models in Figure 4b. In contrast to U-Net, our HSN
model that uses the BN layer after the activation layer requires
more epochs to reach its best score than the
model that uses the BN layer before the activation layer. The difference in
convergence speed between
these two topologies requires further research.
Fig. 7. W-Net models with ELU activation function: wnetv3 (Conv+ELU+Conv, best F1 score 0.9577), wnetv4 (Conv+ReLU+BN+ELU, best F1 score 0.9568) and wnetv5 (Conv+BN+ReLU+ELU, best F1 score 0.9570). Panels: (a) Training (validation F1 score per epoch, splits s1 to s5), (b) Best epoch during training, (c) F1 score on holdout set.
Fig. 8. HSN models with different positioning of the BN layer: hsnv1 (Conv+ReLU+BN+Conv, best F1 score 0.9578) and hsnv3 (Conv+BN+ReLU+Conv, best F1 score 0.9565). Panels: (a) Training (validation F1 score per epoch, splits s1 to s5), (b) Best epoch during training, (c) F1 score on holdout set.
TABLE IV
HSN MODEL F1-SCORE RESULTS
Model                       F-1                       Kappa Score               Overall accuracy
                            Min     Max     Avg       Min     Max     Avg       Min     Max     Avg
hsnv1 Cnv+ReLU+BN+Cnv       0.8191  0.8350  0.8265    0.7929  0.8102  0.8009    0.9544  0.9573  0.9558
hsnv3 Cnv+BN+ReLU+Cnv       0.8213  0.8314  0.8261    0.7954  0.8062  0.8006    0.9549  0.9565  0.9558
hsnv4 Cnv+ELU+Cnv           0.8275  0.8350  0.8323    0.8015  0.8102  0.8071    0.9551  0.9573  0.9566
hsnv5 Cnv+BN+ELU+Cnv        0.8191  0.8266  0.8239    0.7929  0.8012  0.7982    0.9544  0.9559  0.9553
Just as for the U-Net model, using the ELU activation function
without BN (hsnv4) brings stability during training (Figure
9a), and the networks reach their best score in an earlier epoch
(Figure 9b) than when combining BN and ELU with the BN placed
before the non-linearity, as proposed in the original HSN (hsnv5).
Also, as depicted in Figure 9c, hsnv4 has better results in the
testing phase. The ELU-only version of HSN also distinguishes itself
by yielding the best predictive performance, as can be seen in
the results presented in Table IV.
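As a reminder of how the two activation functions differ, the sketch below uses the standard definitions of ReLU and ELU (with alpha = 1). The input values are made up, and the snippet is only meant to show the shape of the functions, not to explain the scores above.

```python
import math

def relu(x):
    """Rectified linear unit: negative inputs are clipped to zero."""
    return max(0.0, x)

def elu(x, alpha=1.0):
    """Exponential linear unit: identity for x > 0, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

# Hypothetical pre-activation values, symmetric around zero.
inputs = [-2.0, -0.5, 0.0, 0.5, 2.0]
relu_out = [relu(x) for x in inputs]
elu_out = [elu(x) for x in inputs]
relu_mean = sum(relu_out) / len(relu_out)
elu_mean = sum(elu_out) / len(elu_out)
```

ELU keeps bounded negative outputs (never below -alpha), so on this input the mean of its outputs is closer to zero than for ReLU.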
TABLE V
SEGNET MODEL F1-SCORE RESULTS
Model                       F-1                       Kappa Score               Overall accuracy
                            Min     Max     Avg       Min     Max     Avg       Min     Max     Avg
segnetv1 Cnv+ReLU+BN+Cnv    0.8198  0.8273  0.8249    0.7941  0.8021  0.7996    0.9551  0.9565  0.9559
segnetv2 Cnv+BN+ReLU+Cnv    0.7957  0.8182  0.8069    0.7690  0.7924  0.7806    0.9518  0.9550  0.9534
segnetv3 Cnv+ELU+Cnv        0.0000  0.8265  0.3243    0.0000  0.8007  0.3136    0.8645  0.9555  0.8999
segnetv4 Cnv+ELU+BN+Cnv     0.8131  0.8249  0.8189    0.7869  0.7995  0.7932    0.9541  0.9559  0.9550
4) Segnet: For the following analysis we used the
Bayesian version of the Segnet topology, originally built
and tested for 224x224 pixel images. The authors propose a
Batch Normalization layer after every convolutional layer and
before the activation function (segnetv2 in our experiments).
However, as seen in Figures 10a and 10b, considering the
validation F1 score, positioning BN after the non-linearity (segnetv1
in our experiments) stabilizes the network, with the model
converging faster to its maximum score on all training/validation splits. As can
be seen in Table V and Figure 10c, segnetv1 also shows an
improvement in the holdout set scores.
Our next step is to examine the effects of the ELU activation
positioning with respect to the BN layer. The results of these
experiments were surprising. We can see from Figure 11a
that during training, segnetv3 (using only ELU activation)
has a markedly poor performance. The scores obtained during
training for segnetv4 (BN after ELU activation) are much more
consistent and, at the same time, higher than those of segnetv3.
This difference is also visible in the epoch at which the
best score is reached during training, as seen in Figure 11b.
segnetv4 consistently reaches its best score at
approximately the same epoch (24 to 30), while
for segnetv3 the best epoch fluctuates between a maximum of 73 and a minimum
of 1.
Fig. 9. HSN models with ELU activation function: hsnv4 (Conv+ELU+Conv, best F1 score 0.9575) and hsnv5 (Conv+BN+ELU+Conv, best F1 score 0.9562). Panels: (a) Training (validation F1 score per epoch, splits s1 to s5), (b) Best epoch during training, (c) Testing.
Fig. 10. Segnet models with different positioning of the BN layer: segnetv1 (Conv+ReLU+BN+Conv, best F1 score 0.9555) and segnetv2 (Conv+BN+ReLU+Conv, best F1 score 0.9498). Panels: (a) Training (validation F1 score per epoch, splits s1 to s5), (b) Best epoch during training, (c) F1 score on holdout set.
The most surprising result can be found in Figure 11c, which
shows the out of sample predictive performance. The version of
Segnet using only the ELU activation performs poorly,
to the point that some of the models obtained an F1
score of 0. We think that this is due to several factors. First,
Segnet is one of the largest topologies we have tested, having
in excess of 30 million trainable parameters. This, coupled with
the ELU activation, has likely resulted in an exploding gradient,
which is noticeably improved by the addition of a BN layer
after the ELU activation in segnetv4. We have also considered
the possibility that, in the case of segnetv3, our
randomly sampled training and validation sets might
be unrepresentative of the holdout set. To check
this hypothesis we generated new training/validation splits and
obtained similar results, with some of the models again yielding
an F1 score of 0. This has led us to strongly favor the
exploding gradient explanation as the most likely one. Further
experimentation is needed to pinpoint the exact
cause of this behavior.
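Re-drawing the training/validation splits, as done for the check described above, can be sketched as follows. The number of splits follows the s1 to s5 labels used in the figures; the function name `make_splits`, the validation fraction, the seed and the tile identifiers are all assumptions made for illustration.

```python
import random

def make_splits(sample_ids, n_splits=5, val_fraction=0.2, seed=0):
    """Draw n_splits independent random training/validation partitions."""
    rng = random.Random(seed)
    n_val = int(len(sample_ids) * val_fraction)
    splits = []
    for _ in range(n_splits):
        shuffled = list(sample_ids)
        rng.shuffle(shuffled)
        # First n_val shuffled ids form the validation set, the rest the training set.
        splits.append((shuffled[n_val:], shuffled[:n_val]))
    return splits

# Hypothetical tile identifiers; changing the seed yields a new set of splits.
splits = make_splits(range(100))
```

The holdout set is kept outside this procedure, so a failure that persists across newly drawn splits points away from an unlucky partition as the cause.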
Fig. 12. Best model comparison: F1 score on the Urban3D holdout set (models shown: unetv3, wnetv3, wnetv5, hsnv4, segnetv1).
Overall comparison
Finally, we select the best performing models on the Urban3D dataset and analyze how they perform on the ISPRS
Potsdam dataset, using the same training and testing methodology. The dataset contains multiple classes, out of which we
select only the building class in order to ensure compatibility
with the Urban3D dataset. Figure 12 and Figure 13 give
a performance overview of the selected models
on both datasets. Of all the models tested,
HSN and W-Net have the highest scores on the holdout sets.
Table VI and Table VII give a detailed breakdown
of the scores obtained.
We can see that the W-Net based topologies have a consistently superior score on both the Urban3D and ISPRS Potsdam
datasets, despite the differences in spatial resolution (0.5 m for
Urban3D and 5 cm for ISPRS Potsdam).
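Selecting only the building class from a multi-class label mask amounts to a per-pixel comparison, sketched below. The function name `to_binary_mask`, the class identifiers and the 3x3 tile are hypothetical; the actual encoding of the ISPRS Potsdam labels is not specified here.

```python
def to_binary_mask(label_mask, building_class):
    """Map a multi-class label mask to 1 (building) and 0 (everything else)."""
    return [[1 if px == building_class else 0 for px in row] for row in label_mask]

# Hypothetical 3x3 label tile; class id 1 standing for "building" is an assumption.
labels = [
    [0, 1, 1],
    [2, 1, 3],
    [0, 0, 4],
]
binary = to_binary_mask(labels, building_class=1)
```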
Fig. 11. Segnet models with ELU activation function: segnetv3 (Conv+ELU+Conv, best F1 score 0.9439) and segnetv4 (Conv+ELU+BN+Conv, best F1 score 0.9545). Panels: (a) Training (validation F1 score per epoch, splits s1 to s5), (b) Best epoch during training, (c) F1 score on holdout set.
Fig. 13. Best model comparison: F1 score on the Potsdam holdout set (models shown: unetv3, wnetv3, wnetv5, hsnv4, segnetv1).
TABLE VI
BEST MODEL F1-SCORE RESULT ON URBAN3D HOLDOUT SET
Model                       F-1                       Kappa Score               Overall accuracy
                            Min     Max     Avg       Min     Max     Avg       Min     Max     Avg
unetv3 Cnv+ELU+Cnv          0.8103  0.8262  0.8220    0.7823  0.8004  0.7955    0.9513  0.9554  0.9542
wnetv3 Cnv+ELU+Cnv          0.8252  0.8361  0.8324    0.7990  0.8115  0.8073    0.9547  0.9576  0.9567
wnetv5 Cnv+BN+ReLU+ELU      0.8235  0.8359  0.8311    0.7975  0.8112  0.8059    0.9549  0.9575  0.9565
hsnv4 Cnv+ELU+Cnv           0.8275  0.8350  0.8323    0.8015  0.8102  0.8071    0.9551  0.9573  0.9566
segnetv1 Cnv+ReLU+BN+Cnv    0.7957  0.8182  0.8069    0.7690  0.7924  0.7806    0.9518  0.9550  0.9534
TABLE VII
BEST MODEL F1-SCORE RESULT ON ISPRS POTSDAM BINARY BUILDING
Model                       F-1                       Kappa Score               Overall accuracy
                            Min     Max     Avg       Min     Max     Avg       Min     Max     Avg
unetv3 Cnv+ELU+Cnv          0.8849  0.8957  0.8892    0.8503  0.8634  0.8555    0.9475  0.9517  0.9491
wnetv3 Cnv+ELU+Cnv          0.9041  0.9197  0.9116    0.8746  0.8951  0.8844    0.9557  0.9630  0.9590
wnetv5 Cnv+BN+ReLU+ELU      0.8828  0.9121  0.8961    0.8486  0.8851  0.8651    0.9475  0.9595  0.9529
hsnv4 Cnv+ELU+Cnv           0.9009  0.9151  0.9062    0.8708  0.8888  0.8774    0.9545  0.9606  0.9566
segnetv1 Cnv+ReLU+BN+Cnv    0.7862  0.8801  0.8302    0.7345  0.8440  0.7861    0.9139  0.9452  0.9288
V. CONCLUSION AND FUTURE WORK
In this paper, we showed how different deep neural network
topologies perform for semantic image segmentation. We
focused on single model performance, proposing changes
to the network topologies regarding the placement of the
BN layer and alternating between ReLU and ELU activation
functions.
In our first experiments we showed that HSN
obtains better performance using concatenation instead of
addition for feature fusion. This difference in predictive performance was observable for both the training and holdout sets. The
most likely explanation is that concatenation preserves
the original form of the previous convolutional layer.
In our second set of experiments we investigated the impact
of different activation functions and BN layer positioning. For
U-Net we found that placing BN before the ReLU activation
yields better predictive performance; however, convergence
during training is faster if BN is placed after ReLU. The best
performance was attained by using ELU activation without
any BN layers (unetv3). This also substantially reduces the
memory requirements of U-Net. On both the Urban3D and Potsdam
holdout sets, unetv3 ranks 4th among our best performing
models.
W-Net was found to perform better with BN positioned
after the ReLU/ELU layer, for both training and holdout sets,
than in the initial version with BN before the non-linearity.
Furthermore, we found that removing the BN layer from the
vicinity of the ELU activation brings some
performance improvements. The best results were obtained
by using only ELU without any BN layer (wnetv3), just as
in the case of U-Net. However, alternating ReLU and ELU
activation functions (wnetv5) also achieves results close to those of the
ELU-only W-Net and the ELU-only HSN, but only when using BN before
ReLU and no BN near the ELU activation. From Figures 12 and
13, we can see that wnetv3 has the most consistent results on
both datasets.
The results obtained for HSN show that it performs better when BN is placed after the ReLU activation. It is important
to note that the performance of HSN using ELU activation is
similar to what we observed for U-Net and W-Net, with the best
performance obtained when using no BN near the ELU activation (hsnv4).
The ELU-only HSN ranks 2nd in our comparison
on both the Urban3D and Potsdam datasets, after wnetv3.
The Segnet experiments showed some performance impact
from the positioning of the BN layer relative to the
ReLU activation. The Segnet models with BN after the
ReLU layer (segnetv1) resulted in more consistent training scores
and a higher F1 score on the holdout set than the
original Segnet with BN before the non-linearity. The impact
of using ELU activation in the case of Segnet was quite
significant. Contrary to the topologies tested up to this point,
ELU activation alone resulted in significant performance degradation,
to the point that some models had an F1 score of 0.0. Adding
BN after the ELU activation resulted in a much more consistent
training score and overall predictive performance. Out of our
5 best models, segnetv1 has the lowest performance on the
Urban3D and ISPRS Potsdam holdout sets.
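The ranking statements in this section can be reproduced from the average F1 columns of Tables VI and VII, as sketched below. Using the average F1 score as the ranking criterion is an assumption; the values are copied from the two tables, and the function name `rank` is hypothetical.

```python
# Average holdout F1 scores copied from Table VI (Urban3D) and Table VII (ISPRS Potsdam).
avg_f1 = {
    "unetv3": {"Urban3D": 0.8220, "Potsdam": 0.8892},
    "wnetv3": {"Urban3D": 0.8324, "Potsdam": 0.9116},
    "wnetv5": {"Urban3D": 0.8311, "Potsdam": 0.8961},
    "hsnv4": {"Urban3D": 0.8323, "Potsdam": 0.9062},
    "segnetv1": {"Urban3D": 0.8069, "Potsdam": 0.8302},
}

def rank(dataset):
    """Return model names ordered from highest to lowest average F1 on a dataset."""
    return sorted(avg_f1, key=lambda model: avg_f1[model][dataset], reverse=True)

urban_rank = rank("Urban3D")
potsdam_rank = rank("Potsdam")
```

Under this criterion both datasets give the same order: wnetv3, hsnv4, wnetv5, unetv3, segnetv1.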
We provide a Keras based implementation (see footnote 14) of our best
performing model topologies. We have also published pretrained
weights for our models, to serve as initialization for further
training of deep learning networks applied to Remote Sensing
data.
Our experiments have shown that further research is required regarding the positioning of activation functions relative
to the BN layer and its impact not only on overall out of sample predictive performance but also on the training convergence
rate. As our experiments were conducted on heterogeneous
hardware using bare metal, AWS and Google Cloud resources,
we are unable to state with certainty the impact on per epoch
training time. We were only able to gauge
the number of training epochs it took for a model to reach its
maximum score during training.
An interesting observation is that topologies based around
skip connections and alternating activation functions obtained
some of the best results. Future research will focus on implementing an improved topology based on this observation.
ACKNOWLEDGMENT
This work was primarily supported by a grant of the Romanian Ministry of Education and Research, CNCS-UEFISCDI,
project number PN-III-P2-2.1-PED-2019-4878, within PNCDI
III. This work was partially supported by a grant of the Romanian Ministry of Education and Research, CNCS-UEFISCDI,
project number PN-III-P4-ID-PCE-2020-0407, within PNCDI
III.
REFERENCES
[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification
with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
[2] K. Simonyan and A. Zisserman, “Very deep convolutional networks for
large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[3] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” in Proceedings of the IEEE conference on computer vision
and pattern recognition, 2016, pp. 770–778.
[4] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan,
V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,”
in Proceedings of the IEEE conference on computer vision and pattern
recognition, 2015, pp. 1–9.
[5] X. X. Zhu, D. Tuia, L. Mou, G.-S. Xia, L. Zhang, F. Xu, and F. Fraundorfer, “Deep learning in remote sensing: a comprehensive review and list
of resources,” IEEE Geoscience and Remote Sensing Magazine, vol. 5,
no. 4, pp. 8–36, 2017.
14 Private Zenodo Dataset: https://zenodo.org/record/2611283#.XJ30EyB10s
[6] G. Christie, N. Fendley, J. Wilson, and R. Mukherjee, “Functional map
of the world,” in CVPR, 2018.
[7] I. Demir, K. Koperski, D. Lindenbaum, G. Pang, J. Huang, S. Basu,
F. Hughes, D. Tuia, and R. Raskar, “Deepglobe 2018: A challenge
to parse the earth through satellite images,” in The IEEE Conference
on Computer Vision and Pattern Recognition (CVPR) Workshops, June
2018.
[8] F. Rottensteiner, G. Sohn, J. Jung, M. Gerke, C. Baillard, S. Benitez,
and U. Breitkopf, “The isprs benchmark on urban object classification
and 3d building reconstruction,” ISPRS Ann. Photogramm. Remote Sens.
Spat. Inf. Sci, vol. 1, no. 3, pp. 293–298, 2012.
[9] H. Goldberg, M. Brown, and S. Wang, “A benchmark for building
footprint classification using orthorectified rgb imagery and digital
surface models from commercial satellites,” in Proceedings of IEEE
Applied Imagery Pattern Recognition Workshop 2017, 2017.
[10] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks
for semantic segmentation,” in Proceedings of the IEEE conference on
computer vision and pattern recognition, 2015, pp. 3431–3440.
[11] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks
for biomedical image segmentation,” in International Conference on
Medical image computing and computer-assisted intervention. Springer,
2015, pp. 234–241.
[12] V. Iglovikov, S. Mushinskiy, and V. Osin, “Satellite imagery feature detection using deep convolutional neural network: A kaggle competition,”
arXiv preprint arXiv:1706.06169, 2017.
[13] W. Chen, Y. Zhang, J. He, Y. Qiao, Y. Chen, H. Shi, and X. Tang, “W-net: Bridged u-net for 2d medical image segmentation,” arXiv preprint
arXiv:1807.04459, 2018.
[14] V. Badrinarayanan, A. Kendall, and R. Cipolla, “Segnet: A deep convolutional encoder-decoder architecture for image segmentation,” arXiv
preprint arXiv:1511.00561, 2015.
[15] A. Kendall, V. Badrinarayanan, and R. Cipolla, “Bayesian segnet: Model
uncertainty in deep convolutional encoder-decoder architectures for
scene understanding,” arXiv preprint arXiv:1511.02680, 2015.
[16] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp.
1929–1958, 2014.
[17] D. Marmanis, K. Schindler, J. D. Wegner, S. Galliani, M. Datcu, and
U. Stilla, “Classification with an edge: Improving semantic image segmentation with boundary detection,” ISPRS Journal of Photogrammetry
and Remote Sensing, vol. 135, pp. 158–172, 2018.
[18] S. Xie and Z. Tu, “Holistically-nested edge detection,” in Proceedings of
the IEEE international conference on computer vision, 2015, pp. 1395–
1403.
[19] D. Marmanis, J. D. Wegner, S. Galliani, K. Schindler, M. Datcu, and
U. Stilla, “Semantic segmentation of aerial images with an ensemble
of cnns,” ISPRS Annals of the Photogrammetry, Remote Sensing and
Spatial Information Sciences, vol. 3, p. 473, 2016.
[20] Y. Liu, D. Minh Nguyen, N. Deligiannis, W. Ding, and A. Munteanu,
“Hourglass-shape network based semantic segmentation for high resolution aerial imagery,” Remote Sensing, vol. 9, no. 6, p. 522, 2017.
[21] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep
network training by reducing internal covariate shift,” arXiv preprint
arXiv:1502.03167, 2015.
[22] D. Mishkin, N. Sergievskiy, and J. Matas, “Systematic evaluation of
convolution neural network advances on the imagenet,” Computer Vision
and Image Understanding, vol. 161, pp. 11–19, 2017.
[23] D. P. Kingma and J. Ba, “Adam: A method for stochastic
optimization,” CoRR, vol. abs/1412.6980, 2014. [Online]. Available:
http://arxiv.org/abs/1412.6980
[24] Z. Miao, K. Fu, H. Sun, X. Sun, and M. Yan, “Automatic water-body
segmentation from high-resolution satellite images via deep networks,”
IEEE Geoscience and Remote Sensing Letters, vol. 15, no. 4, pp. 602–
606, 2018.
Teodora Selea received her B.S. from the West
University of Timisoara, Romania, in 2015, and is
a 2nd year Ph.D. student at the West University
of Timisoara, Romania. Since 2015 she has
participated in multiple research projects in the areas
of distributed systems, machine learning and remote
sensing. Her research interests include cloud computing, machine learning, computer vision and remote
sensing analysis.
Gabriel Iuhasz received his Ph.D. degree in Machine Learning and Distributed systems from the
West University of Timisoara, Romania, in 2014.
Since 2017 he has been an Assistant Professor in the
Department of Computer Science, West University
of Timisoara. Since 2013 he has also been a researcher
at Institute e-Austria, Romania. He has been involved in
multiple research projects dealing with distributed
systems, machine learning and cloud computing.
Recently he has been focusing on research in the fields of
remote sensing, image processing and deep learning.
Marian Neagul received the Ph.D. degree in distributed systems from the West University of Timisoara,
Romania, in 2015. Since 2016 he has been an Assistant Professor with the Department of Computer Science,
West University of Timisoara. Since 2015 he has been a
Research Scientist at Institute e-Austria, Romania.
He has participated in multiple research projects in the
areas of distributed systems, machine learning and
remote sensing. His research interests include cloud
computing, machine learning, remote sensing image
processing, self-organizing systems and high performance computing.