Combining DeepLab and U-Net: A New Approach to Image Segmentation
Emanuel Schaerer, Kolja Diehl, Maximilian von Loesch
1. Abstract
We developed and evaluated a hybrid segmentation
model called DeepLabUnet, which combines U-Net and
DeepLabv3. After targeted hyperparameter optimization,
the model achieves an IoU of 76% on the public leaderboard
for detecting ETH Zurich-branded mugs.
2. Approach
To detect ETH Zurich-labeled mugs in the test dataset, we integrated two approaches: one inspired by a PyTorch tutorial on U-Net for image segmentation (Chandhok, 2021), and another based on Microsoft Research's work on deep residual learning (He et al., 2015). This fusion, which we term DeepLabUnet, combines the strengths of U-Net's architecture with deep residual learning.
(a) The model uses an encoder-decoder structure, with DeepLabv3 as the encoder to enhance edge feature extraction, and a decoder to restore feature details and improve segmentation accuracy.
(b) Architecture of the U-Net segmentation model, showing direct connections between encoder and decoder at each level.
Figure 1. Combined illustration of the DeepLab and U-Net architectures used in the hybrid segmentation model.
Our goal is to improve the Intersection-over-Union (IoU), the primary metric for our model's performance, through hyperparameter optimization. We also use Binary Cross-Entropy loss (BCELoss) as a secondary metric. The hyperparameters are described in the appendix (Table 1). IoU is defined as follows:
\[ \mathrm{IoU}(P, G) = \frac{|P \cap G|}{|P \cup G|} \]
where $|P \cap G|$ is the area of overlap between the predicted mask $P$ and the ground truth mask $G$, and $|P \cup G|$ denotes the total area covered by both masks. A higher IoU signifies better segmentation performance.
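The IoU definition can be made concrete with a minimal NumPy sketch. The function name `iou` and the convention that two empty masks score 1.0 are assumptions made for illustration; they are not taken from the project code.

```python
import numpy as np


def iou(pred_mask, gt_mask):
    """Intersection-over-Union of two binary masks of equal shape."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    intersection = np.logical_and(pred, gt).sum()  # |P ∩ G|
    union = np.logical_or(pred, gt).sum()          # |P ∪ G|
    # Assumption: two empty masks count as a perfect match.
    if union == 0:
        return 1.0
    return float(intersection / union)


# Toy 2x3 masks: 2 pixels overlap, 4 pixels are covered in total.
pred = np.array([[1, 1, 0], [0, 1, 0]])
gt = np.array([[1, 0, 0], [0, 1, 1]])
print(iou(pred, gt))  # 2 / 4 = 0.5
```

In the toy example the predicted and ground truth masks share two pixels and jointly cover four, so the IoU is 0.5.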
U-Net is widely used for tasks such as medical image segmentation, where its skip connections retain contextual information that is crucial for precise mask generation (Doğan et al., 2024).
DeepLab extends conventional convolutional networks with atrous (dilated) convolutions, which expand the receptive field. We hypothesize that combining U-Net and DeepLab will leverage precise localization and multi-scale capabilities, outperforming the individual models. This combination is explored further in Section 3.1.3.
DeepLab, known for atrous convolution and atrous spatial pyramid pooling (ASPP), excels in object localization within images (Wang et al., 2024).
DeepLab employs separable convolutions in its ASPP and decoder modules, enhancing segmentation and edge detail preservation (Figure 1a).
U-Net's encoder-decoder structure compresses input images to lower dimensions using techniques like max-pooling and convolution, training the model to recognize patterns and extract features. The decoder then restores the image to its original dimensions, with skip connections preserving spatial details that would otherwise be lost during down-sampling (Figure 1b).
The hybrid DeepLabUnet approach aims to boost segmentation precision and robustness.
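To illustrate atrous convolution and ASPP, the following is a simplified PyTorch sketch rather than the module used in the project. The class name `ASPP`, the helper `effective_kernel_size`, the channel counts and the input size are assumptions; the dilation rates 6, 12 and 18 are the ones named in Section 3.1.3. Batch normalization and activations are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def effective_kernel_size(kernel_size, dilation):
    """Input span covered along one axis by a dilated kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


class ASPP(nn.Module):
    """Simplified ASPP: parallel atrous branches plus image-level pooling."""

    def __init__(self, in_channels, out_channels, rates=(6, 12, 18)):
        super().__init__()
        branches = [nn.Conv2d(in_channels, out_channels, kernel_size=1)]
        for rate in rates:
            # padding = dilation keeps height and width unchanged for a 3x3 kernel.
            branches.append(
                nn.Conv2d(in_channels, out_channels, kernel_size=3,
                          padding=rate, dilation=rate)
            )
        self.branches = nn.ModuleList(branches)
        # Image-level pooling branch that summarizes global context.
        self.image_pool = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_channels, out_channels, kernel_size=1),
        )
        # 1x1 projection of the concatenated branch outputs.
        self.project = nn.Conv2d(out_channels * (len(rates) + 2),
                                 out_channels, kernel_size=1)

    def forward(self, x):
        height, width = x.shape[-2:]
        features = [branch(x) for branch in self.branches]
        pooled = F.interpolate(self.image_pool(x), size=(height, width),
                               mode="bilinear", align_corners=False)
        features.append(pooled)
        return self.project(torch.cat(features, dim=1))


x = torch.randn(2, 8, 32, 48)  # hypothetical feature map: batch, channels, height, width
aspp = ASPP(in_channels=8, out_channels=4)
y = aspp(x)
print(y.shape)  # torch.Size([2, 4, 32, 48])
print([effective_kernel_size(3, r) for r in (1, 6, 12, 18)])  # [3, 13, 25, 37]
```

Each 3x3 atrous branch has the same number of weights regardless of its dilation rate, while the span it covers grows from 3 pixels (rate 1) to 37 pixels (rate 18). This is the sense in which dilation enlarges the receptive field without adding parameters.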
SML Report Template
3. Experiments
3.1. Experimental Setup
3.1.1. TRAIN DEEPLABUNET.PY
At its core, this main function trains and optimizes the model's weights over several epochs. Within each epoch, training is executed in batches, which leads to more stable performance. To monitor overfitting, we hold out 20% of our data as a validation subset. We test different loss functions, such as BCEWithLogitsLoss (Torch-Contributors, 2025b). It takes the raw model outputs as an argument, applies a sigmoid function to convert them into probabilities, and computes the loss between those probabilities and the ground truth labels. Afterwards, the model's weights are updated. Based on the recommendation of Mistral AI (LeChat, 2025), we choose the Adam optimizer (Torch-Contributors, 2025a) with variable initial learning rates. Furthermore, we implement schedulers, such as PyTorch's ReduceLROnPlateau (Torch-Contributors, 2025c). It observes the loss over the training iterations. Whenever no significant improvement is observed, the scheduler reduces the learning rate so that training can keep improving once it stagnates. After each epoch, the model is validated and the resulting IoU and loss are plotted. The generated plots are used to analyze the impact of different hyperparameters, namely initial learning rate, training batch size, loss function, optimization function and learning rate scheduler. Each of these is varied four times, while all others are kept at their initial values (proposed by ChatGPT). The tested values are listed in the appendix (Table 2).
3.1.2. ETH MUGS DATASET.PY
We train our models using the provided ETH mugs dataset, consisting of RGB images and the corresponding masks that only show ETH-branded mugs. All images are resized to a uniform resolution of 252 × 378 pixels. The pixel values are normalized using PyTorch utilities (Torch-Contributors, 2025d). To improve both robustness and performance on unseen data, augmentation (flipped cutouts) is tested and compared to non-augmented data. This dataset class loads the ETH mug images, then resizes, scales and transforms them.
3.1.3. DEEPLABUNET.PY
The script implements the previously motivated DeepLabUnet combination.
The model's encoder, primarily based on the U-Net architecture, employs convolutional blocks to extract features at various abstraction levels. Each block consists of two convolutional layers, batch normalization, and ReLU activation, capturing detailed spatial information.
Integrated into the model is the ASPP module from DeepLab, designed to capture multi-scale information through parallel dilated convolutions with different dilation rates, such as 6, 12, and 18. This allows the model to capture context at multiple scales without significantly increasing the number of parameters. Additionally, the ASPP module includes an image-level pooling branch to capture global context, whose output is upsampled and concatenated with the outputs of the dilated convolutions.
The decoder combines multi-scale features from the ASPP module with high-resolution encoder features via skip connections, a key U-Net feature, enabling the model to use both global context and fine-grained spatial details. It employs up-convolutional layers followed by convolutional blocks to progressively refine the segmentation maps, reducing channel dimensions while increasing spatial resolution until it matches that of the input image.
3.2. Results
We evaluate our models on a validation set using the IoU as the primary metric. Table 3 summarizes the results for two configurations: the baseline U-Net and DeepLabUnet, the hybrid model combining predictions from both U-Net and DeepLab. We observe that DeepLabUnet outperforms the U-Net baseline (73%), yielding an IoU of 76%. With hyperparameter optimization, we find the best parameters to be a training batch size of 4, a learning rate of $4 \cdot 10^{-4}$, MultiLabelSoftMarginLoss as the loss function, Adam as the optimizer and ReduceLROnPlateau as the learning rate scheduler (Figure 13). The appendix provides plots for the evaluation of each of these hyperparameters. The worst performance is observed when using a very small learning rate of $4 \cdot 10^{-5}$ (Figure 5). Augmenting the training data does not improve the performance on the Kaggle leaderboard (Figures 18 to 20).
3.3. Conclusions
The experiments demonstrate that, within our tested setup, the custom DeepLabUnet is more effective in terms of IoU than a standalone U-Net architecture for the task of segmenting ETH Zurich mug logos. The improvement likely arises because the combination of U-Net and DeepLab provides precise localization via skip connections and multi-scale context via ASPP.
Through hyperparameter optimization, we find that our standard parameters with MultiLabelSoftMarginLoss as the loss function perform best in terms of public IoU. As the mug regions to detect exhibit significant scale variability, future work could investigate more sophisticated ensemble strategies such as confidence-weighted fusion. We expect that such approaches could reconcile inconsistent predictions across different scales and improve overall robustness.
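As one possible reading of confidence-weighted fusion, the sketch below combines per-pixel foreground probabilities from several models, weighting each by how far it lies from 0.5. This confidence definition, the function name `confidence_weighted_fusion` and the toy probability maps are assumptions made for illustration. No such fusion was implemented or evaluated in this work.

```python
import numpy as np


def confidence_weighted_fusion(prob_maps, eps=1e-7):
    """Fuse per-pixel foreground probabilities from several models.

    Assumption: confidence is the distance of a probability from 0.5,
    rescaled to [0, 1]. Other definitions are equally plausible.
    """
    probs = np.stack([np.asarray(p, dtype=float) for p in prob_maps])
    confidence = np.abs(probs - 0.5) * 2.0
    # eps avoids division by zero where every model outputs exactly 0.5.
    weights = confidence + eps
    return (weights * probs).sum(axis=0) / weights.sum(axis=0)


# Hypothetical outputs of two models on a 1x3 image.
model_a = np.array([[0.9, 0.6, 0.5]])
model_b = np.array([[0.9, 0.1, 0.2]])
fused = confidence_weighted_fusion([model_a, model_b])
mask = fused > 0.5
print(mask)  # [[ True False False]]
```

At the second pixel the two models disagree, and the more confident prediction (0.1) dominates the fused value. A plain average would give 0.35, while the weighted fusion gives about 0.2.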
Disclaimer
Our code is synchronized across different devices thanks to git. All files can be accessed via GitHub (von Loesch et al.). For this project, we use generative large language models for coding (LeChat, 2025) and translation (ChatGPT4-o, 2025) purposes.
References
Chandhok, S. U-Net: Training image segmentation models in PyTorch. PyImageSearch, 2021. URL https://pyimagesearch.com/2021/11/08/u-net-training-image-segmentation-models-in-pytorch/.
ChatGPT4-o, 2025. For coding and translation purposes, we use OpenAI's ChatGPT 4-o.
Doğan, K., Selçuk, T., and Alkan, A. An enhanced Mask R-CNN approach for pulmonary embolism detection and segmentation. Diagnostics, 14(11), 2024. ISSN 2075-4418. doi: 10.3390/diagnostics14111102. URL https://www.mdpi.com/2075-4418/14/11/1102. pp. 5.
He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. Microsoft Research, 2015. URL https://arxiv.org/pdf/1512.03385v1.
LeChat, M., 2025. For coding and translation purposes, we use Mistral's Le Chat.
Torch-Contributors. PyTorch - Adam, 2025a. URL https://github.com/pytorch/pytorch/blob/v2.7.0/torch/optim/adam.py#L33.
Torch-Contributors. PyTorch - BCEWithLogitsLoss, 2025b. URL https://docs.pytorch.org/docs/stable/generated/torch.nn.BCEWithLogitsLoss.html.
Torch-Contributors. PyTorch - Reduce Learning Rate on Plateau, 2025c. URL https://docs.pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.ReduceLROnPlateau.html.
Torch-Contributors. PyTorch - Transforming and augmenting images, 2025d. URL https://docs.pytorch.org/vision/stable/transforms.html.
von Loesch, M., Diehl, K., and Schaerer, E. URL https://github.com/Therealbrocoli/SML_Project_2. We synchronised our work using git. All code is accessible via the provided GitHub repository.
Wang, Y., Yang, L., Liu, X., et al. An improved semantic segmentation algorithm for high-resolution remote sensing images based on DeepLabv3+. Scientific Reports, 2024. URL https://www.nature.com/articles/s41598-024-60375-1#citeas.
4. Appendix
Table 1. Hyperparameters to optimize and their respective meaning.
Hyperparameter     Description
-----------------  ------------------------------------------------------------------
Learning rate      The step size at which the model's weights are updated.
Train batch size   Number of training samples processed per iteration.
Loss function      Measures how well the model's predictions match the target values.
Optim function     Optimization algorithm used to minimize the loss.
LR scheduler       Adjusts the learning rate dynamically based on training progress.
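The sketch below shows where each of the five hyperparameters in Table 1 enters a PyTorch training loop. It is an illustration under stated assumptions, not the project's training script. The random data, the one-layer stand-in model, the image size, the three epochs, and the scheduler's `factor` and `patience` are all hypothetical. The learning rate and batch size are the defaults named in the figure captions.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Hypothetical stand-ins: random images and masks, and a one-layer "model".
images = torch.randn(16, 3, 32, 48)
masks = (torch.rand(16, 1, 32, 48) > 0.5).float()
model = nn.Conv2d(3, 1, kernel_size=3, padding=1)

# The five hyperparameters of Table 1.
learning_rate = 4e-4
train_batch_size = 8
loss_fn = nn.BCEWithLogitsLoss()  # applies the sigmoid internally, expects raw logits
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=2  # assumed values
)

loader = DataLoader(TensorDataset(images, masks),
                    batch_size=train_batch_size, shuffle=True)

epoch_losses = []
for epoch in range(3):
    running_loss = 0.0
    for batch_images, batch_masks in loader:
        optimizer.zero_grad()
        logits = model(batch_images)
        loss = loss_fn(logits, batch_masks)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    epoch_loss = running_loss / len(loader)
    epoch_losses.append(epoch_loss)
    # The scheduler lowers the learning rate once the monitored loss plateaus.
    # Here the training loss is monitored; a validation loss could be used instead.
    scheduler.step(epoch_loss)

print(epoch_losses)
```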
Table 2. Hyperparameter performance against IoU. Best hyperparameters are marked bold.
Hyperparameter     Values
-----------------  ------------------------------------------------------------------
Learning rate      $4 \cdot 10^{-2}$, $4 \cdot 10^{-3}$, $4 \cdot 10^{-4}$, $4 \cdot 10^{-5}$
Train batch size   4, 8, 16, 32
Loss function      DiceLoss, BCEWithLogitsLoss, MultiLabelSoftMarginLoss
Optim function     Adam, AdamW
LR scheduler       CosineAnnealingWarmRestarts, OneCycleLR, ReduceLROnPlateau
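The one-at-a-time variation of these hyperparameters can be sketched in plain Python. The dictionary keys and the generator `one_at_a_time` are hypothetical names; the default values are those stated in the figure captions. Note that this is a simplification: according to Figures 14 to 17, optimizer and scheduler were varied in combination, not independently.

```python
# Defaults as stated in the captions of Figures 4, 7, 12 and 14.
defaults = {
    "learning_rate": 4e-4,
    "train_batch_size": 8,
    "loss_function": "DiceLoss+BCEWithLogitsLoss",
    "optim_function": "Adam",
    "lr_scheduler": "ReduceLROnPlateau",
}

# Values from Table 2.
search_space = {
    "learning_rate": [4e-2, 4e-3, 4e-4, 4e-5],
    "train_batch_size": [4, 8, 16, 32],
    "loss_function": ["DiceLoss", "BCEWithLogitsLoss", "MultiLabelSoftMarginLoss"],
    "optim_function": ["Adam", "AdamW"],
    "lr_scheduler": ["CosineAnnealingWarmRestarts", "OneCycleLR", "ReduceLROnPlateau"],
}


def one_at_a_time(defaults, search_space):
    """Yield the default config, then configs that change exactly one value."""
    yield dict(defaults)
    for name, values in search_space.items():
        for value in values:
            if value != defaults[name]:
                yield {**defaults, name: value}


configs = list(one_at_a_time(defaults, search_space))
print(len(configs))  # 13: the default plus 3 + 3 + 3 + 1 + 2 variations
```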
Table 3. Validation IoU scores for different model configurations.
Model        IoU   Comments
-----------  ----  ----------------------------------------
U-Net        73%   Baseline segmentation model
DeepLabUnet  76%   Multi-scale context enhances performance
(a) Left: loss vs. iteration. Right: IoU vs. epoch.
(b) Test prediction mask ID0000.
Figure 2. Training plots and results of the DeepLabUnet model with a learning rate of $4 \cdot 10^{-2}$ and a public score of 0.7352 on Kaggle.
(a) Left: loss vs. iteration. Right: IoU vs. epoch.
(b) Test prediction mask ID0000.
Figure 3. Training plots and results of the DeepLabUnet model with a learning rate of $4 \cdot 10^{-3}$ and a public score of 0.7556 on Kaggle.
(a) Left: loss vs. iteration. Right: IoU vs. epoch.
(b) Test prediction mask ID0000.
Figure 4. Training plots and results of the DeepLabUnet model with a learning rate of $4 \cdot 10^{-4}$, which is also the default rate. The resulting public score on Kaggle is 0.7460.
(a) Left: loss vs. iteration. Right: IoU vs. epoch.
(b) Test prediction mask ID0000.
Figure 5. Training plots and results of the DeepLabUnet model with a learning rate of $4 \cdot 10^{-5}$ and a public score of 0.5379 on Kaggle.
(a) Left: loss vs. iteration. Right: IoU vs. epoch.
(b) Test prediction mask ID0000.
Figure 6. Training plots and results of the DeepLabUnet model with a training batch size of 4 and a public score of 0.7579 on Kaggle.
(a) Left: loss vs. iteration. Right: IoU vs. epoch.
(b) Test prediction mask ID0000.
Figure 7. Training plots and results of the DeepLabUnet model with a training batch size of 8, which is also the default. The public score on Kaggle is 0.7352.
(a) Left: loss vs. iteration. Right: IoU vs. epoch.
(b) Test prediction mask ID0000.
Figure 8. Training plots and results of the DeepLabUnet model with a training batch size of 16 and a public score of 0.7127 on Kaggle.
(a) Left: loss vs. iteration. Right: IoU vs. epoch.
(b) Test prediction mask ID0000.
Figure 9. Training plots and results of the DeepLabUnet model with a training batch size of 32 and a public score of 0.7117 on Kaggle.
(a) Left: loss vs. iteration. Right: IoU vs. epoch.
(b) Test prediction mask ID0000.
Figure 10. Training plots and results of the DeepLabUnet model with the loss function BCEWithLogitsLoss from PyTorch. The resulting public score is 0.7623 on Kaggle.
(a) Left: loss vs. iteration. Right: IoU vs. epoch.
(b) Test prediction mask ID0000.
Figure 11. Training plots and results of the DeepLabUnet model with a self-written DiceLoss as the loss function. The resulting public score is 0.7440 on Kaggle.
(a) Left: loss vs. iteration. Right: IoU vs. epoch.
(b) Test prediction mask ID0000.
Figure 12. Training plots and results of the DeepLabUnet model. The training loss function for this configuration combines DiceLoss with BCEWithLogitsLoss, which is also the default, and results in a public score of 0.7460 on Kaggle.
(a) Left: loss vs. iteration. Right: IoU vs. epoch.
(b) Test prediction mask ID0000.
Figure 13. Training plots and results of the DeepLabUnet model with the loss function MultiLabelSoftMarginLoss from PyTorch. The resulting public score is 0.7646 on Kaggle.
(a) Left: loss vs. iteration. Right: IoU vs. epoch.
(b) Test prediction mask ID0000.
Figure 14. Training plots and results of the DeepLabUnet model with an Adam optimizer and ReduceLROnPlateau scheduler combination. This is also the default configuration; the resulting public score is 0.7460 on Kaggle.
(a) Left: loss vs. iteration. Right: IoU vs. epoch.
(b) Test prediction mask ID0000.
Figure 15. Training plots and results of the DeepLabUnet model with an AdamW optimizer and ReduceLROnPlateau scheduler combination. The resulting public score is 0.7410 on Kaggle.
(a) Left: loss vs. iteration. Right: IoU vs. epoch.
(b) Test prediction mask ID0000.
Figure 16. Training plots and results of the DeepLabUnet model with an AdamW optimizer and CosineAnnealingWarmRestarts scheduler combination. The resulting public score is 0.7310 on Kaggle.
(a) Left: loss vs. iteration. Right: IoU vs. epoch.
(b) Test prediction mask ID0000.
Figure 17. Training plots and results of the DeepLabUnet model with an AdamW optimizer and OneCycleLR scheduler combination. The resulting public score is 0.7395 on Kaggle.
(a) Left: loss vs. iteration. Right: IoU vs. epoch.
(b) Test prediction mask ID0000.
Figure 18. Training plots and results of the DeepLabUnet model with the combination of the best hyperparameters above. The Kaggle public score is 0.7550.
(a) Left: loss vs. iteration. Right: IoU vs. epoch.
(b) Test prediction mask ID0000.
Figure 19. Training plots and results of the DeepLabUnet model with the combination of the best hyperparameters above. Here the data is augmented by random cropping and flipping. The Kaggle public score is 0.6898.
(a) Left: loss vs. iteration. Right: IoU vs. epoch.
(b) Test prediction mask ID0000.
Figure 20. Training plots and results of the DeepLabUnet model with the combination of the above-mentioned best hyperparameters. The augmented data shown in Figure 19 is now combined with the original data. The public Kaggle score is 0.7317.