Inter-Level Feature Balanced Fusion Network for Street Scene Segmentation
Figure 1. <p>Visualization examples of features at different stages of the concatenation process on the Cityscapes dataset [<a href="#B19-sensors-21-07844" class="html-bibr">19</a>]. From left to right: input images, low-level features, deep-level features, and features after concatenation fusion.</p>
Figure 2. <p>The overall structure of IFBFNet. It has three parts: an encoder, a decoder, and a spatial stream. The encoder is composed of a backbone network and ASPP to extract rich high-level semantic information. In the decoder, we add an inter-level feature balanced fusion module into each skip-connection. The spatial stream supplements additional low-level spatial information.</p>
Figure 3. <p>The structure of the inter-level feature balanced fusion module. <span class="html-italic">©</span>, ⊙, and ⨁ stand for concatenation, multiplication, and element-wise addition, respectively. The fusion weights of high-level features <math display="inline"><semantics> <msub> <mi>W</mi> <mi>h</mi> </msub> </semantics></math> and low-level features <math display="inline"><semantics> <msub> <mi>W</mi> <mi>l</mi> </msub> </semantics></math> are calculated separately.</p>
Figure 4. <p>Structure diagram of the spatial stream; the right-hand side visualizes the feature map after each convolution layer. After three convolution layers, a quarter-size feature map containing spatial relations is obtained, as shown at the bottom of the figure.</p>
Figure 5. <p>Visualization results of inter-level feature balanced fusion compared with common fusion on the Cityscapes dataset. Each row shows an input image and the feature maps from two levels, followed by visualizations of a general fusion of the two hierarchical features and of the fusion result using the inter-level feature balanced fusion strategy.</p>
Figure 6. <p>Visualization results of IFBFNet based on the ResNet101 baseline on the Cityscapes validation set.</p>
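The balanced fusion described in the module caption above (concatenation to estimate weights, multiplication to re-weight each level, element-wise addition to fuse) can be illustrated with a minimal NumPy sketch. How the weights are derived here (global-average-pooled descriptors normalized with a softmax) is an assumption for illustration only; the paper's actual module learns its weights with convolutional layers.

```python
import numpy as np

def balanced_fusion(low, high):
    """Illustrative sketch of inter-level feature balanced fusion.

    low, high: feature maps of shape (C, H, W). The weight-derivation
    scheme below is a hypothetical stand-in for the learned module.
    """
    # Per-level scalar descriptors via global average pooling.
    d = np.array([low.mean(), high.mean()])

    # Softmax turns the descriptors into fusion weights that sum to 1,
    # so neither level can dominate the other unboundedly.
    e = np.exp(d - d.max())
    w_l, w_h = e / e.sum()

    # Multiplication re-weights each level; element-wise addition fuses.
    return w_l * low + w_h * high

low = np.full((4, 8, 8), 0.5)    # toy low-level features
high = np.full((4, 8, 8), 2.0)   # toy high-level features
fused = balanced_fusion(low, high)
print(fused.shape)               # (4, 8, 8)
```

Because the weights sum to one, the fused response always lies between the two levels' responses, which is the "balanced" property the module is designed to enforce.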
Abstract
1. Introduction
- An inter-level feature balanced fusion module was designed to solve the feature imbalance caused by traditional concatenation or element-wise addition, making fusion more balanced and feature utilization more effective.
- A shallow spatial stream with only three convolution layers was designed and added to the network; its output is fused with the main semantic features before the decoder outputs the prediction. This further enriches the spatial information.
- Our IFBFNet achieved a competitive performance of 81.2% mIoU on the Cityscapes dataset using only finely annotated data for training, significantly improving over baselines.
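The shallow spatial stream in the contributions above produces a quarter-resolution feature map after three convolution layers. The sketch below shows one way the arithmetic works out; the 3×3 kernels, padding of 1, and strides (2, 2, 1) are assumptions chosen so three layers yield a quarter-size map, not the paper's confirmed hyperparameters.

```python
import numpy as np

def conv2d(x, k, stride=1, pad=1):
    """Single-channel 2-D convolution with zero padding (no bias)."""
    x = np.pad(x, pad)
    kh, kw = k.shape
    h = (x.shape[0] - kh) // stride + 1
    w = (x.shape[1] - kw) // stride + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = (x[i*stride:i*stride+kh, j*stride:j*stride+kw] * k).sum()
    return out

img = np.random.rand(64, 64)   # toy single-channel input
k = np.ones((3, 3)) / 9.0      # toy 3x3 kernel (a box filter)

# Three convolution layers; strides 2, 2, 1 halve the resolution twice,
# leaving a quarter-size spatial map per side (64 -> 32 -> 16 -> 16).
f1 = conv2d(img, k, stride=2)
f2 = conv2d(f1, k, stride=2)
f3 = conv2d(f2, k, stride=1)
print(f3.shape)                # (16, 16)
```

Keeping the stream this shallow preserves fine spatial detail (edges, thin structures) that deeper, heavily downsampled semantic branches tend to lose.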
2. Related Work
2.1. Semantic Segmentation
2.2. Pyramid Module
2.3. Encoder–Decoder
2.4. Attention Mechanism
3. Approach
3.1. Our Encoder–Decoder
3.2. Inter-Level Feature Balanced Fusion Module
3.3. Spatial Stream
3.4. Loss Function
4. Experiment
4.1. Implementation Details
4.2. Experimental Results
4.3. Ablation Study
4.3.1. Baseline Network
4.3.2. Ablation Study for Inter-Level Feature Balanced Fusion Module
4.3.3. Ablation Study for Spatial Stream
4.3.4. Ablation Study for Improvement Strategies
4.4. Visualization of Inter-Level Feature Balanced Fusion Module
4.5. Comparing with the State-of-the-Art
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef] [PubMed]
- Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
- Pohlen, T.; Hermans, A.; Mathias, M.; Leibe, B. Full-Resolution Residual Networks for Semantic Segmentation in Street Scenes. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
- Wang, K.; Yan, F.; Zou, B.; Tang, L.; Lv, C. Occlusion-Free Road Segmentation Leveraging Semantics for Autonomous Vehicles. Sensors 2019, 19, 4711. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Palafox, P.R.; Betz, J.; Nobis, F.; Riedl, K.; Lienkamp, M. SemanticDepth: Fusing Semantic Segmentation and Monocular Depth Estimation for Enabling Autonomous Driving in Roads without Lane Lines. Sensors 2019, 19, 3224. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Syrris, V.; Pesek, O.; Soille, P. SatImNet: Structured and Harmonised Training Data for Enhanced Satellite Imagery Classification. Remote Sens. 2020, 12, 3358. [Google Scholar] [CrossRef]
- Zhang, M.; Jing, W.; Lin, J.; Fang, N.; Damaševičius, R. NAS-HRIS: Automatic Design and Architecture Search of Neural Network for Semantic Segmentation in Remote Sensing Images. Sensors 2020, 20, 5292. [Google Scholar] [CrossRef] [PubMed]
- Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015; Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F., Eds.; Springer International Publishing: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
- Zhang, Z.; Fu, H.; Dai, H.; Shen, J.; Shao, L. ET-Net: A Generic Edge-aTtention Guidance Network for Medical Image Segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Cham, Switzerland, 2019. [Google Scholar]
- Qi, K.; Yang, H.; Li, C.; Liu, Z.; Wang, M.; Liu, Q.; Wang, S. X-Net: Brain Stroke Lesion Segmentation Based on Depthwise Separable Convolution and Long-range Dependencies. In Medical Image Computing and Computer Assisted Intervention—MICCAI 2019; Springer: Cham, Switzerland, 2019. [Google Scholar]
- Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015. [Google Scholar]
- Lin, G.; Shen, C.; van den Hengel, A.; Reid, I. Efficient Piecewise Training of Deep Structured Models for Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
- Zhang, H.; Dana, K.; Shi, J.; Zhang, Z.; Wang, X.; Tyagi, A.; Agrawal, A. Context Encoding for Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
- Yang, M.; Yu, K.; Zhang, C.; Li, Z.; Yang, K. DenseASPP for Semantic Segmentation in Street Scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
- Lin, G.; Milan, A.; Shen, C.; Reid, I. RefineNet: Multi-Path Refinement Networks for High-Resolution Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
- Chen, L.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv 2017, arXiv:1706.05587. [Google Scholar]
- Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
- Oktay, O.; Schlemper, J.; Folgoc, L.L.; Lee, M.C.H.; Heinrich, M.P.; Misawa, K.; Mori, K.; McDonagh, S.G.; Hammerla, N.Y.; Kainz, B.; et al. Attention U-Net: Learning Where to Look for the Pancreas. arXiv 2018, arXiv:1804.03999. [Google Scholar]
- Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The Cityscapes Dataset for Semantic Urban Scene Understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
- Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning representations by back-propagating errors. Nature 1986, 323, 533–536. [Google Scholar] [CrossRef]
- Cheng, H.K.; Chung, J.; Tai, Y.W.; Tang, C.K. CascadePSP: Toward Class-Agnostic and Very High-Resolution Segmentation via Global and Local Refinement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020. [Google Scholar]
- Yu, F.; Koltun, V. Multi-Scale Context Aggregation by Dilated Convolutions. In Proceedings of the 4th International Conference on Learning Representations (ICLR), San Juan, Puerto Rico, 2–4 May 2016. [Google Scholar]
- Liu, W.; Rabinovich, A.; Berg, A.C. ParseNet: Looking Wider to See Better. arXiv 2015, arXiv:1506.04579. [Google Scholar]
- Peng, C.; Zhang, X.; Yu, G.; Luo, G.; Sun, J. Large Kernel Matters—Improve Semantic Segmentation by Global Convolutional Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
- Chen, L.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848. [Google Scholar] [CrossRef] [PubMed]
- Ding, H.; Jiang, X.; Shuai, B.; Liu, A.Q.; Wang, G. Context Contrasted Feature and Gated Multi-Scale Aggregation for Scene Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
- Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
- Zhang, J.; Lv, X.; Zhang, H.; Liu, B. AResU-Net: Attention Residual U-Net for Brain Tumor Segmentation. Symmetry 2020, 12, 721. [Google Scholar] [CrossRef]
- Drozdzal, M.; Vorontsov, E.; Chartrand, G.; Kadoury, S.; Pal, C. The Importance of Skip Connections in Biomedical Image Segmentation. In Deep Learning and Data Labeling for Medical Applications; Carneiro, G., Mateus, D., Peter, L., Bradley, A., Tavares, J.M.R.S., Belagiannis, V., Papa, J.P., Nascimento, J.C., Loog, M., Lu, Z., et al., Eds.; Springer International Publishing: Cham, Switzerland, 2016; pp. 179–187. [Google Scholar]
- Sun, K.; Xiao, B.; Liu, D.; Wang, J. Deep High-Resolution Representation Learning for Human Pose Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
- Zhou, C.; Wu, M.; Lam, S.-K. SSA-CNN: Semantic Self-Attention CNN for Pedestrian Detection. arXiv 2019, arXiv:1902.09080. [Google Scholar]
- Yuan, Y.; Wang, J. OCNet: Object Context Network for Scene Parsing. arXiv 2018, arXiv:1809.00916. [Google Scholar]
- Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual Attention Network for Scene Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
- Chen, L.C.; Yang, Y.; Wang, J.; Xu, W.; Yuille, A.L. Attention to Scale: Scale-Aware Semantic Image Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
- Zhang, F.; Chen, Y.; Li, Z.; Hong, Z.; Liu, J.; Ma, F.; Han, J.; Ding, E. ACFNet: Attentional Class Feature Network for Semantic Segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019. [Google Scholar]
- Wu, Z.; Shen, C.; van den Hengel, A. High-performance Semantic Segmentation Using Very Deep Fully Convolutional Networks. arXiv 2016, arXiv:1604.04339. [Google Scholar]
- Shrivastava, A.; Gupta, A.; Girshick, R. Training Region-based Object Detectors with Online Hard Example Mining. In Proceedings of the IEEE Conference on Computer Vision & Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 761–769. [Google Scholar]
- Robbins, H.; Monro, S. A Stochastic Approximation Method. Ann. Math. Stat. 1951, 22, 400–407. [Google Scholar] [CrossRef]
- Yu, C.; Wang, J.; Peng, C.; Gao, C.; Yu, G.; Sang, N. BiSeNet: Bilateral Segmentation Network for Real-time Semantic Segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
- Bulò, S.R.; Porzi, L.; Kontschieder, P. In-Place Activated BatchNorm for Memory-Optimized Training of DNNs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
- Ghiasi, G.; Fowlkes, C.C. Laplacian Pyramid Reconstruction and Refinement for Semantic Segmentation. In Computer Vision—ECCV 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer International Publishing: Cham, Switzerland, 2016; pp. 519–534. [Google Scholar]
- Wang, P.; Chen, P.; Yuan, Y.; Liu, D.; Huang, Z.; Hou, X.; Cottrell, G. Understanding Convolution for Semantic Segmentation. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018. [Google Scholar]
- Wu, Z.; Shen, C.; van den Hengel, A. Wider or Deeper: Revisiting the ResNet Model for Visual Recognition. Pattern Recognit. 2019, 90, 119–133. [Google Scholar] [CrossRef] [Green Version]
Methods | Road | s.Walk | Build | Wall | Fence | Pole | t-Light | t-Sign | Veg | Terrain | Sky | Person | Rider | Car | Truck | Bus | Train | Motor | Bike | Mean |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
LRR [43] | 97.7 | 79.9 | 90.7 | 44.4 | 48.6 | 58.6 | 68.2 | 72.0 | 92.5 | 69.3 | 94.7 | 81.6 | 60.0 | 94.0 | 43.6 | 56.8 | 47.2 | 54.8 | 69.7 | 69.7 |
DeepLabV2 [25] | 97.9 | 81.3 | 90.3 | 48.8 | 47.4 | 49.6 | 57.9 | 67.3 | 91.9 | 69.4 | 94.2 | 79.8 | 59.8 | 93.7 | 56.5 | 67.5 | 57.5 | 57.7 | 68.8 | 70.4 |
Piecewise [12] | 98.0 | 82.6 | 90.6 | 44.0 | 50.7 | 51.1 | 65.0 | 71.7 | 92.0 | 72.0 | 94.1 | 81.5 | 61.1 | 94.3 | 61.1 | 65.1 | 53.8 | 61.6 | 70.6 | 71.6 |
PSPNet [17] | 98.2 | 85.8 | 92.8 | 57.5 | 65.9 | 62.6 | 71.8 | 80.7 | 92.4 | 64.5 | 94.8 | 82.1 | 61.5 | 95.1 | 78.6 | 88.3 | 77.9 | 68.1 | 78.0 | 78.8 |
Multiscale [22] | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 77.8 |
DeeplabV3+ [2] | 97.8 | 83.4 | 92.8 | 67.6 | 63.2 | 64.5 | 73.9 | 79.9 | 92.7 | 70.1 | 94.8 | 83.1 | 67.4 | 95.0 | 80.0 | 90.0 | 73.1 | 71.6 | 76.7 | 79.9 |
Ours | 98.2 | 86.0 | 93.9 | 61.4 | 67.5 | 66.8 | 74.8 | 81.4 | 93.2 | 69.1 | 95.4 | 84.7 | 66.2 | 95.7 | 87.0 | 90.1 | 84.2 | 68.5 | 78.2 | 81.2 |
BaseNet | ASPP | Skip-Connection | Feature Balanced Fusion | Spatial Stream | Mean IoU (%)
---|---|---|---|---|---
ResNet50 | | | | | 73.82
ResNet50 | ✓ | | | | 76.62
ResNet50 | ✓ | ✓ | | | 78.68
ResNet50 | ✓ | ✓ | ✓ | | 80.11
ResNet50 | ✓ | ✓ | ✓ | ✓ | 80.45
ResNet101 | | | | | 75.00
ResNet101 | ✓ | | | | 77.46
ResNet101 | ✓ | ✓ | | | 79.86
ResNet101 | ✓ | ✓ | ✓ | | 81.05
ResNet101 | ✓ | ✓ | ✓ | ✓ | 81.16
Channel | 16 | 32 | 48 | 64 | 128 | 256
---|---|---|---|---|---|---
mIoU (%) | 80.75 | 80.13 | 80.99 | 80.90 | 81.16 | 80.70
Method | Backbone | OHEM | DA | MS | mIoU% |
---|---|---|---|---|---|
IFBFNet | ResNet101 | × | × | × | 77.40 |
IFBFNet | ResNet101 | ✓ | × | × | 78.88 |
IFBFNet | ResNet101 | ✓ | ✓ | × | 79.95 |
IFBFNet | ResNet101 | ✓ | ✓ | ✓ | 81.16 |
Methods | | | | | | | | | |
---|---|---|---|---|---|---|---|---|---|---
mIoU (%) | 80.29 | 81.03 | 80.99 | 81.16 | 80.51 | 80.72 | 81.03 | 79.97 | 79.97 | 80.64
Methods | Road | s.Walk | Build | Wall | Fence | Pole | t-Light | t-Sign | Veg | Terrain | Sky | Person | Rider | Car | Truck | Bus | Train | Motor | Bike | Mean |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
DeepLab-v2+CRF [25] | 97.9 | 81.3 | 90.3 | 48.8 | 47.4 | 49.6 | 57.9 | 67.3 | 91.9 | 69.4 | 94.2 | 79.8 | 59.8 | 93.7 | 56.5 | 67.5 | 57.5 | 57.7 | 68.8 | 70.4 |
FRRN [3] | 98.2 | 83.3 | 91.6 | 45.8 | 51.1 | 62.2 | 69.4 | 72.4 | 92.6 | 70 | 94.9 | 81.6 | 62.7 | 94.6 | 49.1 | 67.1 | 55.3 | 53.5 | 69.5 | 71.8 |
RefineNet [15] | 98.2 | 83.3 | 91.3 | 47.8 | 50.4 | 56.1 | 66.9 | 71.3 | 92.3 | 70.3 | 94.8 | 80.9 | 63.3 | 94.5 | 64.6 | 76.1 | 64.3 | 62.2 | 70 | 73.6 |
DUC [44] | 98.5 | 85.5 | 92.8 | 58.6 | 55.5 | 65 | 73.5 | 77.9 | 93.3 | 72 | 95.2 | 84.8 | 68.5 | 95.4 | 70.9 | 78.8 | 68.7 | 65.9 | 73.8 | 77.6 |
PSPNet [17] | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 78.4 |
ResNet-38 [45] | 98.5 | 85.7 | 93.1 | 55.5 | 59.1 | 67.1 | 74.8 | 78.7 | 93.7 | 72.6 | 95.5 | 86.6 | 69.2 | 95.7 | 64.5 | 78.8 | 74.1 | 69 | 76.7 | 78.4 |
DeepLabV3+ [2] | 98.6 | 86.3 | 92.9 | 57.4 | 59.5 | 64.6 | 73.0 | 77.6 | 93.4 | 72.5 | 95.5 | 85.6 | 69.5 | 95.7 | 69.0 | 84.6 | 74.6 | 67.2 | 75.0 | 78.5 |
IFBFNet (ours) | 98.7 | 87.0 | 93.3 | 53.8 | 60.7 | 67.7 | 76.4 | 80.1 | 93.8 | 73.7 | 95.7 | 87.3 | 72.3 | 96.2 | 70.7 | 82.5 | 75.4 | 70.2 | 77.4 | 79.6 |
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Li, D.; Fan, C.; Zou, L.; Zuo, Q.; Jiang, H.; Liu, Y. Inter-Level Feature Balanced Fusion Network for Street Scene Segmentation. Sensors 2021, 21, 7844. https://doi.org/10.3390/s21237844