Dense Semantic Labeling with Atrous Spatial Pyramid Pooling and Decoder for High-Resolution Remote Sensing Imagery
"> Figure 1
<p>Alternative network structures in dense semantic labeling (<b>a</b>) The classic FCN-32s, (<b>b</b>) The classic FCN-8s (with skip architecture), (<b>c</b>) Encoder-Decoder, (<b>d</b>) Atrous Spatial Pyramid Pooling and (<b>e</b>) Ours.</p> "> Figure 2
<p>The pipeline of our dense semantic labeling system, including data preprocessing, network training, testing and post-processing.</p> "> Figure 3
<p>The structure of ResNet50/101 which consists of one convolution layer and four Blocks. Each Block has several Bottleneck units. Inside the bottleneck unit, there is a shortcut connection between the input and output. In this study, we choose ResNet101 as the backbone of our model.</p> "> Figure 4
<p>The architecture of our proposed fully convolutional network with the fusion of ASPP and encoder-decoder structures. ResNet101 followed by ASPP is the encoder part to extract multiple scale contextual information. While the proposed decoder shown as the purple blocks refines the boundary of object. In the end, the multi-scale loss function guides the training procedure.</p> "> Figure 5
<p>A sample of remote sensing imagery with different channels of data and the corresponding groudtruth in the Potsdam datasets.</p> "> Figure 6
<p>A sample of the result of our proposed model on the Potsdam dataset. (<b>a</b>) the high-resolution remote sensing imageries. (<b>b</b>) the corresponding groundtruth. (<b>c</b>) the prediction maps of our proposed model.</p> "> Figure 7
<p>A sample of the result of our proposed model on the Vaihingen dataset. (<b>a</b>) the high-resolution remote sensing imageries (<b>b</b>) the corresponding groundtruth (<b>c</b>) the prediction maps of our proposed model.</p> "> Figure 8
<p>A sample of comparison prediction results of different methods on the Potsdam dataset. (<b>a</b>) input imagery, (<b>b</b>) normalized DSM, (<b>c</b>) corresponding ground truth, (<b>d</b>) result of SVL_1, (<b>e</b>) result of FCN, (<b>f</b>) result of DST_5, (<b>g</b>) result of RIT6, (<b>h</b>) result of U-net, (<b>i</b>) result of DeepLab_v3, (<b>j</b>) result of DeepLab_v3+ and (<b>k</b>) our model.</p> "> Figure 9
<p>The effect of superpixel based DenseCRF. Here we show a small patch of the original imagery from the Potsdam dataset (<b>a</b>) input imagery, (<b>b</b>) groundtruth, (<b>c</b>) superpixel segmentation to input imagery, (<b>d</b>) superpixel constraint to prediction map, (<b>e</b>) prediction map from our model and (<b>f</b>) prediction map after superpixel-based DenseCRF.</p> ">
Abstract
1. Introduction
- We propose a novel convolutional neural network that combines the advantages of ASPP and encoder-decoder structures.
- We enhance the learning procedure by employing a multi-scale loss function.
- We improve the dense conditional random field with a superpixel algorithm to optimize the prediction further.
2. Methods
2.1. Encoder with ResNet and Atrous Spatial Pyramid Pooling
2.1.1. ResNet-101 as the Backbone
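Figure 3 shows the bottleneck unit as a stack of convolutions with a shortcut connection between input and output. A minimal TensorFlow/Keras sketch of such a unit follows; it assumes the standard ResNet design (1×1 reduce, 3×3, 1×1 expand by a factor of four) and omits batch normalization for brevity, so it illustrates the structure rather than reproducing the authors' exact implementation.

```python
from tensorflow.keras import layers

def bottleneck(x, channels, stride=1):
    """1x1 reduce -> 3x3 -> 1x1 expand (x4), plus the shortcut connection
    shown in Figure 3; batch normalization is omitted for brevity."""
    shortcut = x
    y = layers.Conv2D(channels, 1, strides=stride, activation="relu")(x)
    y = layers.Conv2D(channels, 3, padding="same", activation="relu")(y)
    y = layers.Conv2D(4 * channels, 1)(y)
    if stride != 1 or x.shape[-1] != 4 * channels:
        # Projection shortcut when the spatial size or channel count changes.
        shortcut = layers.Conv2D(4 * channels, 1, strides=stride)(x)
    return layers.ReLU()(layers.Add()([y, shortcut]))
```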
2.1.2. Atrous Spatial Pyramid Pooling
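Per Figure 4 and the architecture table later in this article, the ASPP head has five parallel branches: one plain convolution, three atrous convolutions with rates 6, 12, and 18, and an image-level pooling branch; their 5 × 256 = 1280 concatenated channels are fused back to 256 by a final convolution. A minimal sketch follows; the 1×1 and 3×3 kernel sizes are assumptions borrowed from the DeepLab_v3 convention [32].

```python
import tensorflow as tf
from tensorflow.keras import layers

def aspp(x, out_channels=256, rates=(6, 12, 18)):
    """Five parallel branches as in the architecture table: branch1 (plain
    conv), branch2-4 (atrous convs, rates 6/12/18), branch5 (image-level
    pooling).  Kernel sizes are assumptions."""
    branches = [layers.Conv2D(out_channels, 1, padding="same",
                              activation="relu")(x)]                 # branch1
    for r in rates:                                                  # branch2-4
        branches.append(layers.Conv2D(out_channels, 3, padding="same",
                                      dilation_rate=r, activation="relu")(x))
    pooled = layers.GlobalAveragePooling2D(keepdims=True)(x)         # branch5
    pooled = layers.Conv2D(out_channels, 1, activation="relu")(pooled)
    pooled = tf.image.resize(pooled, tf.shape(x)[1:3])               # back to H x W
    branches.append(pooled)
    y = layers.Concatenate()(branches)           # concat_1: 5 x 256 = 1280
    return layers.Conv2D(out_channels, 1, activation="relu")(y)      # conv1_1
```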
2.2. Decoder and the Multi-scale Loss Function
2.2.1. Proposed Decoder
2.2.2. Multi-scale Loss Function
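Figure 4 states that a multi-scale loss guides training, and the architecture table shows two 6-channel prediction layers (conv1_4 and conv3_2) at different resolutions. The sketch below is one plausible reading, assuming each scale is supervised with cross-entropy [35] against a correspondingly downsampled label map; the equal weighting is an assumption.

```python
import tensorflow as tf

def multi_scale_loss(labels, logits_per_scale, weights=None):
    """Combine cross-entropy losses computed at several output scales.
    `labels` is (B, H, W) int32; each entry of `logits_per_scale` is
    (B, h_i, w_i, num_classes).  Equal weights are an assumption."""
    weights = weights or [1.0] * len(logits_per_scale)
    total = 0.0
    for wt, logits in zip(weights, logits_per_scale):
        h, w = logits.shape[1], logits.shape[2]
        # Nearest-neighbour resize keeps the map a valid integer label image.
        lab = tf.image.resize(labels[..., None], (h, w), method="nearest")
        lab = tf.squeeze(tf.cast(lab, tf.int32), axis=-1)
        total += wt * tf.reduce_mean(
            tf.nn.sparse_softmax_cross_entropy_with_logits(
                labels=lab, logits=logits))
    return total
```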
2.3. Dense Conditional Random Fields Based on Superpixel
Algorithm 1. The process of CRF based on superpixel.
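As a hedged illustration of Algorithm 1's superpixel step, the sketch below uses SLIC [92] to segment the image and forces every pixel within a superpixel to share one label by averaging the class probabilities over the segment. The segment count and compactness values are assumptions, and the DenseCRF side of the method is sketched separately under Section 4.3.

```python
import numpy as np
from skimage.segmentation import slic

def superpixel_constraint(image, prob, n_segments=2000, compactness=10.0):
    """Constrain a class-probability map `prob` (H, W, C) so that all
    pixels of a SLIC superpixel receive the same label."""
    segments = slic(image, n_segments=n_segments, compactness=compactness)
    labels = np.empty(segments.shape, dtype=np.int32)
    for seg_id in np.unique(segments):
        mask = segments == seg_id
        # Mean probability over the segment, then argmax: one label per segment.
        labels[mask] = prob[mask].mean(axis=0).argmax()
    return labels
```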
3. Results
3.1. Datasets
3.2. Preprocessing the Datasets
3.3. Training Protocol and Metrics
3.4. Experimental Results
4. Evaluation and Discussion
4.1. The Importance of Multi-scale Loss Function
4.2. Comparison to DeepLab_v3+ and Other State-of-the-Art Networks
4.3. The Influence of Superpixel-based DenseCRF
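For the DenseCRF half of the post-processing evaluated in this section, the snippet below is a hedged sketch assuming the widely used pydensecrf package; the pairwise kernel parameters shown are common defaults, not the paper's tuned values, and the image must be a C-contiguous uint8 RGB array.

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def dense_crf(image, prob, iters=5):
    """Refine a softmax map `prob` (H, W, C) with a fully connected CRF.
    `image` is a C-contiguous uint8 RGB array of the same H x W."""
    h, w, c = prob.shape
    d = dcrf.DenseCRF2D(w, h, c)
    # Unary potentials: negative log of the network's softmax output.
    unary = unary_from_softmax(np.ascontiguousarray(prob.transpose(2, 0, 1)))
    d.setUnaryEnergy(unary)
    d.addPairwiseGaussian(sxy=3, compat=3)                           # smoothness kernel
    d.addPairwiseBilateral(sxy=60, srgb=10, rgbim=image, compat=5)   # appearance kernel
    q = np.array(d.inference(iters)).reshape(c, h, w)
    return q.argmax(axis=0)                                          # (H, W) label map
```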
5. Conclusions
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
Abbreviations
Abbreviation | Definition
---|---
AdaBoost | Adaptive Boosting
ASPP | Atrous Spatial Pyramid Pooling
CNN | Convolutional Neural Network
CRF | Conditional Random Field
DSM | Digital Surface Model
FCN | Fully Convolutional Network
HOG | Histogram of Oriented Gradients
ISPRS | International Society for Photogrammetry and Remote Sensing
SLIC | Simple Linear Iterative Clustering
UAV | Unmanned Aerial Vehicle
VGG | Visual Geometry Group
References
- Moser, G.; Serpico, S.B.; Benediktsson, J.A. Land-cover mapping by Markov modeling of spatial-contextual information in very-high-resolution remote sensing images. Proc. IEEE 2013, 101, 631–651. [Google Scholar] [CrossRef]
- Xu, Y.; Wu, L.; Xie, Z.; Chen, Z. Building extraction in very high resolution remote sensing imagery using deep learning and guided filters. Remote Sens. 2018, 10, 144. [Google Scholar] [CrossRef]
- Pan, X.; Zhao, J. High-resolution remote sensing image classification method based on convolutional neural network and restricted conditional random field. Remote Sens. 2018, 10, 920. [Google Scholar]
- Marmanis, D.; Wegner, J.D.; Galliani, S.; Schindler, K.; Datcu, M. Semantic segmentation of aerial images with an ensemble of CNNs. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. 2016, 3, 473. [Google Scholar] [CrossRef]
- Li, M.; Zang, S.; Zhang, B.; Li, S.; Wu, C. A review of remote sensing image classification techniques: The role of spatio-contextual information. Eur. J. Remote Sens. 2014, 47, 389–411. [Google Scholar] [CrossRef]
- Kampffmeyer, M.; Salberg, A.-B.; Jenssen, R. Semantic segmentation of small objects and modeling of uncertainty in urban remote sensing images using deep convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 1–9. [Google Scholar]
- Volpi, M.; Tuia, D. Dense semantic labeling of subdecimeter resolution images with convolutional neural networks. IEEE Trans. Geosci. Remote Sens. 2017, 55, 881–893. [Google Scholar]
- Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Diego, CA, USA, 20–25 June 2005; Volume 1, pp. 886–893. [Google Scholar]
- Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
- Bay, H.; Ess, A.; Tuytelaars, T.; Van Gool, L. Speeded-up robust features (SURF). Comput. Vis. Image Underst. 2008, 110, 346–359. [Google Scholar]
- Inglada, J. Automatic recognition of man-made objects in high resolution optical remote sensing images by SVM classification of geometric image features. ISPRS J. Photogramm. Remote Sens. 2007, 62, 236–248. [Google Scholar] [CrossRef]
- Belgiu, M.; Drăguţ, L. Random forest in remote sensing: A review of applications and future directions. ISPRS J. Photogramm. Remote Sens. 2016, 114, 24–31. [Google Scholar]
- Celik, T. Unsupervised change detection in satellite images using principal component analysis and k-means clustering. IEEE Geosci. Remote Sens. Lett. 2009, 6, 772–776. [Google Scholar] [CrossRef]
- Wang, H.; Wang, Y.; Zhang, Q.; Xiang, S.; Pan, C. Gated convolutional neural network for semantic segmentation in high-resolution images. Remote Sens. 2017, 9, 446. [Google Scholar] [CrossRef]
- Liu, Y.; Piramanayagam, S.; Monteiro, S.T.; Saber, E. Dense semantic labeling of very-high-resolution aerial imagery and LiDAR with fully-convolutional neural networks and higher-order CRFs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 76–85. [Google Scholar]
- Noh, H.; Hong, S.; Han, B. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1520–1528. [Google Scholar]
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. In Proceedings of the International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; pp. 1097–1105. [Google Scholar]
- Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Li, F.-F. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
- Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv, 2014; arXiv:1409.1556. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778. [Google Scholar]
- Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
- Wei, X.; Fu, K.; Gao, X.; Yan, M.; Sun, X.; Chen, K.; Sun, H. Semantic pixel labelling in remote sensing images using a deep convolutional encoder-decoder model. Remote Sens. Lett. 2018, 9, 199–208. [Google Scholar] [CrossRef]
- Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
- Fu, G.; Liu, C.; Zhou, R.; Sun, T.; Zhang, Q. Classification for High Resolution Remote Sensing Imagery Using a Fully Convolutional Network. Remote Sens. 2017, 9, 498. [Google Scholar] [CrossRef]
- Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890. [Google Scholar]
- Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar]
- Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef] [PubMed]
- Peng, C.; Zhang, X.; Yu, G.; Luo, G.; Sun, J. Large kernel matters-improve semantic segmentation by global convolutional network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1743–1751. [Google Scholar]
- Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. arXiv, 2018; arXiv:1802.02611. [Google Scholar]
- Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
- Maggiori, E.; Tarabalka, Y.; Charpiat, G.; Alliez, P. Fully convolutional networks for remote sensing image classification. In Proceedings of the IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Beijing, China, 10–15 July 2016; pp. 5071–5074. [Google Scholar]
- Yu, F.; Koltun, V. Multi-scale context aggregation by dilated convolutions. arXiv, 2015; arXiv:1511.07122. [Google Scholar]
- Chen, L.C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking atrous convolution for semantic image segmentation. arXiv, 2017; arXiv:1706.05587. [Google Scholar]
- Shore, J.; Johnson, R. Properties of cross-entropy minimization. IEEE Trans. Inf. Theory 1981, 27, 472–482. [Google Scholar] [CrossRef]
- Bottou, L. Large-scale machine learning with stochastic gradient descent. In Proceedings of the 19th International Conference on Computational Statistics, Paris, France, 22–27 August 2010; pp. 177–186. [Google Scholar]
- Liu, F.Y.; Lin, G.S.; Shen, C.H. CRF learning with CNN features for image segmentation. Pattern Recognit. 2015, 48, 2988–2992. [Google Scholar] [CrossRef]
- Alam, F.I.; Zhou, J.; Liew, A.W.C.; Jia, X.P. CRF learning with CNN features for hyperspectral image segmentation. In Proceedings of the 2016 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Beijing, China, 10–15 July 2016; pp. 6890–6893. [Google Scholar]
- Wu, F.Y. The Potts model. Rev. Mod. Phys. 1982, 54, 235. [Google Scholar] [CrossRef]
- Achanta, R.; Shaji, A.; Smith, K.; Lucchi, A.; Fua, P.; Süsstrunk, S. SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 34, 2274–2281. [Google Scholar] [CrossRef] [PubMed]
- Van den Bergh, M.; Boix, X.; Roig, G.; de Capitani, B.; Van Gool, L. SEEDS: Superpixels extracted via energy-driven sampling. In Proceedings of the 12th European Conference on Computer Vision, Part VII, Florence, Italy, 7–13 October 2012; Springer: New York, NY, USA, 2012; pp. 13–26. [Google Scholar]
- Gerke, M. Use of the Stair Vision Library within the ISPRS 2D Semantic Labeling Benchmark (Vaihingen); Technical Report; University of Twente: Enschede, The Netherlands, 2015. [Google Scholar]
- Liu, Y.; Ren, Q.; Geng, J.; Ding, M.; Li, J. Efficient Patch-Wise Semantic Segmentation for Large-Scale Remote Sensing Images. Sensors 2018, 18, 3232. [Google Scholar] [CrossRef] [PubMed]
- Lin, M.; Chen, Q.; Yan, S. Network in network. arXiv, 2013; arXiv:1312.4400. [Google Scholar]
- Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; et al. TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation, Savannah, GA, USA, 2–4 November 2016; pp. 265–283. [Google Scholar]
- Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Semantic image segmentation with deep convolutional nets and fully connected CRFs. arXiv, 2014; arXiv:1412.7062. [Google Scholar]
- Everingham, M.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The PASCAL visual object classes (VOC) challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef]
- Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 3213–3223. [Google Scholar]
- Sherrah, J. Fully convolutional networks for dense semantic labelling of high-resolution aerial imagery. arXiv, 2016; arXiv:1606.02585. [Google Scholar]
- Piramanayagam, S.; Saber, E.; Schwartzkopf, W.; Koehler, F. Supervised Classification of Multisensor Remotely Sensed Images Using a Deep Learning Framework. Remote Sens. 2018, 10, 1429. [Google Scholar] [CrossRef]
- Zhao, W.; Fu, Y.; Wei, X.; Wang, H. An Improved Image Semantic Segmentation Method Based on Superpixels and Conditional Random Fields. Appl. Sci. 2018, 8, 837. [Google Scholar] [CrossRef]
Stage | Layer | Type | Output Channels | Connects to
---|---|---|---|---
ResNet101 | conv_1 | convolution | 128 | block1
 | block1 | residual_block | 256 | block2 & conv2_2
 | block2 | residual_block | 512 | block3 & conv1_2
 | block3 | residual_block | 1024 | block4
 | block4 | residual_block | 2048 | ASPP
ASPP | branch1 | convolution | 256 | concat_1
 | branch2 | atrous_conv (rate = 6) | 256 | concat_1
 | branch3 | atrous_conv (rate = 12) | 256 | concat_1
 | branch4 | atrous_conv (rate = 18) | 256 | concat_1
 | branch5 | global_pooling | 256 | concat_1
 | concat_1 | concatenation | 1280 | conv1_1
 | conv1_1 | convolution | 256 | up_1
Decoder | up_1 | upsample | 256 | concat_2 & conv1_3
 | conv1_2 | convolution | 48 | concat_2
 | conv1_3 | convolution | 256 | conv1_4
 | conv1_4 | convolution | 6 | 
 | concat_2 | concatenation | 304 | conv2_1
 | conv2_1 | convolution | 256 | up_2
 | up_2 | upsample | 256 | concat_3
 | conv2_2 | convolution | 48 | concat_3
 | concat_3 | concatenation | 304 | conv3_1
 | conv3_1 | convolution | 256 | conv3_2
 | conv3_2 | convolution | 6 | up_3
 | up_3 | upsample | 6 | 
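Reading the Decoder rows of the table above left to right gives the following hedged TensorFlow/Keras sketch. The channel widths and the connection pattern come from the table; the kernel sizes, upsampling factors, and bilinear interpolation are assumptions, and the auxiliary conv1_3/conv1_4 branch used by the multi-scale loss is omitted here.

```python
from tensorflow.keras import layers

def decoder(aspp_out, block2_feat, block1_feat, num_classes=6):
    """Decoder path from the architecture table; kernel sizes and
    upsampling factors are assumptions."""
    x = layers.UpSampling2D(2, interpolation="bilinear")(aspp_out)    # up_1
    low2 = layers.Conv2D(48, 1, activation="relu")(block2_feat)       # conv1_2
    x = layers.Concatenate()([x, low2])                               # concat_2: 304 ch
    x = layers.Conv2D(256, 3, padding="same", activation="relu")(x)   # conv2_1
    x = layers.UpSampling2D(2, interpolation="bilinear")(x)           # up_2
    low1 = layers.Conv2D(48, 1, activation="relu")(block1_feat)       # conv2_2
    x = layers.Concatenate()([x, low1])                               # concat_3: 304 ch
    x = layers.Conv2D(256, 3, padding="same", activation="relu")(x)   # conv3_1
    x = layers.Conv2D(num_classes, 1)(x)                              # conv3_2: logits
    return layers.UpSampling2D(4, interpolation="bilinear")(x)        # up_3
```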
Metrics | Imp_surf | Building | Low_veg | Tree | Car | Average
---|---|---|---|---|---|---
OA | N/A | N/A | N/A | N/A | N/A | 0.883
Precision | 0.889 | 0.946 | 0.827 | 0.853 | 0.912 | 0.885
Recall | 0.916 | 0.972 | 0.853 | 0.840 | 0.881 | 0.893
F1 | 0.902 | 0.959 | 0.839 | 0.843 | 0.896 | 0.888
Metrics | Imp_surf | Building | Low_veg | Tree | Car | Average
---|---|---|---|---|---|---
OA | N/A | N/A | N/A | N/A | N/A | 0.867
Precision | 0.877 | 0.912 | 0.790 | 0.838 | 0.785 | 0.840
Recall | 0.887 | 0.826 | 0.766 | 0.873 | 0.712 | 0.833
F1 | 0.881 | 0.917 | 0.776 | 0.852 | 0.739 | 0.833
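For reference, the per-class numbers in the two tables above can be derived from a class confusion matrix as sketched below; this is the standard definition of precision, recall, F1, and overall accuracy, not code from the paper.

```python
import numpy as np

def per_class_metrics(conf):
    """Derive per-class precision/recall/F1 and overall accuracy from a
    confusion matrix `conf` (rows: ground truth, columns: prediction)."""
    tp = np.diag(conf).astype(float)
    precision = tp / conf.sum(axis=0)     # correct / predicted, per class
    recall = tp / conf.sum(axis=1)        # correct / actual, per class
    f1 = 2 * precision * recall / (precision + recall)
    oa = tp.sum() / conf.sum()            # overall accuracy
    return precision, recall, f1, oa
```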
Potsdam | Precision | Recall | F1 | OA
---|---|---|---|---
single loss | 0.884 | 0.888 | 0.886 | 0.879
multi-scale loss | 0.885 | 0.893 | 0.888 | 0.883

Vaihingen | Precision | Recall | F1 | OA
---|---|---|---|---
single loss | 0.837 | 0.827 | 0.830 | 0.858
multi-scale loss | 0.840 | 0.833 | 0.833 | 0.867
Potsdam | Precision | Recall | F1 | OA
---|---|---|---|---
DeepLab_v3+ [44] | 0.882 | 0.889 | 0.884 | 0.880
Ours | 0.885 | 0.893 | 0.888 | 0.883

Vaihingen | Precision | Recall | F1 | OA
---|---|---|---|---
DeepLab_v3+ [44] | 0.837 | 0.829 | 0.830 | 0.864
Ours | 0.840 | 0.833 | 0.833 | 0.867
Method | Precision | Recall | F1 | OA
---|---|---|---|---
SVL_1 | 0.763 | 0.703 | 0.721 | 0.754
FCN [23] | 0.807 | 0.823 | 0.812 | 0.824
DST_5 [48] | 0.886 | 0.884 | 0.885 | 0.878
RIT6 [49] | 0.886 | 0.892 | 0.888 | 0.879
U-net [27] | 0.859 | 0.881 | 0.867 | 0.860
DeepLab_v3 [32] | 0.881 | 0.886 | 0.882 | 0.878
DeepLab_v3+ [44] | 0.882 | 0.889 | 0.884 | 0.880
Ours | 0.885 | 0.893 | 0.888 | 0.883
Potsdam | Precision | Recall | F1 | OA
---|---|---|---|---
Before Superpixel-CRF | 0.885 | 0.893 | 0.888 | 0.883
After Superpixel-CRF | 0.888 | 0.892 | 0.889 | 0.884

Vaihingen | Precision | Recall | F1 | OA
---|---|---|---|---
Before Superpixel-CRF | 0.840 | 0.833 | 0.833 | 0.867
After Superpixel-CRF | 0.847 | 0.833 | 0.835 | 0.870
© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).