Efficient Transformer for Remote Sensing Image Segmentation
"> Figure 1
<p>Flops vs. mIoU on the Potsdam and Vaihingen datasets.</p> "> Figure 2
<p>The overall framework of the Swin transformer (Swin-T).</p> "> Figure 3
<p>An illustration of the shifted window approach.</p> "> Figure 4
<p>The architecture of uperhead.</p> "> Figure 5
<p>The overall framework of the Efficient transformer (Efficient-T).</p> "> Figure 6
<p>The architecture of mlphead.</p> "> Figure 7
<p>Examples of uncertain edge definitions.</p> "> Figure 8
<p>Detailed structure of explicit edge enhancement method.</p> "> Figure 9
<p>Architecture of the CNN-based edge extractor.</p> "> Figure 10
<p>Illustration of the implicit edge enhancement method.</p> "> Figure 11
<p>Epoch vs. Loss of Adamw and SGD optimizers.</p> "> Figure 12
<p>Visualization of C1, C2, C3, and C4 features of categories building (<b>top</b>) and car (<b>bottom</b>).</p> "> Figure 13
<p>Visualization of explicit and implicit edge enhancement methods.</p> "> Figure 14
<p>Prediction maps of the compared methods on the Vaihingen dataset.</p> "> Figure 15
<p>Prediction maps of the compared methods on the Potsdam dataset.</p> "> Figure 16
<p>Comparison of the improvement on blurry areas.</p> ">
Abstract
1. Introduction
- A Swin transformer was introduced to better model global relations and, in a first attempt on the remote sensing Potsdam and Vaihingen datasets, proved effective in achieving state-of-the-art performance (a minimal sketch of its shifted-window partitioning follows this list).
- A lightweight Efficient transformer backbone and a pure transformer mlphead were proposed to reduce the computational load of the Swin transformer and accelerate inference.
- Explicit and implicit edge enhancement methods were proposed to cope with the object edge extraction problem in the transformer architecture.
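To make the shifted-window mechanism behind the Swin backbone concrete, the sketch below shows how a feature map is cyclically shifted, partitioned into non-overlapping windows on which local self-attention is computed, and then reassembled. This is a minimal PyTorch-style illustration of the general technique only; the tensor layout, window size, and helper names are assumptions, not the authors' implementation.

```python
import torch

def window_partition(x, window_size):
    """Split a (B, H, W, C) feature map into non-overlapping windows.

    H and W are assumed divisible by window_size.
    Returns a tensor of shape (num_windows * B, window_size, window_size, C).
    """
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    windows = x.permute(0, 1, 3, 2, 4, 5).contiguous()
    return windows.view(-1, window_size, window_size, C)

def window_reverse(windows, window_size, H, W):
    """Inverse of window_partition: reassemble windows into the full feature map."""
    B = windows.shape[0] // (H * W // window_size // window_size)
    x = windows.view(B, H // window_size, W // window_size, window_size, window_size, -1)
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous()
    return x.view(B, H, W, -1)

# Shifted windows: roll the feature map by half a window before partitioning so that
# the next attention layer mixes information across the previous window borders.
B, H, W, C, window_size = 1, 8, 8, 32, 4          # illustrative sizes
x = torch.randn(B, H, W, C)
shifted = torch.roll(x, shifts=(-window_size // 2, -window_size // 2), dims=(1, 2))
windows = window_partition(shifted, window_size)   # (4, 4, 4, 32) for these sizes
# ... window-based multi-head self-attention would operate on `windows` here ...
restored = window_reverse(windows, window_size, H, W)
restored = torch.roll(restored, shifts=(window_size // 2, window_size // 2), dims=(1, 2))
assert torch.allclose(restored, x)                 # partition/shift round-trips exactly
```

Because attention is restricted to fixed-size windows, with the shift restoring cross-window interaction in alternating layers, the cost grows roughly linearly with image size rather than quadratically, which is what makes Swin-style backbones tractable for large remote sensing tiles.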
2. Related Work
3. Methods
3.1. Investigation of Basic Swin Transformer Backbone and Uperhead
3.1.1. Swin Transformer Backbone
3.1.2. Uperhead Introduction
3.2. Efficient Architecture Design
3.2.1. Efficient Transformer Backbone
3.2.2. Mlphead Design
3.3. Edge Processing
3.3.1. Explicit Edge Enhancement
3.3.2. Implicit Edge Enhancement
4. Experimental Results
4.1. Datasets and Experimental Settings
4.2. Study for Swin Transformer
4.2.1. Study for Pre-Trained Weights
4.2.2. Study for Optimizer
4.2.3. Study for Segmentation Head
4.3. Efficient Transformer Backbone and Mlphead
4.4. Edge Processing Methods
4.5. Comparison to SOTA Methods
5. Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Witharana, C.; Bhuiyan, M.A.E.; Liljedahl, A.K.; Kanevskiy, M.; Epstein, H.E.; Jones, B.M.; Daanen, R.; Griffin, C.G.; Kent, K.; Jones, M.K.W. Understanding the synergies of deep learning and data fusion of multispectral and panchromatic high resolution commercial satellite imagery for automated ice-wedge polygon detection. ISPRS J. Photogramm. Remote Sens. 2020, 170, 174–191. [Google Scholar] [CrossRef]
- Zhang, T.; Su, J.; Liu, C.; Chen, W.H. State and parameter estimation of the AquaCrop model for winter wheat using sensitivity informed particle filter. Comput. Electron. Agric. 2021, 180, 105909. [Google Scholar] [CrossRef]
- Zhang, J.; Lin, S.; Ding, L.; Bruzzone, L. Multi-Scale Context Aggregation for Semantic Segmentation of Remote Sensing Images. Remote Sens. 2020, 12, 701. [Google Scholar] [CrossRef] [Green Version]
- Khan, S.; Naseer, M.; Hayat, M.; Zamir, S.W.; Khan, F.S.; Shah, M. Transformers in vision: A survey. arXiv 2021, arXiv:2101.01169. [Google Scholar]
- Xu, Z.; Zhang, W.; Zhang, T.; Li, J. HRCNet: High-resolution context extraction network for semantic segmentation of remote sensing images. Remote Sens. 2021, 13, 71. [Google Scholar] [CrossRef]
- Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
- Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Luo, Z.; Wang, Y.; Fu, Y.; Feng, J.; Xiang, T.; Torr, P.H.; et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 6881–6890. [Google Scholar]
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. arXiv 2021, arXiv:2103.14030. [Google Scholar]
- Zhang, Q.; Yang, Y. ResT: An Efficient Transformer for Visual Recognition. arXiv 2021, arXiv:2105.13677. [Google Scholar]
- Tolstikhin, I.; Houlsby, N.; Kolesnikov, A.; Beyer, L.; Zhai, X.; Unterthiner, T.; Yung, J.; Keysers, D.; Uszkoreit, J.; Lucic, M.; et al. Mlp-mixer: An all-mlp architecture for vision. arXiv 2021, arXiv:2105.01601. [Google Scholar]
- Chu, X.; Tian, Z.; Wang, Y.; Zhang, B.; Ren, H.; Wei, X.; Xia, H.; Shen, C. Twins: Revisiting the design of spatial attention in vision transformers. arXiv 2021, arXiv:2104.13840. [Google Scholar]
- Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the International Conference on Medical Image Computing And Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar]
- Jin, Y.; Xu, W.; Zhang, C.; Luo, X.; Jia, H. Boundary-aware refined network for automatic building extraction in very high-resolution urban aerial images. Remote Sens. 2021, 13, 692. [Google Scholar] [CrossRef]
- Chen, L.C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587. [Google Scholar]
- Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890. [Google Scholar]
- Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
- Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 3146–3154. [Google Scholar]
- Cao, Y.; Xu, J.; Lin, S.; Wei, F.; Hu, H. Gcnet: Non-local networks meet squeeze-excitation networks and beyond. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Seoul, Korea, 27 October–2 November 2019. [Google Scholar]
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
- Yu, C.; Wang, J.; Peng, C.; Gao, C.; Yu, G.; Sang, N. Bisenet: Bilateral segmentation network for real-time semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 325–341. [Google Scholar]
- Yu, C.; Gao, C.; Wang, J.; Yu, G.; Shen, C.; Sang, N. BiSeNet V2: Bilateral Network with Guided Aggregation for Real-time Semantic Segmentation. arXiv 2020, arXiv:2004.02147. [Google Scholar]
- Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
- Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4510–4520. [Google Scholar]
- Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. Transunet: Transformers make strong encoders for medical image segmentation. arXiv 2021, arXiv:2102.04306. [Google Scholar]
- Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. arXiv 2021, arXiv:2105.15203. [Google Scholar]
- Yang, G.; Zhang, Q.; Zhang, G. EANet: Edge-Aware Network for the Extraction of Buildings from Aerial Images. Remote Sens. 2020, 12, 2161. [Google Scholar] [CrossRef]
- Sun, K.; Zhao, Y.; Jiang, B.; Cheng, T.; Xiao, B.; Liu, D.; Mu, Y.; Wang, X.; Liu, W.; Wang, J. High-resolution representations for labeling pixels and regions. arXiv 2019, arXiv:1904.04514. [Google Scholar]
- Xiao, T.; Liu, Y.; Zhou, B.; Jiang, Y.; Sun, J. Unified perceptual parsing for scene understanding. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 418–434. [Google Scholar]
- Haut, J.M.; Bernabé, S.; Paoletti, M.E.; Fernandez-Beltran, R.; Plaza, A.; Plaza, J. Low–high-power consumption architectures for deep-learning models applied to hyperspectral image classification. IEEE Geosci. Remote Sens. Lett. 2018, 16, 776–780. [Google Scholar] [CrossRef]
- Zhang, C.; Jiang, W.; Zhao, Q. Semantic Segmentation of Aerial Imagery via Split-Attention Networks with Disentangled Nonlocal and Edge Supervision. Remote Sens. 2021, 13, 1176. [Google Scholar] [CrossRef]
- Zhang, T.; Su, J.; Xu, Z.; Luo, Y.; Li, J. Sentinel-2 satellite imagery for urban land cover classification by optimized random forest classifier. Appl. Sci. 2021, 11, 543. [Google Scholar] [CrossRef]
- Yuan, W.; Zhang, W.; Lai, Z.; Zhang, J. Extraction of Yardang characteristics using object-based image analysis and canny edge detection methods. Remote Sens. 2020, 12, 726. [Google Scholar] [CrossRef] [Green Version]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Sugano, H.; Miyamoto, R. Parallel implementation of morphological processing on cell/BE with OpenCV interface. In Proceedings of the 2008 3rd International Symposium on Communications, Control and Signal Processing, St. Julians, Malta, 12–14 March 2008; IEEE: New York, NY, USA, 2008; pp. 578–583. [Google Scholar]
- He, J.; Deng, Z.; Zhou, L.; Wang, Y.; Qiao, Y. Adaptive pyramid context network for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7519–7528. [Google Scholar]
- Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
- Yin, M.; Yao, Z.; Cao, Y.; Li, X.; Zhang, Z.; Lin, S.; Hu, H. Disentangled non-local neural networks. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2020; pp. 191–207. [Google Scholar]
- Zhao, H.; Zhang, Y.; Liu, S.; Shi, J.; Loy, C.C.; Lin, D.; Jia, J. Psanet: Point-wise spatial attention network for scene parsing. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 267–283. [Google Scholar]
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25, 1097–1105. [Google Scholar] [CrossRef]
- Zhou, B.; Zhao, H.; Puig, X.; Fidler, S.; Barriuso, A.; Torralba, A. Scene parsing through ade20k dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 633–641. [Google Scholar]
- Ruder, S. An overview of gradient descent optimization algorithms. arXiv 2016, arXiv:1609.04747. [Google Scholar]
- Robbins, H.; Monro, S. A stochastic approximation method. Ann. Math. Stat. 1951, 22, 400–407. [Google Scholar] [CrossRef]
- Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101. [Google Scholar]
- Wang, J.; Shen, L.; Qiao, W.; Dai, Y.; Li, Z. Deep feature fusion with integration of residual connection and attention model for classification of VHR remote sensing images. Remote Sens. 2019, 11, 1617. [Google Scholar] [CrossRef] [Green Version]
Backbone | Edge | Recall (%) | Precision (%) | F1 (%) | OA (%) | mIoU (%) |
---|---|---|---|---|---|---|
Swin-B [10] | w/o | 86.50 | 86.77 | 86.63 | 88.01 | 76.75 |
Swin-B [10] | w | 98.15 | 97.28 | 97.71 | 98.08 | 95.55 |
Pre-Trained Weight | Pre-Trained Size | Window Size | Epoch | F1 (%) | OA (%) | mIoU (%) |
---|---|---|---|---|---|---|
w/o | / | 7 | 100 | 81.99 | 85.58 | 70.17 |
w/o | / | 7 | 300 | 82.85 | 85.76 | 71.26 |
w/o | / | 7 | 1000 | 83.12 | 85.83 | 71.61 |
ImageNet 1k [41] | 224 × 224 | 7 | 100 | 86.23 | 87.60 | 76.11 |
ImageNet 1k [41] | 384 × 384 | 12 | 100 | 86.25 | 87.68 | 76.11 |
ImageNet 22k [41] | 224 × 224 | 7 | 100 | 86.63 | 88.01 | 76.75 |
ImageNet 22k [41] | 384 × 384 | 12 | 100 | 86.58 | 88.04 | 76.68 |
ImageNet 22k to 1k [41] | 224 × 224 | 7 | 100 | 86.64 | 87.94 | 76.73 |
ImageNet 22k to 1k [41] | 384 × 384 | 12 | 100 | 86.47 | 87.91 | 76.49 |
Ade20k [42] | 512 × 512 | 7 | 100 | 86.16 | 87.59 | 76.03 |
Potsdam | 512 × 512 | 7 | 100 | 82.32 | 86.70 | 71.56 |
Pre-Trained Weight | Pre-Trained Size | Training Size | Window Size | F1 (%) | OA (%) | mIoU (%) |
---|---|---|---|---|---|---|
ImageNet 22k [41] | 224 × 224 | 224 × 224 | 7 | 85.98 | 87.37 | 75.72 |
ImageNet 22k [41] | 224 × 224 | 512 × 512 | 7 | 86.63 | 88.01 | 76.75 |
ImageNet 22k [41] | 224 × 224 | 768 × 768 | 7 | 86.66 | 88.23 | 76.81 |
ImageNet 22k [41] | 384 × 384 | 384 × 384 | 12 | 86.47 | 87.98 | 76.49 |
ImageNet 22k [41] | 384 × 384 | 512 × 512 | 12 | 86.58 | 88.04 | 76.68 |
Method | Optimizer | Aux | F1 (%) | OA (%) | mIoU (%) |
---|---|---|---|---|---|
Swin-B [10] | AdamW [45] | w | 86.63 | 88.01 | 76.75 |
Swin-B [10] | AdamW [45] | w/o | 86.18 | 87.55 | 76.00 |
Swin-B [10] | SGD [44] | w | 85.77 | 87.73 | 75.52 |
Swin-B [10] | SGD [44] | w/o | 85.40 | 87.66 | 75.00 |
Head | Input Layers | F1 (%) | OA (%) | mIoU (%) | Flops (G) | Params (M) |
---|---|---|---|---|---|---|
apchead [37] | C4 | 82.32 | 86.70 | 71.06 | 93.01 | 109.47 |
aspphead [16] | C4 | 81.39 | 86.28 | 69.84 | 89.26 | 91.96 |
asppplushead [38] | C1, C4 | 86.46 | 87.92 | 76.47 | 111.03 | 93.14 |
dahead [19] | C4 | 82.27 | 86.66 | 70.96 | 89.85 | 93.91 |
dnlhead [39] | C4 | 82.34 | 86.67 | 71.07 | 90.24 | 95.54 |
fpnhead [18] | C1, C2, C3, C4 | 86.51 | 87.92 | 76.56 | 179.99 | 97.01 |
fcnhead [6] | C4 | 82.34 | 86.69 | 71.06 | 90.07 | 95.01 |
gchead [20] | C4 | 82.19 | 86.60 | 70.86 | 90.08 | 95.14 |
psahead [40] | C4 | 82.22 | 86.69 | 70.95 | 90.86 | 97.99 |
psphead [17] | C4 | 82.15 | 86.55 | 70.80 | 89.49 | 93.70 |
seghead [29] | C1, C2, C3, C4 | 86.39 | 87.84 | 76.38 | 97.20 | 90.41 |
unethead [14] | C1, C2, C3, C4 | 86.35 | 87.85 | 76.33 | 153.81 | 89.84 |
uperhead [30] | C1, C2, C3, C4 | 86.63 | 88.01 | 76.75 | 299.42 | 121.17 |
mlphead | C1, C2, C3, C4 | 86.38 | 87.90 | 76.37 | 95.22 | 88.89 |
Head | Input Layers | Stride | F1 (%) | OA (%) | mIoU (%) | Flops (G) | Params (M) |
---|---|---|---|---|---|---|---|
aspphead [16] | C1 | 4 | 77.22 | 83.35 | 62.88 | 80.65 | 87.99 |
aspphead [16] | C2 | 8 | 82.82 | 85.57 | 71.21 | 80.64 | 88.18 |
aspphead [16] | C3 | 16 | 85.59 | 87.74 | 75.29 | 80.63 | 88.94 |
aspphead [16] | C4 | 32 | 81.39 | 86.28 | 69.84 | 89.26 | 91.96 |
dahead [19] | C1 | 4 | 79.76 | 83.32 | 66.92 | 90.88 | 88.02 |
dahead [19] | C2 | 8 | 83.83 | 85.92 | 72.51 | 82.41 | 88.30 |
dahead [19] | C3 | 16 | 85.62 | 87.70 | 75.31 | 81.35 | 89.42 |
dahead [19] | C4 | 32 | 82.27 | 86.66 | 70.96 | 89.85 | 93.91 |
Backbone | Head | F1 (%) | OA (%) | mIoU (%) | Flops (G) | Params (M) | Speed (img/s) |
---|---|---|---|---|---|---|---|
Swin-B [10] | uperhead [30] | 86.63 | 88.01 | 76.75 | 299.42 | 121.17 | 18.9 |
Swin-B [10] | mlphead | 86.38 | 87.90 | 76.37 | 95.22 | 88.89 | 29.8 |
Efficient-B | uperhead [30] | 86.23 | 87.62 | 76.09 | 238.72 | 61.87 | 24.7 |
Efficient-B | mlphead | 85.92 | 87.38 | 75.61 | 35.04 | 31.28 | 47.9 |
Method | IEE | EEE | F1 (%) | OA (%) | mIoU (%) | |
---|---|---|---|---|---|---|
ES | EEL | AEF | ||||
Efficient-T | 85.19 | 86.90 | 74.56 | |||
✓ | 85.30 | 87.02 | 74.74 | |||
✓ | 85.29 | 86.92 | 74.69 | |||
✓ | ✓ | 85.50 | 87.25 | 75.04 | ||
✓ | 85.20 | 87.11 | 74.92 | |||
✓ | ✓ | ✓ | 85.57 | 87.24 | 75.15 |
Dataset | Method | Recall (%) | Precision (%) | F1 (%) | OA (%) | mIoU (%) |
---|---|---|---|---|---|---|
Vaihingen | Efficient-T | 84.84 | 85.54 | 85.19 | 86.90 | 74.56 |
Vaihingen | Efficient-T † | 85.43 | 85.71 | 85.57 | 87.24 | 75.15 |
Potsdam | Efficient-T | 89.24 | 88.62 | 88.93 | 88.04 | 80.35 |
Potsdam | Efficient-T † | 90.05 | 89.21 | 89.63 | 88.66 | 81.41 |
Method | Recall (%) | Precision (%) | F1 (%) | OA (%) | mIoU (%) | Flops (G) | Params (M) |
---|---|---|---|---|---|---|---|
FCN [6] | 79.03 | 80.84 | 79.92 | 84.41 | 67.58 | 80.34 | 15.31 |
FPN [18] | 79.65 | 85.06 | 82.26 | 85.60 | 69.95 | 47.00 | 27.71 |
PSPNet [17] | 85.27 | 86.40 | 85.83 | 87.27 | 75.47 | 177.38 | 48.63 |
UNet [14] | 83.12 | 84.16 | 83.64 | 85.95 | 72.33 | 124.12 | 13.40 |
BiseNet_v2 [23] | 77.06 | 80.96 | 78.96 | 83.83 | 66.05 | 12.92 | 3.63 |
DeepLab_v3 [16] | 85.25 | 86.51 | 85.88 | 87.34 | 75.54 | 163.86 | 41.68 |
DANet [19] | 85.12 | 86.29 | 85.70 | 87.20 | 75.29 | 209.76 | 49.49 |
Swin-T [10] | 85.90 | 86.53 | 86.21 | 87.69 | 76.09 | 236.90 | 59.83 |
Swin-S [10] | 86.01 | 86.68 | 86.35 | 87.70 | 76.29 | 260.66 | 81.15 |
Swin-B [10] | 86.50 | 86.77 | 86.63 | 88.01 | 76.75 | 299.42 | 121.17 |
Swin-L [10] | 86.58 | 86.96 | 86.77 | 88.14 | 76.97 | 408.75 | 233.65 |
Efficient-T | 85.43 | 85.71 | 85.57 | 87.24 | 75.15 | 16.78 | 11.30 |
Efficient-S | 85.21 | 86.03 | 85.62 | 87.20 | 75.17 | 20.18 | 14.47 |
Efficient-B | 85.38 | 86.46 | 85.92 | 87.38 | 75.61 | 35.04 | 31.28 |
Efficient-L | | | 86.10 | 87.56 | 75.92 | 56.64 | 52.63 |
Method | Recall (%) | Precision (%) | F1 (%) | OA (%) | mIoU (%) | Flops (G) | Params (M) |
---|---|---|---|---|---|---|---|
FCN [6] | 86.24 | 85.54 | 85.89 | 85.30 | 75.37 | 80.34 | 15.31 |
FPN [18] | 85.67 | 88.06 | 86.85 | 86.16 | 76.38 | 47.00 | 27.71 |
PSPNet [17] | 90.06 | 89.35 | 89.70 | 88.75 | 81.52 | 177.38 | 48.63 |
UNet [14] | 88.86 | 88.18 | 88.52 | 87.13 | 79.52 | 124.12 | 13.40 |
BiseNet_v2 [23] | 82.92 | 81.12 | 82.01 | 81.02 | 69.65 | 12.92 | 3.63 |
DeepLab_v3 [16] | 90.31 | 89.59 | 89.95 | 88.85 | 81.92 | 163.86 | 41.68 |
DANet [19] | 90.29 | 89.65 | 89.97 | 89.00 | 81.96 | 209.76 | 49.49 |
Swin-T [10] | 90.93 | 89.95 | 90.44 | 89.39 | 82.73 | 236.90 | 59.83 |
Swin-S [10] | 90.98 | 90.05 | 90.51 | 89.47 | 82.84 | 260.66 | 81.15 |
Swin-B [10] | 91.10 | 90.06 | 90.58 | 89.46 | 82.95 | 299.42 | 121.17 |
Swin-L [10] | 91.26 | 90.19 | 90.72 | 89.65 | 83.22 | 408.75 | 233.65 |
Efficient-T | 90.05 | 89.21 | 89.63 | 88.66 | 81.41 | 16.78 | 11.30 |
Efficient-S | 90.42 | 89.48 | 89.95 | 89.05 | 81.93 | 20.18 | 14.47 |
Efficient-B | 90.70 | 89.69 | 90.19 | 89.25 | 82.34 | 35.04 | 31.28 |
Efficient-L | 90.89 | 89.92 | 90.40 | 89.44 | 82.68 | 56.64 | 52.63 |
Dataset | Method | Recall (%) | Precision (%) | F1 (%) | OA (%) | mIoU (%) |
---|---|---|---|---|---|---|
Vaihingen | HUSTW5 | 83.32 | 86.20 | 84.50 | 88.60 | / |
Vaihingen | HRCNet_W48 [5] | 86.29 | 86.47 | 86.07 | 88.56 | 74.11 |
Vaihingen | Swin-L [10] | 87.29 | 87.97 | 87.60 | 88.85 | 78.26 |
Vaihingen | Efficient-L | 86.29 | 87.89 | 87.01 | 88.41 | 77.34 |
Potsdam | SWJ_2 [46] | 89.40 | 89.82 | 89.58 | 89.40 | / |
Potsdam | HRCNet_W48 [5] | 90.69 | 89.90 | 90.20 | 89.50 | 81.20 |
Potsdam | Swin-L [10] | 91.38 | 90.61 | 90.94 | 90.02 | 83.60 |
Potsdam | Efficient-L | 91.52 | 90.49 | 90.99 | 90.08 | 83.66 |
Dataset | Method | Imp. Surf. IoU (%) | Building IoU (%) | Low Veg. IoU (%) | Tree IoU (%) | Car IoU (%) | mIoU (%) |
---|---|---|---|---|---|---|---|
Vaihingen | HRCNet_W48 [5] | 81.05 | 86.65 | 66.91 | 76.63 | 59.31 | 74.11 |
Vaihingen | Swin-L [10] | 83.35 | 89.86 | 69.45 | 77.63 | 71.02 | 78.26 |
Vaihingen | Efficient-L | 82.75 | 88.75 | 68.66 | 77.32 | 69.24 | 77.34 |
Potsdam | HRCNet_W48 [5] | 83.58 | 91.15 | 73.07 | 74.88 | 83.32 | 81.20 |
Potsdam | Swin-L [10] | 86.13 | 93.21 | 76.08 | 76.75 | 85.85 | 83.60 |
Potsdam | Efficient-L | 85.82 | 93.09 | 76.38 | 78.02 | 85.01 | 83.66 |
Citation: Xu, Z.; Zhang, W.; Zhang, T.; Yang, Z.; Li, J. Efficient Transformer for Remote Sensing Image Segmentation. Remote Sens. 2021, 13, 3585. https://doi.org/10.3390/rs13183585