IRSTFormer: A Hierarchical Vision Transformer for Infrared Small Target Detection
Figure 1. Infrared small targets in the developed IRST640 dataset in sky (a) and building (b) scenes. The green and red boxes mark the target and the false alarm, respectively.
Figure 2. The architecture of the proposed IRSTFormer, which consists of a four-stage encoder, the hierarchical overlapped small patch transformer (HOSPT), a progressive decoder with top-down feature aggregation modules (TFAM), and the combined BCE and softIoU (CBS) loss.
Figure 3. The architecture of each stage in the hierarchical overlapped small patch transformer (HOSPT). The overlapped small patch embedding (OSPE) divides the input feature map into patches and applies a linear projection to obtain the two-dimensional feature embedding. In the self-attention layer, attention features are calculated in dot-product form. A 3 × 3 convolution layer is used in the feed-forward network (FFN). Layer normalization (LN) is used to normalize the features.
Figure 4. The architecture of the top-down feature aggregation module (TFAM). It mainly consists of a multilayer perceptron (MLP) and a channel-attention block.
Figure 5. The gradient of the softIoU loss for negative background samples (t = 0) and positive target samples (t = 1).
Figure 6. The output x of the network at the first and the middle epoch of training.
Figure 7. The segmentation masks of 14 methods for the same images of cloud (a), ground (b)–(d), and noisy-pixel (e) scenes. The green and red circles mark the detected target and the false alarm, respectively.
Figure 8. The 3D segmentation masks of 14 methods for the same images of cloud (a), ground (b)–(d), and noisy-pixel (e) scenes.
Abstract
1. Introduction
- A hierarchical vision transformer is proposed to detect infrared small targets, which overcomes the intrinsic shortcomings of existing methods;
- A simple yet effective combination of existing loss functions is used to improve network convergence;
- Experiments on the public SIRST dataset and our developed IRST640 dataset demonstrate the superiority of our method over other state-of-the-art methods.
2. Related Work
2.1. Detection-Based Infrared Small Target Detection
2.2. Segmentation-Based Infrared Small Target Detection
2.3. Attention Mechanism
2.4. Transformer for Computer Vision
3. Method
3.1. Network Architecture
3.2. Hierarchical Overlapped Small Patch Transformer
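The HOSPT figure caption states that the self-attention layer computes attention features in dot-product form. The following is a minimal sketch of that computation on a handful of tokens, not the authors' implementation. It assumes the usual 1/sqrt(d) scaling from standard transformer practice and omits the learned query, key, and value projections; the function names are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def dot_product_attention(q, k, v):
    """q, k: lists of vectors of length d; v: list of vectors of length d_v.

    Returns one output vector per query: a softmax-weighted average of
    the value vectors, weighted by scaled query-key dot products.
    """
    d = len(q[0])
    scale = 1.0 / math.sqrt(d)  # assumption: standard scaled dot-product
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        weights = softmax(scores)
        out.append([sum(w * vj[c] for w, vj in zip(weights, v))
                    for c in range(len(v[0]))])
    return out

# Three toy tokens of dimension 2, used as queries, keys, and values.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = dot_product_attention(tokens, tokens, tokens)
```

Each output row is a convex combination of the value vectors, so every token can draw on every other token regardless of spatial distance.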
3.3. Top-Down Feature Aggregation Module
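The TFAM figure caption names two ingredients, an MLP and a channel-attention block, but the text here does not give their exact wiring. As an illustration only, the sketch below shows a squeeze-and-excitation style channel attention, which is one common form of such a block; whether TFAM uses exactly this form is an assumption. The weights `w1` and `w2` are hypothetical stand-ins for learned parameters.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def channel_attention(feature, w1, w2):
    """feature: list of C channels, each an H x W nested list.
    w1: hidden x C weights, w2: C x hidden weights (hypothetical values).
    Returns the rescaled feature map and the per-channel gates.
    """
    # Squeeze: global average pooling gives one scalar per channel.
    pooled = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
              for ch in feature]
    # Excite: a two-layer MLP with ReLU, then a sigmoid gate per channel.
    hidden = [max(0.0, sum(w * p for w, p in zip(ws, pooled))) for ws in w1]
    gates = [sigmoid(sum(w * h for w, h in zip(ws, hidden))) for ws in w2]
    # Scale every pixel of a channel by that channel's gate.
    scaled = [[[g * val for val in row] for row in ch]
              for g, ch in zip(gates, feature)]
    return scaled, gates

# Two 2 x 2 channels: one active, one silent.
feature = [[[1.0, 1.0], [1.0, 1.0]],
           [[0.0, 0.0], [0.0, 0.0]]]
w1 = [[1.0, -1.0]]    # hidden size 1
w2 = [[2.0], [-2.0]]  # one gate per channel
scaled, gates = channel_attention(feature, w1, w2)
```

The gates lie in (0, 1), so the block can only reweight channels relative to one another; it does not change the spatial layout of the features.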
3.4. Loss Function
4. Results
4.1. Experimental Setting
4.1.1. Dataset
4.1.2. Evaluation Metrics
4.1.3. Implementation Details
4.2. Comparison to the State-of-the-Art Methods
4.2.1. Qualitative Results
4.2.2. Quantitative Results
5. Discussion
5.1. Ablation Study
5.2. Different Parameters in the OSPE
5.3. Different Forms of Combination of the BCE and SoftIoU Loss
5.4. Different Training Sets
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest

Method | IRST640 | | | | SIRST | | | |
---|---|---|---|---|---|---|---|---|
| | | | | () | | | | () |
Tophat | 0.0836 | 0.144 | 0.324 | 2000 | 0.425 | 0.567 | 0.587 | 2000 |
Max-median | 0.0279 | 0.0516 | 0.396 | 3000 | 0.253 | 0.382 | 0.463 | 412.5 |
TLLCM | 0.241 | 0.383 | 0.450 | 71.06 | 0.283 | 0.411 | 0.424 | 203.1 |
RLCM | 0.464 | 0.616 | 0.896 | 228.3 | 0.339 | 0.470 | 0.684 | 2000 |
AAGD | 0.0648 | 0.116 | 0.216 | 135.5 | 0.165 | 0.271 | 0.348 | 157.0 |
PSTNN | 0.286 | 0.398 | 0.671 | 1000 | 0.626 | 0.726 | 0.785 | 1000 |
ResNetFPN | 0.806 | 0.884 | 0.985 | 67.51 | 0.664 | 0.763 | 0.872 | 582.4 |
ALCNet | 0.829 | 0.903 | 0.988 | 6.171 | 0.684 | 0.785 | 0.881 | 248.8 |
DNANet | 0.808 | 0.890 | 0.976 | 2.047 | 0.645 | 0.747 | 0.844 | 773.1 |
AGPCNet | 0.761 | 0.852 | 0.985 | 145.2 | 0.674 | 0.778 | 0.908 | 557.6 |
LSPM | 0.838 | 0.908 | 0.988 | 2.618 | 0.723 | 0.825 | 0.936 | 303.0 |
MDvsFA | 0.368 | 0.518 | 0.918 | 1000 | 0.332 | 0.472 | 0.906 | 9000 |
Segformer | 0.761 | 0.852 | 0.979 | 189.6 | 0.664 | 0.767 | 0.899 | 358.4 |
IRSTFormer | 0.856 | 0.920 | 0.988 | 1.496 | 0.758 | 0.859 | 0.991 | 57.66 |

Method | Complexity (GMac) | Parameters | Speed (FPS) |
---|---|---|---|
ResNetFPN | 15.28 | 374.56 K | 119 |
ALCNet | 14.52 | 384.79 K | 53 |
DNANet | 56.41 | 4.70 M | 11 |
AGPCNet | 172.54 | 12.36 M | 7 |
LSPM | 246.25 | 31.58 M | 30 |
MDvsFA | 988.44 | 3.77 M | 5 |
Segformer | 6.74 | 3.71 M | 65 |
IRSTFormer | 111.06 | 4.82 M | 22 |

| HOSPT | TFAM | CBS Loss | IRST640 | | | |
|---|---|---|---|---|---|---|
| | | | | | | () |
| | | | 0.761 | 0.852 | 0.979 | 189.6 |
| ✓ | | | 0.798 | 0.882 | 0.988 | 27.30 |
| ✓ | ✓ | | 0.828 | 0.902 | 0.991 | 16.08 |
| ✓ | ✓ | ✓ | 0.856 | 0.920 | 0.991 | 1.496 |

| HOSPT | TFAM | CBS Loss | SIRST | | | |
|---|---|---|---|---|---|---|
| | | | | | | () |
| | | | 0.664 | 0.767 | 0.899 | 358.4 |
| ✓ | | | 0.728 | 0.836 | 0.972 | 571.3 |
| ✓ | ✓ | | 0.731 | 0.836 | 0.963 | 270.6 |
| ✓ | ✓ | ✓ | 0.743 | 0.845 | 0.982 | 124.6 |

Patch Size | Stride | IRST640 | | | |
---|---|---|---|---|---|
| | | | | | () |
7 | 2 | 0.798 | 0.879 | 0.979 | 28.05 |
5 | 2 | 0.796 | 0.877 | 0.985 | 41.51 |
3 | 2 | 0.798 | 0.882 | 0.988 | 27.30 |
2 | 2 | 0.766 | 0.860 | 0.994 | 70.87 |

Patch Size | Stride | SIRST | | | |
---|---|---|---|---|---|
| | | | | | () |
7 | 2 | 0.670 | 0.777 | 0.927 | 320.7 |
5 | 2 | 0.703 | 0.811 | 0.972 | 255.5 |
3 | 2 | 0.728 | 0.836 | 0.972 | 571.3 |
2 | 2 | 0.700 | 0.807 | 0.936 | 749.2 |
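
The patch sizes and stride compared in the two OSPE tables above determine how much neighbouring patches overlap and how large the embedded feature map is. The short sketch below works through that arithmetic. The input size of 512 and the padding rule `(patch - 1) // 2` are assumptions chosen for illustration and are not taken from the tables; `ospe_geometry` is a hypothetical helper name.

```python
def ospe_geometry(size, patch, stride, padding=None):
    """Return (output_size, overlap) along one spatial axis.

    size: input height or width in pixels (hypothetical value below).
    patch, stride: patch size and stride, as listed in the tables.
    padding: assumed to be (patch - 1) // 2 when not given.
    """
    if padding is None:
        padding = (patch - 1) // 2  # assumption, not stated in the source
    out = (size + 2 * padding - patch) // stride + 1
    overlap = max(patch - stride, 0)  # pixels shared by adjacent patches
    return out, overlap

# The four settings from the tables, all with stride 2.
settings = {patch: ospe_geometry(512, patch, 2) for patch in (7, 5, 3, 2)}
# settings[3] == (256, 1): a 3-pixel patch shares 1 pixel with its neighbour
# settings[2] == (256, 0): a 2-pixel patch with stride 2 does not overlap
```

Under these assumptions every setting halves the resolution, so the rows of the tables differ only in how many pixels adjacent patches share (5, 3, 1, or 0).
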

Form | | | IRST640 | | | |
---|---|---|---|---|---|---|
| | | | | | | () |
BCE | - | - | 0.828 | 0.902 | 0.991 | 16.08 |
softIoU | - | - | 0 | 0 | 0 | 0 |
WA | 1 | - | 0.837 | 0.908 | 0.991 | 15.33 |
WA | 10 | - | 0.845 | 0.912 | 0.991 | 22.81 |
WA | 100 | - | 0.849 | 0.916 | 0.991 | 3.927 |
NL | 1 | 1 | 0.823 | 0.898 | 0.985 | 38.15 |
NL | 1 | 10 | 0.850 | 0.917 | 0.991 | 23.46 |
NL | 1 | 100 | 0.856 | 0.920 | 0.991 | 1.496 |

Form | | | SIRST | | | |
---|---|---|---|---|---|---|
| | | | | | | () |
BCE | - | - | 0.731 | 0.836 | 0.963 | 270.6 |
softIoU | - | - | 0 | 0 | 0 | 0 |
WA | 1 | - | 0.671 | 0.777 | 0.963 | 834.4 |
WA | 10 | - | 0.698 | 0.801 | 0.945 | 407.6 |
WA | 100 | - | 0.727 | 0.826 | 0.972 | 158.4 |
NL | 1 | 1 | 0.634 | 0.735 | 0.853 | 359.7 |
NL | 1 | 10 | 0.702 | 0.803 | 0.908 | 236.0 |
NL | 1 | 100 | 0.743 | 0.845 | 0.982 | 124.6 |
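
The loss tables above compare BCE, softIoU, and combinations of the two. As a rough sketch of the two ingredients, the code below computes a mean BCE loss and a softIoU loss on network logits. The tables do not define the WA and NL forms or their parameters, so the combination shown is a plain weighted sum with a hypothetical weight `lam`, chosen only for illustration; it should not be read as the paper's CBS loss.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce_loss(logits, targets, eps=1e-7):
    """Mean binary cross-entropy over pixels."""
    total = 0.0
    for x, t in zip(logits, targets):
        p = min(max(sigmoid(x), eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(logits)

def soft_iou_loss(logits, targets, eps=1e-7):
    """1 - soft IoU, with probabilities in place of hard masks."""
    probs = [sigmoid(x) for x in logits]
    inter = sum(p * t for p, t in zip(probs, targets))
    union = sum(p + t - p * t for p, t in zip(probs, targets))
    return 1.0 - inter / (union + eps)

def combined_loss(logits, targets, lam=1.0):
    """Illustrative weighted sum; lam is a hypothetical weight."""
    return bce_loss(logits, targets) + lam * soft_iou_loss(logits, targets)

# Flattened toy mask: two target pixels, two background pixels.
targets = [1, 0, 0, 1]
good = combined_loss([10.0, -10.0, -10.0, 10.0], targets)
bad = combined_loss([-10.0, 10.0, 10.0, -10.0], targets)
```

A confident, correct prediction drives both terms toward zero, while a confident, wrong one pushes softIoU toward 1 and BCE well above it.
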

Training | Test: IRST640 | | | | Test: SIRST | | | |
---|---|---|---|---|---|---|---|---|
| | | | | () | | | | () |
IRST640 | 0.858 | 0.921 | 0.991 | 0.3722 | 0.393 | 0.470 | 0.596 | 191.6 |
SIRST | 0.606 | 0.731 | 0.951 | 70.99 | 0.744 | 0.841 | 0.963 | 275.9 |
Mixed | 0.856 | 0.920 | 0.988 | 1.496 | 0.758 | 0.859 | 0.991 | 57.66 |

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Chen, G.; Wang, W.; Tan, S. IRSTFormer: A Hierarchical Vision Transformer for Infrared Small Target Detection. Remote Sens. 2022, 14, 3258. https://doi.org/10.3390/rs14143258