A Method Combining Discrete Cosine Transform with Attention for Multi-Temporal Remote Sensing Image Matching
Figure 1. Overall framework of the proposed method.
Figure 2. The details of the feature extraction module. (a) The structure of the blocks. BN is short for Batch Normalization. (b) The structure of the enhanced frequency channel attention (eFCA).
Figure 3. The flowchart of DCT-guided sparse attention. For self-attention, Q, K, and V all come from the feature map to be updated. For cross-attention, Q comes from the feature map to be updated, while K and V come from the other one.
Figure 4. Examples of the datasets [51,52,53].
Figure 5. Box plots of the ablation experiment results. The dots represent outliers, and the orange lines indicate the medians. (a) Box plot of PCK results. (b) Box plot of ACE results.
Figure 6. Line charts showing the proportion of results that met different PCK thresholds. The x-axis represents the PCK threshold, and the y-axis represents the proportion of images in the dataset that satisfied each threshold. (a) Results on the DSIFN dataset. (b) Results on the LEVIR-CD dataset.
Figure 7. Visualization of some results from the DSIFN dataset (the two groups on the left) and the LEVIR-CD dataset (the two groups on the right). Green dots indicate correctly matched keypoints, and red dots indicate incorrectly matched keypoints.
Figure 8. Visualization of the qualitative comparison, with correctly matched keypoints shown in green and incorrectly matched keypoints shown in red.
Abstract
1. Introduction
1. We introduce a novel method for integrating DCT-based frequency channel attention into the CNN backbone, enhancing feature robustness and discrimination in multi-temporal remote sensing scenarios (a minimal illustrative sketch of this style of attention follows this list).
2. We propose a frequency-guided sparse attention mechanism to enhance the coarse-scale features. By narrowing the attention scope, this module suppresses the noise introduced by temporally changed regions and simultaneously reduces the computational complexity.
3. Through comprehensive experiments, we validate that our DCT-integrated approach outperforms existing image matching frameworks in terms of robustness and efficiency on multi-temporal remote sensing datasets.
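To make contribution 1 more concrete, below is a minimal, FcaNet-style frequency channel attention block in PyTorch. It is only a sketch of the general idea of DCT-based channel pooling inside a CNN backbone; the module name, the selected frequencies, the pooling size, and the reduction ratio are illustrative assumptions and do not reproduce the exact eFCA design evaluated in this paper.

```python
# Minimal FcaNet-style frequency channel attention sketch (PyTorch).
# Assumptions: the input spatial size matches `fm_size`, and the number of
# channels is divisible by the number of selected frequencies.
import math
import torch
import torch.nn as nn


def dct2d_basis(h, w, u, v):
    """Return the (u, v)-th 2D DCT-II basis of size h x w (normalization omitted)."""
    ys = torch.arange(h, dtype=torch.float32)
    xs = torch.arange(w, dtype=torch.float32)
    by = torch.cos(math.pi * (ys + 0.5) * u / h)          # (h,)
    bx = torch.cos(math.pi * (xs + 0.5) * v / w)          # (w,)
    return by[:, None] * bx[None, :]                      # (h, w)


class FreqChannelAttention(nn.Module):
    def __init__(self, channels, fm_size=(32, 32),
                 freqs=((0, 0), (0, 1), (1, 0), (1, 1)), reduction=8):
        super().__init__()
        h, w = fm_size
        # One DCT basis per selected frequency; channels are split into groups,
        # each group pooled with its own frequency component (as in FcaNet).
        basis = torch.stack([dct2d_basis(h, w, u, v) for u, v in freqs])  # (m, h, w)
        self.register_buffer("basis", basis)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                                  # x: (B, C, H, W)
        b, c, _, _ = x.shape
        m = self.basis.shape[0]
        group = c // m
        pooled = []
        for i in range(m):
            xi = x[:, i * group:(i + 1) * group]           # (B, C/m, H, W)
            pooled.append((xi * self.basis[i]).sum(dim=(2, 3)))
        z = torch.cat(pooled, dim=1)                       # (B, C) frequency-pooled descriptor
        w_ch = self.fc(z).view(b, c, 1, 1)                 # per-channel weights in (0, 1)
        return x * w_ch


# usage: att = FreqChannelAttention(64, fm_size=(32, 32)); y = att(torch.randn(2, 64, 32, 32))
```

In this sketch, each channel group is pooled with one fixed DCT basis instead of global average pooling, so the channel descriptor retains some spatial-frequency information before the usual squeeze-and-excitation style reweighting.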
2. Related Works
2.1. Image Matching Methods for Multi-Temporal Remote Sensing Images
2.2. Deep Learning-Based Image Matching Methods for General Tasks
2.3. DCT in Deep Learning
3. Method
3.1. DCT-Enhanced Multi-Scale Feature Extraction
3.2. DCT-Guided Coarse-Scale Matching
1. Vanilla attention:
   - Each query pixel attends to all N = H × W pixels of the coarse feature map.
   - Feature dimension: d.
   - Computational complexity: O(N²d).
2. Our DCT-guided sparse attention (DSA):
   - Each query pixel attends to k windows, each containing w × w pixels.
   - Total number of windows: N/w².
   - Selected windows per query: k.
   - Pixels per window: w².
   - Feature dimension: d.
   - Number of DCT bases (constant): m.
   - Computational complexity:
     - DCT compression: O(Nmd).
     - Window similarity calculation: O((N/w²)²md).
     - Attention calculation: O(Nkw²d).
     - Total: O(Nmd + (N/w²)²md + Nkw²d), which is much lower than the O(N²d) of vanilla attention because m, k, and w² are small constants compared with N.

Here, N = H × W is the number of pixels in the coarse feature map, w × w is the window size, k is the number of windows selected per query, and m is the number of retained DCT bases.
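To make the comparison above concrete, the short calculation below instantiates the operation counts with hypothetical values; the coarse-map resolution, window size, number of selected windows, and number of DCT bases used in the paper are not restated here, so the numbers only illustrate the scaling behaviour.

```python
# Back-of-the-envelope operation counts for vanilla attention vs. DCT-guided
# sparse attention, following the symbols in the list above. The concrete
# values below are hypothetical placeholders, not the paper's settings.
H = W = 80          # coarse feature map size (e.g., 1/8 of a 640x640 image)
d = 256             # feature dimension
w = 8               # window side length
k = 4               # windows attended per query
m = 16              # DCT bases kept per window

N = H * W                                   # number of query/key pixels
n_win = N // (w * w)                        # total number of windows

vanilla = N * N * d                         # every pixel attends to every pixel
dct_compress = N * m * d                    # project each window onto m DCT bases
win_similarity = n_win * n_win * m * d      # window-to-window scores on compressed descriptors
sparse_attn = N * (k * w * w) * d           # each pixel attends to k windows of w*w pixels
ours = dct_compress + win_similarity + sparse_attn

print(f"vanilla : {vanilla:.3e} mult-adds")
print(f"sparse  : {ours:.3e} mult-adds  ({vanilla / ours:.1f}x fewer)")
```

With these placeholder settings, the sparse scheme needs roughly twenty times fewer multiply-adds than vanilla attention, and the gap widens on larger feature maps because the dominant term grows linearly in N for fixed k and w.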
3.3. Fine-Scale Matching
4. Experimental Results and Discussion
4.1. Experimental Setup
4.2. Datasets
4.3. Matching Experiment Evaluation Metrics
1. The success rate (SR) is calculated by
   SR = N_success / N_total × 100%,
   where N_success is the number of image pairs for which a valid transformation can be estimated from the obtained correspondences and N_total is the total number of image pairs.
2. The percentage of correct keypoints (PCK) is calculated by
   PCK = (1/N) Σ_i 1(‖p̂_i − p_i‖ < ε) × 100%,
   where N is the number of matched keypoints, p̂_i is the predicted position of the i-th keypoint, p_i is its ground-truth position, 1(·) is the indicator function, and ε is the pixel error threshold.
3. The average corner error (ACE) is calculated by
   ACE = (1/4) Σ_{j=1}^{4} ‖T_est(c_j) − T_gt(c_j)‖₂,
   where c_j is the j-th corner of the image, and T_est and T_gt are the estimated and ground-truth transformations, respectively.
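For reference, the sketch below shows one plausible way to compute the three metrics with NumPy, assuming a ground-truth homography is available for each image pair. The pixel threshold in `pck` and the acceptance threshold in `success_rate` are placeholders rather than the thresholds used in the experiments.

```python
# Sketch of SR / PCK / ACE computation under the assumptions stated above.
import numpy as np


def pck(pred_pts, gt_pts, thresh=3.0):
    """Percentage of keypoints whose predicted position lies within `thresh` pixels of ground truth."""
    err = np.linalg.norm(pred_pts - gt_pts, axis=1)
    return 100.0 * np.mean(err < thresh)


def ace(H_est, H_gt, height, width):
    """Average corner error: mean distance between the four image corners
    warped by the estimated and by the ground-truth homographies."""
    corners = np.array([[0, 0], [width - 1, 0],
                        [width - 1, height - 1], [0, height - 1]], dtype=np.float64)
    pts = np.hstack([corners, np.ones((4, 1))])           # homogeneous coordinates

    def warp(H):
        q = (H @ pts.T).T
        return q[:, :2] / q[:, 2:3]

    return float(np.mean(np.linalg.norm(warp(H_est) - warp(H_gt), axis=1)))


def success_rate(ace_values, max_ace=20.0):
    """Share of image pairs whose estimated transform is acceptable (here: ACE below a threshold)."""
    ace_values = np.asarray(ace_values)
    return 100.0 * np.mean(ace_values < max_ace)
```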
4.4. Ablation Study and Discussion
4.5. Comparative Experiments and Discussion
4.5.1. Quantitative Comparison
4.5.2. Qualitative Comparison
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Harris, C.; Stephens, M. A combined corner and edge detector. In Proceedings of the Alvey Vision Conference, Manchester, UK, 31 August–2 September 1988; pp. 147–151. [Google Scholar]
- Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
- Sarlin, P.E.; DeTone, D.; Malisiewicz, T.; Rabinovich, A. Superglue: Learning feature matching with graph neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 4938–4947. [Google Scholar]
- Sun, J.; Shen, Z.; Wang, Y.; Bao, H.; Zhou, X. LoFTR: Detector-free local feature matching with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8922–8931. [Google Scholar]
- Turkar, V.; Deo, R.; Rao, Y.S.; Mohan, S.; Das, A. Classification Accuracy of Multi-Frequency and Multi-Polarization SAR Images for Various Land Covers. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2012, 5, 936–941. [Google Scholar] [CrossRef]
- Tondewad, M.P.S.; Dale, M.M.P. Remote Sensing Image Registration Methodology: Review and Discussion. Procedia Comput. Sci. 2020, 171, 2390–2399. [Google Scholar] [CrossRef]
- Xinghua, L.; Wenhao, A.; Ruitao, F.; Shaojie, L. Survey of remote sensing image registration based on deep learning. Natl. Remote Sens. Bull. 2023, 27, 267–284. [Google Scholar]
- Zhu, B.; Zhou, L.; Pu, S.; Fan, J.; Ye, Y. Advances and challenges in multimodal remote sensing image registration. IEEE J. Miniaturization Air Space Syst. 2023, 4, 165–174. [Google Scholar] [CrossRef]
- Fu, Z.; Zhang, J.; Tang, B.H. Multi-Temporal Snow-Covered Remote Sensing Image Matching via Image Transformation and Multi-Level Feature Extraction. Optics 2024, 5, 392–405. [Google Scholar] [CrossRef]
- Bay, H.; Tuytelaars, T.; Van Gool, L. Surf: Speeded up robust features. In Proceedings of the 2006 European Conference on Computer Vision (ECCV), Graz, Austria, 7–13 May 2006; pp. 404–417. [Google Scholar]
- Rublee, E.; Rabaud, V.; Konolige, K.; Bradski, G. ORB: An efficient alternative to SIFT or SURF. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 2564–2571. [Google Scholar]
- Li, J.; Hu, Q.; Ai, M. RIFT: Multi-modal image matching based on radiation-variation insensitive feature transform. IEEE Trans. Image Process. 2019, 29, 3296–3310. [Google Scholar] [CrossRef] [PubMed]
- Rasmy, L.; Sebari, I.; Ettarid, M. Automatic sub-pixel co-registration of remote sensing images using phase correlation and Harris detector. Remote Sens. 2021, 13, 2314. [Google Scholar] [CrossRef]
- Fan, Z.; Wang, M.; Pi, Y.; Liu, Y.; Jiang, H. A robust oriented filter-based matching method for multisource, multitemporal remote sensing images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4703316. [Google Scholar] [CrossRef]
- Yang, Z.; Dan, T.; Yang, Y. Multi-temporal remote sensing image registration using deep convolutional features. IEEE Access 2018, 6, 38544–38555. [Google Scholar] [CrossRef]
- Liu, J.; Li, Y.; Chen, Y. Multi-temporal remote sensing image registration based on siamese network. In Proceedings of the 2021 International Conference on Computer Engineering and Application (ICCEA), Kunming, China, 25–27 June 2021; pp. 333–337. [Google Scholar]
- Rocco, I.; Cimpoi, M.; Arandjelović, R.; Torii, A.; Pajdla, T.; Sivic, J. Neighbourhood consensus networks. Adv. Neural Inf. Process. Syst. 2018, 31, 1658–1669. [Google Scholar]
- Li, X.; Han, K.; Li, S.; Prisacariu, V. Dual-resolution correspondence networks. Adv. Neural Inf. Process. Syst. 2020, 33, 17346–17357. [Google Scholar]
- Zhou, Q.; Sattler, T.; Leal-Taixe, L. Patch2pix: Epipolar-guided pixel-level correspondences. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 4669–4678. [Google Scholar]
- Xu, Y.; Li, J.; Du, C.; Chen, H. Nbr-net: A nonrigid bidirectional registration network for multitemporal remote sensing images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5620715. [Google Scholar] [CrossRef]
- Zhang, J.; Zhao, S.; Li, B. Selective Context Network With Neighbourhood Consensus for Aerial Image Registration. In Proceedings of the 2022 6th International Conference on Computer Science and Artificial Intelligence, Beijing, China, 9–11 December 2022; pp. 258–264. [Google Scholar]
- Han, X.; Leung, T.; Jia, Y.; Sukthankar, R.; Berg, A.C. Matchnet: Unifying feature and metric learning for patch-based matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3279–3286. [Google Scholar]
- Yi, K.M.; Trulls, E.; Lepetit, V.; Fua, P. Lift: Learned invariant feature transform. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part VI 14. Springer: Berlin/Heidelberg, Germany, 2016; pp. 467–483. [Google Scholar]
- Ono, Y.; Trulls, E.; Fua, P.; Yi, K.M. LF-Net: Learning local features from images. Adv. Neural Inf. Process. Syst. 2018, 31, 6237–6247. [Google Scholar]
- Georgakis, G.; Karanam, S.; Wu, Z.; Ernst, J.; Košecká, J. End-to-end learning of keypoint detector and descriptor for pose invariant 3D matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1965–1973. [Google Scholar]
- DeTone, D.; Malisiewicz, T.; Rabinovich, A. SuperPoint: Self-Supervised Interest Point Detection and Description. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Salt Lake City, UT, USA, 18–22 June 2018; pp. 224–236. [Google Scholar] [CrossRef]
- Dusmanu, M.; Rocco, I.; Pajdla, T.; Pollefeys, M.; Sivic, J.; Torii, A.; Sattler, T. D2-net: A trainable cnn for joint description and detection of local features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 8092–8101. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
- Dosovitskiy, A. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
- Chen, H.; Luo, Z.; Zhou, L.; Tian, Y.; Zhen, M.; Fang, T.; Mckinnon, D.; Tsin, Y.; Quan, L. Aspanformer: Detector-free image matching with adaptive span transformer. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 20–36. [Google Scholar]
- Tan, D.; Liu, J.J.; Chen, X.; Chen, C.; Zhang, R.; Shen, Y.; Ding, S.; Ji, R. Eco-tr: Efficient correspondences finding via coarse-to-fine refinement. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 317–334. [Google Scholar]
- Wang, Q.; Zhang, J.; Yang, K.; Peng, K.; Stiefelhagen, R. Matchformer: Interleaving attention in transformers for feature matching. In Proceedings of the Asian Conference on Computer Vision, Macau SAR, China, 4–8 December 2022; pp. 2746–2762. [Google Scholar]
- Rubel, O.; Rubel, A.; Lukin, V.; Egiazarian, K. Blind DCT-based prediction of image denoising efficiency using neural networks. In Proceedings of the 2018 7th European Workshop on Visual Information Processing (EUVIP), Tampere, Finland, 26–28 November 2018; pp. 1–6. [Google Scholar]
- Herbreteau, S.; Kervrann, C. DCT2net: An interpretable shallow CNN for image denoising. IEEE Trans. Image Process. 2022, 31, 4292–4305. [Google Scholar] [CrossRef]
- Karaoğlu, H.H.; Ekşioğlu, E.M. Revisiting DCT in Deep Learning Era: An Initial Denoising Application. In Proceedings of the 2024 32nd Signal Processing and Communications Applications Conference (SIU), Mersin, Turkiye, 15–18 May 2024; pp. 1–4. [Google Scholar]
- Raid, A.; Khedr, W.; El-Dosuky, M.A.; Ahmed, W. Jpeg image compression using discrete cosine transform—A survey. arXiv 2014, arXiv:1405.6147. [Google Scholar]
- Xue, J.; Yin, L.; Lan, Z.; Long, M.; Li, G.; Wang, Z.; Xie, X. 3D DCT based image compression method for the medical endoscopic application. Sensors 2021, 21, 1817. [Google Scholar] [CrossRef] [PubMed]
- Peng, Y.; Fu, C.; Cao, G.; Song, W.; Chen, J.; Sham, C.W. JPEG-compatible Joint Image Compression and Encryption Algorithm with File Size Preservation. ACM Trans. Multimed. Comput. Commun. Appl. 2024, 20, 1–20. [Google Scholar] [CrossRef]
- Duan, C.; Hu, B.; Liu, W.; Ma, T.; Ma, Q.; Wang, H. Infrared small target detection method based on frequency domain clutter suppression and spatial feature extraction. IEEE Access 2023, 11, 85549–85560. [Google Scholar] [CrossRef]
- Xu, Y.; Nakayama, H. Dct-based fast spectral convolution for deep convolutional neural networks. In Proceedings of the 2021 International Joint Conference on Neural Networks (IJCNN), Shenzhen, China, 18–22 July 2021; pp. 1–8. [Google Scholar]
- Xu, R.; Kang, X.; Li, C.; Chen, H.; Ming, A. DCT-FANet: DCT based frequency attention network for single image super-resolution. Displays 2022, 74, 102220. [Google Scholar] [CrossRef]
- Ghosh, A.; Chellappa, R. Deep feature extraction in the DCT domain. In Proceedings of the 2016 23rd International Conference on Pattern Recognition (ICPR), Cancun, Mexico, 4–8 December 2016; pp. 3536–3541. [Google Scholar]
- Qin, Z.; Zhang, P.; Wu, F.; Li, X. Fcanet: Frequency channel attention networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 783–792. [Google Scholar]
- Chaudhury, S.; Yamasaki, T. Adversarial Robustness of Convolutional Models Learned in the Frequency Domain. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 7455–7459. [Google Scholar]
- Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
- Minaee, S.; Boykov, Y.; Porikli, F.; Plaza, A.; Kehtarnavaz, N.; Terzopoulos, D. Image Segmentation Using Deep Learning: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 3523–3542. [Google Scholar] [CrossRef]
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229. [Google Scholar] [CrossRef]
- Hu, Y.; Liu, Y.; Hui, B. Combining OpenStreetMap with Satellite Imagery to Enhance Cross-View Geo-Localization. Sensors 2025, 25, 44. [Google Scholar] [CrossRef] [PubMed]
- Quan, D.; Wang, S.; Li, Y.; Yang, B.; Huyan, N.; Chanussot, J.; Hou, B.; Jiao, L. Multi-relation attention network for image patch matching. IEEE Trans. Image Process. 2021, 30, 7127–7142. [Google Scholar] [CrossRef] [PubMed]
- Liu, M.; Zhou, G.; Ma, L.; Li, L.; Mei, Q. SIFNet: A self-attention interaction fusion network for multisource satellite imagery template matching. Int. J. Appl. Earth Obs. Geoinf. 2023, 118, 103247. [Google Scholar] [CrossRef]
- Chen, H.; Shi, Z. A Spatial-Temporal Attention-Based Method and a New Dataset for Remote Sensing Image Change Detection. Remote Sens. 2020, 12, 1662. [Google Scholar] [CrossRef]
- Zhang, C.; Yue, P.; Tapete, D.; Jiang, L.; Shangguan, B.; Huang, L.; Liu, G. A deeply supervised image fusion network for change detection in high resolution bi-temporal remote sensing images. ISPRS J. Photogramm. Remote Sens. 2020, 166, 183–200. [Google Scholar] [CrossRef]
- Ji, S.; Wei, S.; Lu, M. Fully Convolutional Networks for Multisource Building Extraction From an Open Aerial and Satellite Imagery Data Set. IEEE Trans. Geosci. Remote Sens. 2019, 57, 574–586. [Google Scholar] [CrossRef]
- Fischler, M.A.; Bolles, R.C. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 1981, 24, 381–395. [Google Scholar] [CrossRef]
- Edstedt, J.; Bökman, G.; Wadenbäck, M.; Felsberg, M. DeDoDe: Detect, Don’t Describe—Describe, Don’t Detect for Local Feature Matching. In Proceedings of the 2024 International Conference on 3D Vision (3DV), Davos, Switzerland, 18–21 March 2024. [Google Scholar]

Ablation of the proposed modules (eFCA and DSA):

No | Approach | Params | SR (%) ↑ | PCK > 80 (%) ↑ | ACE (pix) ↓ |
---|---|---|---|---|---|
1 | Baseline | 4.09 M | 89.06 | 83.33 | 6.92 |
2 | Baseline + eFCA | 4.10 M | 93.75 | 89.94 | 5.25 |
3 | Baseline + DSA | 5.69 M | 96.09 | 85.76 | 4.08 |
4 | Baseline + eFCA + DSA (ours) | 5.70 M | 98.44 | 94.44 | 2.23 |

Comparison between the enhanced FCA (eFCA) and the original FCA within the full model:

No | Approach | Params | SR (%) ↑ | PCK > 80 (%) ↑ | ACE (pix) ↓ |
---|---|---|---|---|---|
4 | Baseline + eFCA + DSA (ours) | 5.70 M | 98.44 | 94.44 | 2.23 |
5 | Baseline + FCA + DSA | 5.70 M | 96.86 | 90.58 | 3.73 |

Comparison between the proposed DSA and vanilla attention (RT denotes runtime):

No | Approach | Params | SR (%) ↑ | PCK > 80 (%) ↑ | ACE (pix) ↓ | RT (ms) ↓ |
---|---|---|---|---|---|---|
4 | Baseline + eFCA + DSA (ours) | 5.70 M | 98.44 | 94.44 | 2.23 | 66.39 |
6 | Baseline + eFCA + VanillaAttn | 8.05 M | 99.21 | 92.56 | 2.66 | 108.24 |

Quantitative comparison on the DSIFN dataset:

Approach | Params | SR (%) ↑ | PCK (%) ↑ | ACE (pix) ↓ | RT (ms) ↓ |
---|---|---|---|---|---|
RIFT | - | 87.50 | 72.03 | 13.41 | - |
OFM | - | 70.83 | 62.03 | 10.62 | - |
Super | 13.32 M | 89.58 | 58.05 | 11.14 | 91.43 |
LoFTR | 11.56 M | 97.92 | 76.73 | 6.73 | 193.05 |
DeDoDe | 13.52 M | 85.42 | 64.87 | 10.67 | 155.72 |
Ours | 5.70 M | 100 | 81.92 | 4.27 | 70.90 |

Quantitative comparison on the LEVIR-CD dataset:

Approach | Params | SR (%) ↑ | PCK (%) ↑ | ACE (pix) ↓ | RT (ms) ↓ |
---|---|---|---|---|---|
RIFT | - | 90.63 | 70.47 | 10.48 | - |
OFM | - | 92.18 | 83.72 | 5.92 | - |
Super | 13.32 M | 98.44 | 75.95 | 5.78 | 172.43 |
LoFTR | 11.56 M | 98.44 | 90.41 | 2.04 | 656.35 |
DeDoDe | 13.52 M | 89.06 | 72.36 | 8.27 | 406.30 |
Ours | 5.70 M | 99.22 | 88.48 | 2.98 | 198.39 |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).