Object Detection and Information Extraction Based on Remote Sensing Imagery

A special issue of Remote Sensing (ISSN 2072-4292). This special issue belongs to the section "Remote Sensing Image Processing".

Deadline for manuscript submissions: closed (31 May 2024) | Viewed by 42269

Special Issue Editors


Guest Editor
Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education of China, Xidian University, Xi’an 710071, China
Interests: deep learning; object detection and tracking; reinforcement learning; hyperspectral image processing

Guest Editor
State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan 430079, China
Interests: mathematical models for visual information; graph matching problem and its applications; computer vision and machine learning; large-scale 3D reconstruction of visual scenes; information processing, fusion, and scene understanding in unmanned intelligent systems; interpretation and information mining of remote sensing images

Guest Editor
Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education of China, Xidian University, Xi’an 710071, China
Interests: remote sensing image processing; hyperspectral remote sensing; deep learning in remote sensing; change detection in remote sensing; remote sensing applications in urban planning; geospatial data analysis and modeling; SAR remote sensing

Guest Editor
School of Automation, Northwestern Polytechnical University, Xi’an 710072, China
Interests: computer vision; pattern recognition; image processing; machine learning; deep learning; object detection and tracking; video analysis; remote sensing applications

Guest Editor
1. International AI Future Lab on AI4EO, TUM, Munich, Germany
2. Visual Learning and Reasoning Team, Department EO Data Science, DLR-IMF, Oberpfaffenhofen, Germany
Interests: natural language and earth observation; UAV video understanding; 3D structure inference from monocular optical/SAR imagery; recognition in remote sensing imagery

Special Issue Information

Dear Colleagues,

Remote sensing technology has become a fundamental means by which humans observe the Earth, and it has driven progress in many application fields, such as environmental surveillance, disaster monitoring, ocean situational awareness, traffic management, and modern military operations. However, the intelligent interpretation of remote sensing data poses unique challenges due to limited imaging capability, extremely high annotation costs, and insufficient multimodal data fusion. In recent years, deep learning techniques, represented by convolutional neural networks (CNNs) and transformers, have shown remarkable success in computer vision tasks thanks to their powerful feature extraction and representation capabilities; however, their application to remote sensing imagery is still relatively limited. In this Special Issue, we aim to compile state-of-the-art research on the application of machine learning methods for object detection and information extraction based on remote sensing imagery.

This Special Issue aims to present the latest advancements and emerging trends in the field of object detection and information extraction in remote sensing imagery. Specifically, the topics of interest include, but are not limited to, the following suggested themes:

  • Object detection and tracking in remote sensing images/videos;
  • Scene recognition, road extraction, semantic segmentation;
  • Anomaly detection and quality evaluation of remote sensing data;
  • Multi-modal remote sensing information extraction and fusion;
  • Few/zero-shot learning in remote sensing data.

Prof. Dr. Jie Feng
Prof. Dr. Gui-Song Xia
Prof. Dr. Xiangrong Zhang
Prof. Dr. Gong Cheng
Prof. Dr. Lichao Mou
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles as well as short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Remote Sensing is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2700 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • object detection of remote sensing images
  • object detection and tracking of remote sensing videos
  • few/zero-shot learning
  • multi-source data fusion
  • weakly supervised learning
  • semantic segmentation
  • remote sensing image classification

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • e-Book format: Special Issues with more than 10 articles can be published as dedicated e-books, ensuring wide and rapid dissemination.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (20 papers)


Research

18 pages, 10628 KiB  
Article
A CNN- and Self-Attention-Based Maize Growth Stage Recognition Method and Platform from UAV Orthophoto Images
by Xindong Ni, Faming Wang, Hao Huang, Ling Wang, Changkai Wen and Du Chen
Remote Sens. 2024, 16(14), 2672; https://doi.org/10.3390/rs16142672 - 22 Jul 2024
Cited by 4 | Viewed by 1421
Abstract
The accurate recognition of maize growth stages is crucial for effective farmland management strategies. In order to overcome the difficulty of quickly obtaining precise information about the maize growth stage in complex farmland scenarios, this study proposes a Maize Hybrid Vision Transformer (MaizeHT) that combines a convolutional structure with self-attention for maize growth stage recognition. The MaizeHT model utilizes a ResNet34 convolutional neural network to extract image features for self-attention, which are then transformed into sequence vectors (tokens) using Patch Embedding; category information and location information are simultaneously inserted as additional tokens. A Transformer architecture with multi-head self-attention is employed to extract token features and predict maize growth stage categories using a linear layer. In addition, the MaizeHT model is standardized and encapsulated, and a prototype platform for intelligent maize growth stage recognition is developed for deployment on a website. Finally, a performance validation test of MaizeHT was carried out. Specifically, MaizeHT achieves an accuracy of 97.71% when the input image resolution is 224 × 224 and 98.71% when the input image resolution is 512 × 512 on the self-built dataset; the number of parameters is 15.446 M, and the floating-point operations are 4.148 G. The proposed maize growth stage recognition method could provide computational support for maize farm intelligence. Full article
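
To make the hybrid design described above concrete, the following minimal PyTorch sketch pairs a ResNet34 feature extractor with a small Transformer encoder and a class token. It is not the authors' implementation; the embedding width, depth, head count, and number of growth-stage classes are assumptions.

import torch
import torch.nn as nn
from torchvision.models import resnet34

class HybridCNNTransformer(nn.Module):
    def __init__(self, num_classes=4, d_model=256, depth=4, heads=8):
        super().__init__()
        backbone = resnet34(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])   # conv features, stride 32
        self.patch_embed = nn.Conv2d(512, d_model, kernel_size=1)   # feature map -> tokens
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        self.pos_embed = nn.Parameter(torch.zeros(1, 1 + 7 * 7, d_model))  # for 224 x 224 inputs
        layer = nn.TransformerEncoderLayer(d_model, heads, dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, x):                               # x: (B, 3, 224, 224)
        f = self.patch_embed(self.cnn(x))               # (B, d_model, 7, 7)
        tokens = f.flatten(2).transpose(1, 2)           # (B, 49, d_model)
        cls = self.cls_token.expand(x.size(0), -1, -1)  # prepend a classification token
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        return self.head(self.encoder(tokens)[:, 0])    # predict the growth stage from the class token

logits = HybridCNNTransformer()(torch.randn(2, 3, 224, 224))  # -> shape (2, 4)
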
21 pages, 56125 KiB  
Article
SPA: Annotating Small Object with a Single Point in Remote Sensing Images
by Wenjie Zhao, Zhenyu Fang, Jun Cao and Zhangfeng Ju
Remote Sens. 2024, 16(14), 2515; https://doi.org/10.3390/rs16142515 - 9 Jul 2024
Cited by 1 | Viewed by 1141
Abstract
Detecting oriented small objects is a critical task in remote sensing, but the development of high-performance deep learning-based detectors is hindered by the need for large-scale and well-annotated datasets. The high cost of creating these datasets, due to the dense and numerous distribution of small objects, significantly limits the application and development of such detectors. To address this problem, we propose a single-point-based annotation approach (SPA) based on the graph cut method. In this framework, user annotations act as the origin of positive sample points, and a similarity matrix, computed from feature maps extracted by deep learning networks, facilitates an intuitive and efficient annotation process for building graph elements. Utilizing the Maximum Flow algorithm, SPA derives positive sample regions from these points and generates oriented bounding boxes (OBBOXs). Experimental results demonstrate the effectiveness of SPA, with at least a 50% improvement in annotation efficiency. Furthermore, the intersection-over-union (IoU) metric of our OBBOX is 3.6% higher than existing methods such as the “Segment Anything Model”. When applied in training, the model annotated with SPA shows a 4.7% higher mean average precision (mAP) compared to models using traditional annotation methods. These results confirm the technical advantages and practical impact of SPA in advancing small object detection in remote sensing. Full article
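
The single-point idea can be illustrated with a short NumPy/OpenCV sketch: the cosine similarity between the annotated pixel's feature vector and every feature-map location gives a positive-sample region, and an oriented box is fitted to it. This is only a hedged illustration; the paper's graph construction and Maximum Flow step are replaced by a plain similarity threshold, and the threshold value is an assumption.

import numpy as np
import cv2

def point_to_obbox(feat, point, thresh=0.8):
    """feat: (C, H, W) feature map; point: (row, col) of the single positive annotation."""
    c, h, w = feat.shape
    f = feat.reshape(c, -1)
    f = f / (np.linalg.norm(f, axis=0, keepdims=True) + 1e-8)
    seed = f[:, point[0] * w + point[1]]
    sim = (seed @ f).reshape(h, w)                     # cosine similarity to the seed pixel
    ys, xs = np.nonzero(sim >= thresh)                 # crude positive-sample region
    pts = np.stack([xs, ys], axis=1).astype(np.float32)
    (cx, cy), (bw, bh), angle = cv2.minAreaRect(pts)   # oriented bounding box (OBBOX)
    return cx, cy, bw, bh, angle

feat = np.random.rand(64, 32, 32).astype(np.float32)
print(point_to_obbox(feat, (16, 16)))
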
29 pages, 13901 KiB  
Article
Dynamic Tracking Matched Filter with Adaptive Feedback Recurrent Neural Network for Accurate and Stable Ship Extraction in UAV Remote Sensing Images
by Dongyang Fu, Shangfeng Du, Yang Si, Yafeng Zhong and Yongze Li
Remote Sens. 2024, 16(12), 2203; https://doi.org/10.3390/rs16122203 - 17 Jun 2024
Cited by 1 | Viewed by 1172
Abstract
In an increasingly globalized world, the intelligent extraction of maritime targets is crucial for both military defense and maritime traffic monitoring. The flexibility and cost-effectiveness of unmanned aerial vehicles (UAVs) in remote sensing make them invaluable tools for ship extraction. Therefore, this paper introduces a training-free, highly accurate, and stable method for ship extraction in UAV remote sensing images. First, we present the dynamic tracking matched filter (DTMF), which leverages the concept of time as a tuning factor to enhance the traditional matched filter (MF). This refinement gives DTMF superior adaptability and consistent detection performance across different time points. Next, the DTMF method is rigorously integrated into a recurrent neural network (RNN) framework using mathematical derivation and optimization principles. To further improve the convergence and robustness of the RNN solution, we design an adaptive feedback recurrent neural network (AFRNN), which optimally solves the DTMF problem. Finally, we evaluate the performance of different methods based on ship extraction accuracy using specific evaluation metrics. The results show that the proposed methods achieve over 99% overall accuracy and KAPPA coefficients above 82% in various scenarios. This approach excels in complex scenes with multiple targets and background interference, delivering distinct and precise extraction results while minimizing errors. The efficacy of the DTMF method in extracting ship targets was validated through rigorous testing. Full article
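
For readers unfamiliar with matched filtering, the sketch below implements the classical MF that DTMF extends, scoring each pixel against a known target signature with w = R^(-1) d / (d^T R^(-1) d). The time-dependent tuning factor and the RNN/AFRNN solver of the paper are not reproduced, and the covariance regularization constant is an assumption.

import numpy as np

def matched_filter(image, target_sig):
    """image: (H, W, C) pixels; target_sig: (C,) known ship signature."""
    h, w, c = image.shape
    x = image.reshape(-1, c)
    mu = x.mean(axis=0)
    r = np.cov((x - mu).T) + 1e-6 * np.eye(c)        # background covariance, regularized
    r_inv_d = np.linalg.solve(r, target_sig - mu)
    w_mf = r_inv_d / ((target_sig - mu) @ r_inv_d)   # w = R^(-1) d / (d^T R^(-1) d)
    return ((x - mu) @ w_mf).reshape(h, w)           # high responses indicate target pixels

scores = matched_filter(np.random.rand(64, 64, 3), np.array([0.9, 0.9, 0.9]))
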
22 pages, 4025 KiB  
Article
OPT-SAR-MS2Net: A Multi-Source Multi-Scale Siamese Network for Land Object Classification Using Remote Sensing Images
by Wei Hu, Xinhui Wang, Feng Zhan, Lu Cao, Yong Liu, Weili Yang, Mingjiang Ji, Ling Meng, Pengyu Guo, Zhi Yang and Yuhang Liu
Remote Sens. 2024, 16(11), 1850; https://doi.org/10.3390/rs16111850 - 22 May 2024
Cited by 1 | Viewed by 1308
Abstract
The utilization of optical and synthetic aperture radar (SAR) multi-source data to obtain better land classification results has received increasing research attention. However, there is a large property and distributional difference between optical and SAR data, resulting in an enormous challenge to fuse the inherent correlation information to better characterize land features. Additionally, scale differences in various features in remote sensing images also influence the classification results. To this end, an optical and SAR Siamese semantic segmentation network, OPT-SAR-MS2Net, is proposed. This network can intelligently learn effective multi-source features and realize end-to-end interpretation of multi-source data. Firstly, the Siamese network is used to extract features from optical and SAR images in different channels. In order to fuse the complementary information, the multi-source feature fusion module fuses the cross-modal heterogeneous remote sensing information from both high and low levels. To adapt to the multi-scale features of the land object, the multi-scale feature-sensing module generates multiple information perception fields. This enhances the network’s capability to learn contextual information. The experimental results obtained using WHU-OPT-SAR demonstrate that our method outperforms the state of the art, with an mIoU of 45.2% and an OA of 84.3%. These values are 2.3% and 2.6% better than those achieved by the most recent method, MCANet, respectively. Full article
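
A minimal two-branch sketch of the Siamese optical/SAR setup described above is shown below. It is not the OPT-SAR-MS2Net architecture; the channel widths, the concatenation-based fusion, and the class count are assumptions standing in for the paper's multi-source feature fusion and multi-scale feature-sensing modules.

import torch
import torch.nn as nn

def conv_block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.BatchNorm2d(cout), nn.ReLU())

class OptSarFusionSeg(nn.Module):
    def __init__(self, num_classes=8):
        super().__init__()
        self.opt_branch = nn.Sequential(conv_block(3, 32), conv_block(32, 64))   # optical (RGB) branch
        self.sar_branch = nn.Sequential(conv_block(1, 32), conv_block(32, 64))   # SAR (single-channel) branch
        self.fuse = conv_block(128, 64)                                          # cross-modal fusion
        self.head = nn.Conv2d(64, num_classes, 1)                                # per-pixel land-cover logits

    def forward(self, opt, sar):
        f = torch.cat([self.opt_branch(opt), self.sar_branch(sar)], dim=1)
        return self.head(self.fuse(f))

out = OptSarFusionSeg()(torch.randn(1, 3, 128, 128), torch.randn(1, 1, 128, 128))  # (1, 8, 128, 128)
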
19 pages, 18815 KiB  
Article
Research on Input Schemes for Polarimetric SAR Classification Using Deep Learning
by Shuaiying Zhang, Lizhen Cui, Yue Zhang, Tian Xia, Zhen Dong and Wentao An
Remote Sens. 2024, 16(11), 1826; https://doi.org/10.3390/rs16111826 - 21 May 2024
Cited by 1 | Viewed by 1237
Abstract
This study employs the reflection symmetry decomposition (RSD) method to extract polarization scattering features from ground object images, aiming to determine the optimal data input scheme for deep learning networks in polarimetric synthetic aperture radar classification. Eight distinct polarizing feature combinations were designed, and the classification accuracy of various approaches was evaluated using the classic convolutional neural networks (CNNs) AlexNet and VGG16. The findings reveal that the commonly employed six-parameter input scheme, favored by many researchers, lacks the comprehensive utilization of polarization information and warrants attention. Intriguingly, leveraging the complete nine-parameter input scheme based on the polarization coherence matrix results in improved classification accuracy. Furthermore, the input scheme incorporating all 21 parameters from the RSD and polarization coherence matrix notably enhances overall accuracy and the Kappa coefficient compared to the other seven schemes. This comprehensive approach maximizes the utilization of polarization scattering information from ground objects, emerging as the most effective CNN input data scheme in this study. Additionally, the classification performance using the second and third component total power values (P2 and P3) from the RSD surpasses the approach utilizing surface scattering power value (PS) and secondary scattering power value (PD) from the same decomposition. Full article
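
As an illustration of how such input schemes are assembled, the sketch below flattens each pixel's 3 x 3 Hermitian polarization coherence matrix into nine real parameters (three diagonal powers plus the real and imaginary parts of the upper-triangular terms); the RSD-derived quantities that complete the 21-parameter scheme are not shown.

import numpy as np

def t3_to_nine(t3):
    """t3: (H, W, 3, 3) complex coherency matrices -> (H, W, 9) real-valued CNN input."""
    diag = np.real(np.stack([t3[..., 0, 0], t3[..., 1, 1], t3[..., 2, 2]], axis=-1))
    off = np.stack([t3[..., 0, 1], t3[..., 0, 2], t3[..., 1, 2]], axis=-1)
    return np.concatenate([diag, np.real(off), np.imag(off)], axis=-1)

t3 = np.random.rand(4, 4, 3, 3) + 1j * np.random.rand(4, 4, 3, 3)
print(t3_to_nine(t3).shape)  # (4, 4, 9)
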
19 pages, 11008 KiB  
Article
SAM-Induced Pseudo Fully Supervised Learning for Weakly Supervised Object Detection in Remote Sensing Images
by Xiaoliang Qian, Chenyang Lin, Zhiwu Chen and Wei Wang
Remote Sens. 2024, 16(9), 1532; https://doi.org/10.3390/rs16091532 - 26 Apr 2024
Cited by 2 | Viewed by 1730
Abstract
Weakly supervised object detection (WSOD) in remote sensing images (RSIs) aims to detect high-value targets by solely utilizing image-level category labels; however, two problems have not been well addressed by existing methods. Firstly, the seed instances (SIs) are mined solely relying on the category score (CS) of each proposal, which is inclined to concentrate on the most salient parts of the object; furthermore, they are unreliable because the robustness of the CS is not sufficient due to the fact that the inter-category similarity and intra-category diversity are more serious in RSIs. Secondly, the localization accuracy is limited by the proposals generated by the selective search or edge box algorithm. To address the first problem, a segment anything model (SAM)-induced seed instance-mining (SSIM) module is proposed, which mines the SIs according to the object quality score, which indicates the comprehensive characteristic of the category and the completeness of the object. To handle the second problem, a SAM-based pseudo-ground truth-mining (SPGTM) module is proposed to mine the pseudo-ground truth (PGT) instances, for which the localization is more accurate than traditional proposals by fully making use of the advantages of SAM, and the object-detection heads are trained by the PGT instances in a fully supervised manner. The ablation studies show the effectiveness of the SSIM and SPGTM modules. Comprehensive comparisons with 15 WSOD methods demonstrate the superiority of our method on two RSI datasets. Full article
22 pages, 32231 KiB  
Article
MFIL-FCOS: A Multi-Scale Fusion and Interactive Learning Method for 2D Object Detection and Remote Sensing Image Detection
by Guoqing Zhang, Wenyu Yu and Ruixia Hou
Remote Sens. 2024, 16(6), 936; https://doi.org/10.3390/rs16060936 - 7 Mar 2024
Cited by 7 | Viewed by 1799
Abstract
Object detection is dedicated to finding objects in an image and estimating their categories and locations. Recently, object detection algorithms have suffered from a loss of semantic information in the deeper feature maps due to the deepening of the backbone network. For example, when using complex backbone networks, existing feature fusion methods cannot fuse information from different layers effectively. In addition, anchor-free object detection methods fail to accurately predict the same object due to the different learning mechanisms of the regression and centrality prediction branches. To address the above problems, we propose a multi-scale fusion and interactive learning method for fully convolutional one-stage anchor-free object detection, called MFIL-FCOS. Specifically, we designed a multi-scale fusion module to address the problem of local semantic information loss in high-level feature maps, which strengthens feature extraction by enhancing the local information of low-level features and fusing the rich semantic information of high-level features. Furthermore, we propose an interactive learning module to increase interactivity and produce more accurate predictions by generating a centrality-position weight adjustment regression task and a centrality prediction task. Following these strategic improvements, we conduct extensive experiments on the COCO and DIOR datasets, demonstrating its superior capabilities in 2D object detection tasks and remote sensing image detection, even under challenging conditions. Full article
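
For context, the centrality (centerness) target that the interactive learning module revisits is defined in the original FCOS from the per-location regression distances (l, t, r, b) as sqrt(min(l, r)/max(l, r) * min(t, b)/max(t, b)). The sketch below computes that standard target only; the paper's centrality-position weight adjustment is not reproduced.

import torch

def centerness_target(ltrb):
    """ltrb: (N, 4) distances from each location to the left/top/right/bottom box edges."""
    l, t, r, b = ltrb.unbind(dim=1)
    lr = torch.min(l, r) / torch.max(l, r).clamp(min=1e-6)
    tb = torch.min(t, b) / torch.max(t, b).clamp(min=1e-6)
    return torch.sqrt(lr * tb)   # 1.0 at the box center, decaying toward the edges

print(centerness_target(torch.tensor([[10., 10., 10., 10.], [2., 8., 18., 12.]])))
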
25 pages, 9100 KiB  
Article
Spectral–Spatial Graph Convolutional Network with Dynamic-Synchronized Multiscale Features for Few-Shot Hyperspectral Image Classification
by Shuai Liu, Hongfei Li, Chengji Jiang and Jie Feng
Remote Sens. 2024, 16(5), 895; https://doi.org/10.3390/rs16050895 - 2 Mar 2024
Cited by 4 | Viewed by 2136
Abstract
The classifiers based on the convolutional neural network (CNN) and graph convolutional network (GCN) have demonstrated their effectiveness in hyperspectral image (HSI) classification. However, their performance is limited by the high time complexity of CNN, spatial complexity of GCN, and insufficient labeled samples. To ease these limitations, the spectral–spatial graph convolutional network with dynamic-synchronized multiscale features is proposed for few-shot HSI classification. Firstly, multiscale patches are generated to enrich training samples in the feature space. A weighted spectral optimization module is explored to evaluate the discriminative information among different bands of patches. Then, the adaptive dynamic graph convolutional module is proposed to extract local and long-range spatial–spectral features of patches at each scale. Considering that features of different scales can be regarded as sequential data due to intrinsic correlations, the bidirectional LSTM is adopted to synchronously extract the spectral–spatial characteristics from all scales. Finally, auxiliary classifiers are utilized to predict labels of samples at each scale and enhance the training stability. Label smoothing is introduced into the classification loss to reduce the influence of misclassified samples and imbalance of classes. Extensive experiments demonstrate the superiority of the proposed method over other state-of-the-art methods, obtaining overall accuracies of 87.25%, 92.72%, and 93.36% on the Indian Pines, Pavia University, and Salinas datasets, respectively. Full article
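
As a small illustration of the label-smoothing ingredient mentioned above, the sketch below implements a smoothed cross-entropy loss; the smoothing factor is an assumption, and the auxiliary per-scale classifiers of the paper are not reproduced.

import torch
import torch.nn.functional as F

def smoothed_ce(logits, target, eps=0.1):
    """logits: (N, C); target: (N,) integer class labels; eps: smoothing factor."""
    n_cls = logits.size(1)
    log_p = F.log_softmax(logits, dim=1)
    soft = F.one_hot(target, n_cls).float() * (1 - eps) + eps / n_cls   # smoothed target distribution
    return -(soft * log_p).sum(dim=1).mean()

loss = smoothed_ce(torch.randn(8, 16), torch.randint(0, 16, (8,)))
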
23 pages, 22134 KiB  
Article
Multiobjective Evolutionary Superpixel Segmentation for PolSAR Image Classification
by Boce Chu, Mengxuan Zhang, Kun Ma, Long Liu, Junwei Wan, Jinyong Chen, Jie Chen and Hongcheng Zeng
Remote Sens. 2024, 16(5), 854; https://doi.org/10.3390/rs16050854 - 29 Feb 2024
Cited by 1 | Viewed by 1134
Abstract
Superpixel segmentation has been widely used in the field of computer vision. The generation of PolSAR superpixels has also been widely studied for its feasibility and high efficiency. The initial number of PolSAR superpixels is usually set manually based on experience, which has a significant impact on the final performance of superpixel segmentation and the subsequent interpretation tasks. Additionally, the effective information of PolSAR superpixels is not fully analyzed and utilized in the generation process. Regarding these issues, a multiobjective evolutionary superpixel segmentation method for PolSAR image classification is proposed in this study. It contains two layers: an automatic optimization layer and a fine segmentation layer. Fully considering the similarity information within the superpixels and the difference information among the superpixels simultaneously, the automatic optimization layer can determine the suitable number of superpixels automatically by the multiobjective optimization for PolSAR superpixel segmentation. Considering the difficulty of the search for accurate boundaries of complex ground objects in PolSAR images, the fine segmentation layer can further improve the quality of superpixels by fully using the boundary information of good-quality superpixels in the evolution process for generating PolSAR superpixels. The experiments on different PolSAR image datasets validate that the proposed approach can automatically generate high-quality superpixels without any prior information. Full article
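
The two competing criteria, similarity within superpixels and difference among superpixels, can be illustrated with simple surrogates as below; these variance- and distance-based objectives are assumptions chosen for illustration and are not the formulations optimized in the paper.

import numpy as np

def superpixel_objectives(features, labels):
    """features: (N, D) per-pixel features; labels: (N,) superpixel index of each pixel."""
    centers, intra = [], []
    for k in np.unique(labels):
        fk = features[labels == k]
        centers.append(fk.mean(axis=0))
        intra.append(((fk - fk.mean(axis=0)) ** 2).sum(axis=1).mean())   # within-superpixel scatter
    centers = np.array(centers)
    d = np.linalg.norm(centers[:, None] - centers[None, :], axis=-1)
    inter = d[np.triu_indices(len(centers), 1)].mean()                   # between-superpixel separation
    return np.mean(intra), -inter   # both objectives are minimized jointly

f1, f2 = superpixel_objectives(np.random.rand(100, 3), np.random.randint(0, 5, 100))
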
24 pages, 4133 KiB  
Article
OII: An Orientation Information Integrating Network for Oriented Object Detection in Remote Sensing Images
by Yangfeixiao Liu and Wanshou Jiang
Remote Sens. 2024, 16(5), 731; https://doi.org/10.3390/rs16050731 - 20 Feb 2024
Cited by 3 | Viewed by 1673
Abstract
Oriented object detection for remote sensing images poses formidable challenges due to arbitrary orientation, diverse scales, and densely distributed targets (e.g., across terrain). Current investigations in remote sensing object detection have primarily focused on improving the representation of oriented bounding boxes yet have neglected the significant orientation information of targets in remote sensing contexts. Recent investigations point out that the inclusion and fusion of orientation information yields substantial benefits in training an accurate oriented object detection system. In this paper, we propose a simple but effective orientation information integrating (OII) network comprising two main parts: the orientation information highlighting (OIH) module and the orientation feature fusion (OFF) module. The OIH module extracts orientation features from those produced by the backbone by modeling the frequency information of spatial features. Given that low-frequency components in an image capture its primary content, and high-frequency components contribute to its intricate details and edges, the transformation from the spatial domain to the frequency domain can effectively emphasize the orientation information of images. Subsequently, our OFF module employs a combination of a CNN attention mechanism and self-attention to derive weights for orientation features and original features. These derived weights are adopted to adaptively enhance the original features, resulting in integrated features that contain enriched orientation information. Given the inherent limitation of the original spatial attention weights in explicitly capturing orientation nuances, the incorporation of the introduced orientation weights serves as a pivotal tool to accentuate and delineate orientation information related to targets. Without unnecessary embellishments, our OII network achieves competitive detection accuracy on two prevalent remote sensing-oriented object detection datasets: DOTA (80.82 mAP) and HRSC2016 (98.32 mAP). Full article
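
The frequency-domain idea behind the OIH module can be illustrated with an ordinary 2D discrete wavelet transform, whose horizontal, vertical, and diagonal detail sub-bands emphasize edges and hence orientation. The sketch below uses PyWavelets' Haar DWT on a single channel; the dual-tree complex wavelet variant and the learned fusion of the paper are not reproduced.

import numpy as np
import pywt

def orientation_components(feature_map):
    """feature_map: (H, W) single-channel feature; returns stacked high-frequency sub-bands."""
    cA, (cH, cV, cD) = pywt.dwt2(feature_map, 'haar')   # low-frequency content + 3 detail bands
    return np.stack([cH, cV, cD], axis=0)               # orientation-sensitive detail bands

bands = orientation_components(np.random.rand(64, 64))
print(bands.shape)  # (3, 32, 32)
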
25 pages, 2880 KiB  
Article
Global and Multiscale Aggregate Network for Saliency Object Detection in Optical Remote Sensing Images
by Lina Huo, Jiayue Hou, Jie Feng, Wei Wang and Jinsheng Liu
Remote Sens. 2024, 16(4), 624; https://doi.org/10.3390/rs16040624 - 7 Feb 2024
Cited by 5 | Viewed by 1900
Abstract
Salient Object Detection (SOD) is gradually being applied to natural scene images. However, due to the apparent differences between optical remote sensing images and natural scene images, directly applying the SOD of natural scene images to optical remote sensing images has limited ability to capture global context information. Therefore, salient object detection in optical remote sensing images (ORSI-SOD) is challenging. Optical remote sensing images usually have large-scale variations. However, the vast majority of networks are based on Convolutional Neural Network (CNN) backbone networks such as VGG and ResNet, which can only extract local features. To address this problem, we designed a new model that employs a transformer-based backbone network capable of extracting global information and long-range dependencies. A new framework is proposed for this problem, named Global and Multiscale Aggregate Network for Saliency Object Detection in Optical Remote Sensing Images (GMANet). In this framework, the Pyramid Vision Transformer (PVT) is used as an encoder to capture long-range dependencies. A Multiscale Attention Module (MAM) is introduced for extracting multiscale information. Meanwhile, a Global Guidance Branch (GGB) is used to learn the global context information and obtain the complete structure. Four MAMs are densely connected to this GGB. The Aggregate Refinement Module (ARM) is used to enrich the details of edge and low-level features. The ARM fuses global context information and encoder multilevel features to complement the details while keeping the structure complete. Extensive experiments on two public datasets show that our proposed framework GMANet outperforms 28 state-of-the-art methods on six evaluation metrics, especially E-measure and F-measure. This is because we apply a coarse-to-fine strategy to merge global context information and multiscale information. Full article
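
Since the comparison above is reported in terms of E-measure and F-measure, the sketch below shows a standard F-measure computation for a saliency map, with the customary beta^2 = 0.3 weighting of precision over recall; the fixed binarization threshold is a simplification (adaptive or swept thresholds are common in practice).

import numpy as np

def f_measure(sal_map, gt, thresh=0.5, beta2=0.3):
    """sal_map: (H, W) saliency in [0, 1]; gt: (H, W) binary ground truth."""
    pred = (sal_map >= thresh).astype(np.float64)
    tp = (pred * gt).sum()
    precision = tp / (pred.sum() + 1e-8)
    recall = tp / (gt.sum() + 1e-8)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)

score = f_measure(np.random.rand(64, 64), (np.random.rand(64, 64) > 0.5).astype(np.float64))
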
25 pages, 7566 KiB  
Article
Multi-Level Feature Extraction Networks for Hyperspectral Image Classification
by Shaoyi Fang, Xinyu Li, Shimao Tian, Weihao Chen and Erlei Zhang
Remote Sens. 2024, 16(3), 590; https://doi.org/10.3390/rs16030590 - 4 Feb 2024
Cited by 3 | Viewed by 2653
Abstract
Hyperspectral image (HSI) classification plays a key role in Earth observation missions. Recently, transformer-based approaches have been widely used for HSI classification because of their ability to model long-range sequences. However, these methods face two main challenges. First, they treat the HSI as a set of linear vectors, disregarding its 3D attributes and spatial structure. Second, the repeated stacking of encoders leads to information loss and vanishing gradients. To overcome these challenges, we propose a new solution called the multi-level feature extraction network (MLFEN). MLFEN consists of two sub-networks: the hybrid convolutional attention module (HCAM) and the enhanced dense vision transformer (EDVT). HCAM incorporates a band shift strategy to eliminate the edge effect of convolution and utilizes hybrid convolutional blocks to capture the 3D properties and spatial structure of the HSI; an attention module is additionally introduced to identify strongly discriminative features. EDVT reorganizes the original encoders by incorporating dense connections and adaptive feature fusion components, enabling faster propagation of information and mitigating the problem of vanishing gradients. Furthermore, we propose a novel sparse loss function to better fit the data distribution. Extensive experiments conducted on three public datasets demonstrate the significant advancements achieved by MLFEN. Full article
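The abstract mentions a band shift strategy applied before the hybrid convolutions but does not define it on this page. One plausible reading is a cyclic shift along the spectral axis; the NumPy sketch below illustrates that reading only and is not taken from the authors' implementation.

```python
import numpy as np

def cyclic_band_shift(hsi_patch: np.ndarray, shift: int) -> np.ndarray:
    """Cyclically shift the spectral bands of an HSI patch of shape (H, W, B).
    This is only one plausible interpretation of a 'band shift' strategy."""
    return np.roll(hsi_patch, shift, axis=-1)

if __name__ == "__main__":
    patch = np.random.rand(9, 9, 103)        # e.g., a 9x9 patch with 103 bands
    shifted = cyclic_band_shift(patch, 5)
    assert shifted.shape == patch.shape
    # Band 0 of the shifted cube equals band B-5 of the original cube.
    assert np.allclose(shifted[..., 0], patch[..., -5])
```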
Figure 1. The overall framework of the proposed MLFEN, which is composed of two primary sub-networks: HCAM and EDVT. HCAM consists of three key components: band shift, convolutional operations (e.g., 3D Block and 2D Block), and AM, while EDVT incorporates two design elements, namely dense connection and AFF.
Figure 2. Illustration of the EDVT.
Figure 3. Visualization of five datasets: (a) PU dataset (false-color image, RGB-R: 56, G: 33, B: 13); (b) SA dataset (false-color image, RGB-R: 38, G: 5, B: 20); (c) KSC dataset (false-color image, RGB-R: 59, G: 40, B: 23); (d) IP dataset (false-color image, RGB-R: 50, G: 30, B: 20); (e) HU dataset (false-color image, RGB-R: 50, G: 30, B: 20).
Figure 4. Impact of patch size k on the performance of MLFEN.
Figure 5. Impact of the number of training samples on the performance of MLFEN.
Figure 6. Classification maps on the PU dataset: (a) 2D-CNN; (b) 3D-CNN; (c) HybridSN; (d) SSAN; (e) MAFN; (f) ViT; (g) HiT; (h) CVSSN; (i) SSFTT; (j) MLFEN (ours); (k) ground truth; (l) color bar.
Figure 7. Classification maps on the KSC dataset: (a) 2D-CNN; (b) 3D-CNN; (c) HybridSN; (d) SSAN; (e) MAFN; (f) ViT; (g) HiT; (h) CVSSN; (i) SSFTT; (j) MLFEN (ours); (k) ground truth; (l) color bar.
Figure 8. Classification maps on the SA dataset: (a) 2D-CNN; (b) 3D-CNN; (c) HybridSN; (d) SSAN; (e) MAFN; (f) ViT; (g) HiT; (h) CVSSN; (i) SSFTT; (j) MLFEN (ours); (k) ground truth; (l) color bar.
Figure 9. Classification maps on the IP dataset: (a) 2D-CNN; (b) 3D-CNN; (c) HybridSN; (d) SSAN; (e) MAFN; (f) ViT; (g) HiT; (h) CVSSN; (i) SSFTT; (j) MLFEN (ours); (k) ground truth; (l) color bar.
Figure 10. Classification maps on the HU dataset: (a) 2D-CNN; (b) 3D-CNN; (c) HybridSN; (d) SSAN; (e) MAFN; (f) ViT; (g) HiT; (h) CVSSN; (i) SSFTT; (j) MLFEN (ours); (k) ground truth; (l) color bar.
Figure 11. Running time and classification performance of different models.
23 pages, 9145 KiB  
Article
A Multi-Feature Fusion-Based Method for Crater Extraction of Airport Runways in Remote-Sensing Images
by Yalun Zhao, Derong Chen and Jiulu Gong
Remote Sens. 2024, 16(3), 573; https://doi.org/10.3390/rs16030573 - 2 Feb 2024
Cited by 4 | Viewed by 1567
Abstract
Because of the complex background of airports and the damaged areas of the runway, existing runway extraction methods do not perform well. Furthermore, accurate crater extraction from airport runways plays a vital role in military applications, yet there are few studies on this topic. To solve these problems, this paper proposes an effective method for runway crater extraction that consists of two stages: airport runway extraction and runway crater extraction. In the first stage, we apply corner detection and screening strategies based on multiple runway features, such as high brightness, regional texture similarity, and shape, to improve the completeness of runway extraction; the method can automatically achieve complete extraction of runways with different degrees of damage. In the second stage, craters are extracted by calculating the edge gradient amplitude and the standard deviation of the grayscale distribution of candidate areas within the extracted runway. On four typical remote-sensing images and four post-damage remote-sensing images, the average integrity of the runway extraction exceeds 90%. Comparative experiments show that both the extraction quality and the running speed of our method surpass those of state-of-the-art methods. The final crater-extraction experiments show that the proposed method can effectively extract craters on airport runways, with precision and recall both above 80%. Overall, this research is of great significance for the damage assessment of airport runways from remote-sensing images in military applications. Full article
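The second stage scores each candidate area inside the extracted runway by its edge gradient amplitude and the standard deviation of its grayscale distribution. The exact definitions and thresholds are given in the paper, not on this page; the NumPy sketch below only illustrates one straightforward way to compute two such statistics for a candidate mask.

```python
import numpy as np

def candidate_scores(gray: np.ndarray, mask: np.ndarray):
    """Score a candidate crater region inside the runway.

    gray : 2-D grayscale image (float); mask : boolean array of the candidate area.
    Returns (mean edge-gradient amplitude on the region border, grayscale std inside).
    Illustrative computation only; the paper's exact definitions may differ.
    """
    gy, gx = np.gradient(gray.astype(float))
    amplitude = np.hypot(gx, gy)

    # Approximate the region border as mask pixels with at least one background neighbour.
    padded = np.pad(mask, 1, constant_values=False)
    neigh_bg = (~padded[:-2, 1:-1] | ~padded[2:, 1:-1] |
                ~padded[1:-1, :-2] | ~padded[1:-1, 2:])
    border = mask & neigh_bg

    edge_strength = amplitude[border].mean() if border.any() else 0.0
    gray_std = gray[mask].std() if mask.any() else 0.0
    return edge_strength, gray_std

if __name__ == "__main__":
    img = np.zeros((64, 64)); img[20:40, 20:40] = 1.0        # bright square as a toy "crater"
    m = np.zeros((64, 64), dtype=bool); m[18:42, 18:42] = True
    print(candidate_scores(img, m))
```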
Figure 1. The flowchart of the proposed runway extraction method.
Figure 2. The process of feature extraction: (a) extraction of runway-edge parallel line segments; (b) extraction of runway endpoints.
Figure 3. Geometric models: (a) the neighborhood geometric model for line extraction; (b) the neighborhood geometric model for corner extraction.
Figure 4. Parallel and distance constraints.
Figure 5. Screening of parallel line-segment pairs: (a) screening results based on parallel and distance constraints (interference pairs are marked with blue and yellow circles); (b) screening results combined with regional feature constraints.
Figure 6. Regional feature constraints.
Figure 7. Screening of runway-edge corner points: (a) the inner and outer areas of the runway on both sides of the edge; (b) the high-brightness area in the corner neighborhood and the template area within the runway; (c) selection of the high-brightness area in the corner neighborhood.
Figure 8. Test images of airport runways. (a) Original images #1-8: Image 1 (Hartsfield-Jackson Atlanta International Airport, USA), Image 2 (Osan Air Base, Korea), Image 3 (Charlotte-Douglas International Airport, USA), Image 4 (Rick Husband Amarillo International Airport, USA), Image 5 (Ponikve Airport, Serbia), Image 6 (Ponikve Airport, Serbia), Image 7 (Sjenica Air Base, Serbia), Image 8 (OBRVA Airfield, Serbia); (b) results of the proposed runway extraction method marked with red boxes; (c) binary results of the proposed runway extraction method; (d) ground truth.
Figure 9. Runway extraction results of the method in this article and of [15,17] on test images #2 and #6: (a) original images; (b) runway extraction results based on [15]; (c) runway extraction results based on [17]; (d) runway extraction results of the proposed method; (e) ground truth.
Figure 10. Test image #5: (a) original image; (b) result of the proposed runway extraction method marked with a red box; (c) result of the proposed crater extraction method marked with red circles; (d) result of the method in [32] marked with red circles; (e) binary result of the proposed method; (f) ground truth of the runway and craters.
Figure 11. Test image #6: (a) original image; (b) result of the proposed runway extraction method marked with a red box; (c) result of the proposed crater extraction method marked with red circles; (d) result of the method in [32] marked with red circles; (e) binary result of the proposed method; (f) ground truth of the runway and craters.
22 pages, 6084 KiB  
Article
TreeDetector: Using Deep Learning for the Localization and Reconstruction of Urban Trees from High-Resolution Remote Sensing Images
by Haoyu Gong, Qian Sun, Chenrong Fang, Le Sun and Ran Su
Remote Sens. 2024, 16(3), 524; https://doi.org/10.3390/rs16030524 - 30 Jan 2024
Cited by 2 | Viewed by 3074
Abstract
There have been considerable efforts to generate tree crown maps from satellite images; however, tree localization in urban environments from satellite imagery remains challenging, in particular the segmentation of dense tree crowns, for which methods based on semantic segmentation have made significant progress. We propose to split the tree localization problem into two parts, dense clusters and single trees, and to combine an object detection method with a procedural generation method based on planting rules, which improves the accuracy of single-tree detection. Specifically, we propose a two-stage urban tree localization pipeline that leverages deep learning and planting-strategy algorithms together with region discrimination methods. This approach ensures the precise localization of individual trees while also enabling distribution inference within dense tree canopies. Additionally, our method estimates the radius and height of trees, which provides significant advantages for three-dimensional reconstruction from remote sensing images. Compared with existing methods, our approach achieves 82.3% accuracy in individual tree localization and can be seamlessly integrated with the three-dimensional reconstruction of urban trees. We visualized the resulting three-dimensional reconstructions, which capture the diversity of tree heights and provide a more realistic solution for generating tree distributions. Full article
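The region discrimination step in the second stage relies on simple shape descriptors of each segmented tree cluster (Figure 4 below contrasts clusters with a similar perimeter-to-area ratio but different aspect ratios). As a sketch of how such descriptors could be computed from a binary cluster mask, here is a small NumPy example; the exact feature definitions used in the paper may differ.

```python
import numpy as np

def cluster_shape_features(mask: np.ndarray):
    """Two simple shape descriptors for a tree-cluster mask (boolean 2-D array):
    perimeter-to-area ratio and bounding-box aspect ratio. These mirror the two
    feature metrics mentioned for Figure 4, but are not the paper's exact formulas."""
    area = mask.sum()
    if area == 0:
        return 0.0, 0.0
    # Border pixels: mask pixels with at least one 4-neighbour outside the mask.
    padded = np.pad(mask, 1, constant_values=False)
    border = mask & ~(padded[:-2, 1:-1] & padded[2:, 1:-1] &
                      padded[1:-1, :-2] & padded[1:-1, 2:])
    perimeter = border.sum()
    rows, cols = np.nonzero(mask)
    h = rows.max() - rows.min() + 1
    w = cols.max() - cols.min() + 1
    aspect_ratio = max(h, w) / min(h, w)
    return perimeter / area, aspect_ratio

if __name__ == "__main__":
    m = np.zeros((50, 80), dtype=bool); m[10:20, 10:70] = True  # elongated cluster
    print(cluster_shape_features(m))
```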
Figure 1. The workflow from input images to 3D modeling. In the first stage, the input remote sensing images go through an object detection neural network to identify three types of tree units; the tree clusters and low shrubs are then passed to the segmentation network to obtain their respective segmentation images. In the second stage, we extract features from the segmentation images, then apply region partitioning and planting-rule algorithms to determine the positions of the trees. Finally, after estimating the tree heights, the 3D modeling is performed.
Figure 2. Different annotation data. (A) Annotation of a residential area; this image contains three types of tree units and includes roads and buildings. (B) Primarily densely arranged tree clusters, without roads or buildings. (C) Large clusters of dense forest, which we treat as a whole.
Figure 3. Obtaining tree units using neural networks. The input remote sensing image first passes through an object detection neural network; after the objects are obtained and classified, they are fed into a semantic segmentation network, which produces a canopy coverage map for the tree clusters.
Figure 4. Tree cluster diagrams with different features. The tree clusters (blue areas) in the left and right images have a similar perimeter-to-area ratio, but their different aspect ratios result in distinct shapes and sizes; two different feature metrics are therefore needed to evaluate them.
Figure 5. Illustrations of the eight planting rules used. Tree crown sizes vary under each rule, and the lower-right corner of each rule shows the corresponding real-world tree distribution. Green denotes larger crowns, red medium-sized crowns, and yellow smaller crowns; see Section 3.3.3 for details. Our approach produces a more realistic representation of tree clusters under these rules.
Figure 6. Influence of different data augmentation methods on the final loss. Too little augmentation has an insignificant effect, while excessive augmentation deviates from real remote sensing images; hence most of the floating values are set close to the median to achieve optimal results.
Figure 7. Comparison of detection performance for different networks. Rows (A)-(C) compare the detection results of different networks: the leftmost column shows the original image, and the following columns show the detections of the various networks. Row (D) is an enlarged view of the top-left part of row (B), with highlighted bounding boxes; our method reduces duplicate detections.
Figure 8. The 3D reconstruction results for different images. (A(1)) and (A(3)) are residential areas, while (A(2)) and (A(4)) are farmland; (B(1))-(B(4)) show the corresponding 3D reconstructions.
Figure 9. Effects of different planting rules applied in 3D reconstruction, demonstrated with clump planting, row planting, and natural planting (2 layers).
Figure 10. Original images and rendered 3D reconstruction results. After rendering, the reconstructions capture the tree environment of the original images; the close-up view below shows the realistic arrangement of the trees.
18 pages, 4197 KiB  
Article
City Scale Traffic Monitoring Using WorldView Satellite Imagery and Deep Learning: A Case Study of Barcelona
by Annalisa Sheehan, Andrew Beddows, David C. Green and Sean Beevers
Remote Sens. 2023, 15(24), 5709; https://doi.org/10.3390/rs15245709 - 13 Dec 2023
Cited by 3 | Viewed by 4219
Abstract
Accurate traffic data are crucial for a range of applications, such as quantifying vehicle emissions and transportation planning and management. However, the availability of traffic data is geographically fragmented, and such data are rarely held in an accessible form; there is therefore an urgent need for a common approach to developing large urban traffic data sets. Utilising satellite imagery to estimate traffic offers a cost-effective and standardized alternative to ground-based traffic monitoring. This study used high-resolution satellite imagery (WorldView-2 and 3) and deep learning (DL) to identify vehicles, road by road, in Barcelona (2017-2019). The You Only Look Once (YOLOv3) object detection model was trained, and model accuracy was investigated with respect to parameters such as training-data-set-specific anchor boxes, network resolution, image colour-band composition, and input image size. The best-performing vehicle detection configuration had a precision (proportion of positive detections that were correct) of 0.69 and a recall (proportion of objects in the image correctly identified) of 0.79. We demonstrate that high-resolution satellite imagery and object detection models can be used to identify vehicles at a city scale. However, the approach highlights challenges in identifying vehicles on narrow roads, in shadow, under vegetation, and obstructed by buildings. This is the first time that DL has been used to identify vehicles at a city scale, and it demonstrates the possibility of applying these methods to cities globally where data are often unavailable. Full article
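One of the investigated training parameters is a set of data-set-specific anchor boxes, obtained by clustering the widths and heights of the Barcelona training bounding boxes into nine groups (see Figure 2 below). A minimal NumPy sketch of such a clustering step follows; plain Euclidean k-means is assumed here, whereas YOLO-style pipelines often use an IoU-based distance, and the authors' exact procedure is not reproduced on this page.

```python
import numpy as np

def kmeans_anchors(wh: np.ndarray, k: int = 9, iters: int = 100, seed: int = 0):
    """Cluster bounding-box (width, height) pairs into k anchor sizes with plain
    k-means. Whether Euclidean or IoU distance was used by the authors is not
    stated on this page; Euclidean distance is assumed for illustration."""
    rng = np.random.default_rng(seed)
    centers = wh[rng.choice(len(wh), k, replace=False)]
    for _ in range(iters):
        # Assign each box to its nearest centre.
        d = np.linalg.norm(wh[:, None, :] - centers[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        new_centers = np.array([
            wh[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers[np.argsort(centers.prod(axis=1))]   # sort anchors by area

if __name__ == "__main__":
    boxes = np.abs(np.random.default_rng(1).normal(30, 10, size=(500, 2)))
    print(kmeans_anchors(boxes))
```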
Figure 1. Images showing the differences between the training data set classes: parked, static, and moving.
Figure 2. K-means cluster assignments and means for the Barcelona training data set bounding-box widths and heights, grouped into nine clusters. The nine clusters are shown in different colours and labelled 0-8.
Figure 3. Total loss, precision, and recall plots averaged every 10 epochs for (a) Model 6 (YOLOv3, RGB, PSM, 1000 epochs, 1500-pixel image size, 9 Barcelona anchors) and (b) Model 7 (YOLOv3, RGB, PSM, 300 epochs, 416-pixel image size, 9 Barcelona anchors).
Figure 4. Model 1 (YOLOv3, RGB, 3-class, 1000 epochs) class confusion matrix showing the number of predicted detection labels and corresponding true labels. In this model run, most detections are accurately classified as parked vehicles.
Figure 5. Satellite image collected in July showing the validation data set (ground truth) and a subset of the Model 6 detections for an example area.
Figure 6. Satellite image collected in October showing the validation data set (ground truth) and a subset of the Model 6 detections for an example area.
19 pages, 14479 KiB  
Article
FCOSR: A Simple Anchor-Free Rotated Detector for Aerial Object Detection
by Zhonghua Li, Biao Hou, Zitong Wu, Bo Ren and Chen Yang
Remote Sens. 2023, 15(23), 5499; https://doi.org/10.3390/rs15235499 - 25 Nov 2023
Cited by 27 | Viewed by 2749
Abstract
Although existing anchor-based oriented object detection methods have achieved remarkable results, they require manually preset boxes, which introduce additional hyper-parameters and calculations. These methods often rely on more complex architectures for better performance, which makes them difficult to deploy on computationally constrained embedded platforms such as satellites and unmanned aerial vehicles. We aim to design a high-performance algorithm for aerial image detection that is simple, fast, and easy to deploy. In this article, we propose FCOSR, a one-stage anchor-free rotated object detector that can be deployed on most platforms and uses a well-defined label assignment strategy tailored to the characteristics of aerial image objects. We use an ellipse center sampling method to define a suitable sampling region inside an oriented bounding box (OBB), and a fuzzy sample assignment strategy provides reasonable labels for overlapping objects. To solve the problem of insufficient sampling, we design a multi-level sampling module. Together, these strategies allocate more appropriate labels to training samples. Our algorithm achieves a mean average precision (mAP) of 79.25, 75.41, and 90.13 on the DOTA-v1.0, DOTA-v1.5, and HRSC2016 datasets, respectively. FCOSR outperforms other methods in single-scale evaluation, where the small model achieves an mAP of 74.05 at 23.7 FPS on an RTX 2080-Ti GPU. When the lightweight FCOSR model is converted to the TensorRT format, it achieves an mAP of 73.93 on DOTA-v1.0 at 17.76 FPS on a Jetson AGX Xavier device at a single scale. Full article
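The label assignment described above treats a feature-map location as a positive sample only if it falls inside an elliptical region inscribed in the oriented bounding box. A generic inside-ellipse test of this kind is sketched below in NumPy; the shrink factor and exact formulation are assumptions for illustration, not the paper's definition.

```python
import numpy as np

def in_ellipse_region(points: np.ndarray, obb, shrink: float = 0.8) -> np.ndarray:
    """Return a boolean mask of which candidate points fall inside the (shrunken)
    ellipse inscribed in an oriented bounding box (cx, cy, w, h, angle_rad).
    Generic illustration of ellipse centre sampling, not the authors' code."""
    cx, cy, w, h, theta = obb
    dx, dy = points[:, 0] - cx, points[:, 1] - cy
    # Rotate the offsets into the box-aligned frame.
    u = dx * np.cos(theta) + dy * np.sin(theta)
    v = -dx * np.sin(theta) + dy * np.cos(theta)
    a, b = shrink * w / 2.0, shrink * h / 2.0
    return (u / a) ** 2 + (v / b) ** 2 <= 1.0

if __name__ == "__main__":
    ys, xs = np.mgrid[0:64, 0:64]
    grid = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)
    mask = in_ellipse_region(grid, obb=(32, 32, 40, 10, np.pi / 6))
    print(mask.sum(), "of", len(grid), "grid points are positive samples")
```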
Figure 1. FCOSR architecture. The output of the backbone with the feature pyramid network (FPN) [40] is a set of multi-level feature maps, P3-P7. The head is shared across all multi-level feature maps. The predictions on the left of the head form the inference part, while the other components are only active during the training stage. The label assignment module (LAM) allocates labels to each feature map. H and W are the height and width of the feature map, Stride is the downsampling ratio of the multi-level feature maps, C is the number of categories, and the regression branch directly predicts the center point, width, height, and angle of the target.
Figure 2. Ellipse center area of an OBB. The oriented rectangle represents the OBB of the target, and the shaded area represents the sampling region: (a) general sampling region, (b) horizontal center sampling region, (c) original elliptical region, and (d) shrunken elliptical region.
Figure 3. A fuzzy sample label assignment demo: (a) 2D label assignment area diagram; (b) 3D visualization of J(X) for two objects. The red OBB and area represent the court object, and the blue represent the ground track field. After the J(X) calculation, the smaller areas inside the red ellipse are allocated to the court, and the other blue areas are allocated to the ground track field.
Figure 4. Multi-level sampling: (a) insufficient sampling, where the green points are sampling points; the ship is so narrow that there are no sampling points inside it. (b) A multi-level sampling demo: the red line indicates that the target follows the FCOS guidelines and is assigned to H6, but it is too narrow to sample effectively; the blue lines indicate that the target is assigned to lower feature levels according to the MLS guidelines, i.e., the target is sampled at three different scales to handle insufficient sampling.
Figure 5. Physical picture of the embedded object detection system based on the Nvidia Jetson platform.
Figure 6. Detection result for an entire aerial image on the Nvidia Jetson platform. The P2043 image (4165 × 3438 pixels) from the DOTA-v1.0 test set was processed in 1.4 s on a Jetson AGX Xavier device and the results visualized.
Figure 7. FCOSR-M detection results on the DOTA-v1.0 test set, shown with a confidence threshold of 0.3.
Figure 8. FCOSR-L detection results on HRSC2016, visualized with a confidence threshold of 0.3.
Figure 9. Speed versus accuracy on the DOTA-v1.0 single-scale test set. X indicates the ResNeXt backbone, R the ResNet backbone, RR the ReResNet (ReDet) backbone, and Mobile the MobileNet v2 backbone. We tested ReDet [20], S²ANet [16], and R³Det [28] on a single RTX 2080-Ti device based on their source code. Faster-RCNN-O (FR-O) [8], RetinaNet-O (RN-O) [10], and Oriented RCNN (O-RCNN) [27] test results are from the OBBDetection repository.
23 pages, 2929 KiB  
Article
Misaligned RGB-Infrared Object Detection via Adaptive Dual-Discrepancy Calibration
by Mingzhou He, Qingbo Wu, King Ngi Ngan, Feng Jiang, Fanman Meng and Linfeng Xu
Remote Sens. 2023, 15(19), 4887; https://doi.org/10.3390/rs15194887 - 9 Oct 2023
Cited by 6 | Viewed by 3099
Abstract
Object detection based on RGB and infrared images has emerged as a crucial research area in computer vision, as the synergy of RGB and infrared ensures the robustness of object-detection algorithms under varying lighting conditions. However, captured RGB-IR image pairs typically exhibit spatial misalignment due to sensor discrepancies, leading to compromised localization performance. Furthermore, owing to the inconsistent distribution of deep features from the two modalities, directly fusing multi-modal features weakens the feature difference between the object and the background, thereby interfering with RGB-infrared object-detection performance. To address these issues, we propose an adaptive dual-discrepancy calibration network (ADCNet) for misaligned RGB-infrared object detection, comprising spatial-discrepancy and domain-discrepancy calibration. Specifically, the spatial-discrepancy calibration module conducts an adaptive affine transformation to achieve spatial alignment of features. The domain-discrepancy calibration module then separately aligns object and background features from the different modalities, making the object and background distributions of the fused feature easier to distinguish and thereby enhancing the effectiveness of RGB-infrared object detection. Our ADCNet outperforms the baseline by 3.3% and 2.5% in mAP50 on the FLIR and misaligned M3FD datasets, respectively. Experimental results demonstrate the superiority of our proposed method over state-of-the-art approaches. Full article
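Figure 5 below names Adaptive Instance Normalization (AdaIN) as part of the calibration-and-fusion step. The PyTorch sketch that follows shows standard AdaIN only, i.e., re-normalizing one modality's features to match the per-channel statistics of the other; how ADCNet actually wires this into its fusion branch is described in the paper, not here.

```python
import torch

def adaptive_instance_norm(content: torch.Tensor, style: torch.Tensor,
                           eps: float = 1e-5) -> torch.Tensor:
    """Standard AdaIN: re-normalise `content` features (B, C, H, W) so that their
    per-channel mean/std match those of `style`. Shown only to illustrate the AdaIN
    step named in Figure 5, not ADCNet's full fusion module."""
    c_mean = content.mean(dim=(2, 3), keepdim=True)
    c_std = content.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style.mean(dim=(2, 3), keepdim=True)
    s_std = style.std(dim=(2, 3), keepdim=True) + eps
    return (content - c_mean) / c_std * s_std + s_mean

if __name__ == "__main__":
    rgb_feat = torch.randn(2, 256, 32, 32)
    ir_feat = torch.randn(2, 256, 32, 32) * 3 + 1
    calibrated = adaptive_instance_norm(rgb_feat, ir_feat)
    print(calibrated.shape)  # torch.Size([2, 256, 32, 32])
```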
Figure 1. Illustration of spatial misalignment. (a) Low-quality bounding boxes caused by spatial misalignment. (b) The fusion image generated by TarDAL [22] produces ghosting, disturbing localization. (c) The proposed ADCNet learns the spatial relationship between RGB and IR to achieve spatial-discrepancy calibration at the feature level.
Figure 2. Directly fusing RGB and IR deep features with domain discrepancies results in overlapping distributions of object and background, making the task challenging for the detection head. Our method first performs domain-discrepancy calibration on the multi-modal features and then conducts feature fusion.
Figure 3. Comparison of RGB-infrared fusion methods at different stages: (a) early fusion; (b) mid-fusion; (c) late fusion.
Figure 4. An overview of our adaptive dual-discrepancy calibration network (ADCNet) for misaligned RGB-infrared object detection: (1) size adaption module; (2) spatial-discrepancy calibration; (3) domain-discrepancy calibration, the main focus of our method. The rotated picture highlights the issue of spatial misalignment, and the colormap of the feature represents only the domain discrepancy.
Figure 5. The flowchart of Adaptive Instance Normalization (AdaIN) and fusion. The highlighted part of the cube on the right illustrates the dimensions over which normalization is applied.
Figure 6. Confusion matrices of validation results on FLIR for different methods: (a) ADCNet (ours); (b) baseline; (c) pool and NMS; (d) CFT [19]; (e) TarDAL [22]; (f) infrared only.
Figure 7. Qualitative comparison of object-detection results in three scenarios (a-c) on the M3FD dataset. The rows, from top to bottom, are the results of infrared-only detection, TarDAL, CFT, the baseline, and our ADCNet. Scenes (a,b) show that ADCNet detects smaller objects more reliably; in scene (c), our results have higher confidence.
Figure 8. A demonstration of the effect of our spatial-discrepancy calibration module. The RGB and infrared images are from the misaligned version of the M3FD dataset. We project the RGB deep features of the baseline and the features after our calibration onto the original IR image (rows 3 and 4). Rows 5 and 6 enlarge the cropped regions shown by the red and green dashed frames in rows 3 and 4. The visualization shows that the features after the spatial-discrepancy calibration module coincide better with the IR image.
Figure 9. Visualization, via the t-SNE algorithm, of the distance relationship between the two modalities' features before (a) and after (b) the domain-discrepancy calibration module. Each point represents the feature of a cropped object or background region, with the green and blue five-pointed stars as typical examples. The RGB and IR features are aligned after our domain-discrepancy calibration. The rows, from top to bottom, correspond to the three adaptive dual-discrepancy calibrations in our network, located after the 1/8, 1/16, and 1/32 downsampling stages of the backbone.
19 pages, 4284 KiB  
Article
AOGC: Anchor-Free Oriented Object Detection Based on Gaussian Centerness
by Zechen Wang, Chun Bao, Jie Cao and Qun Hao
Remote Sens. 2023, 15(19), 4690; https://doi.org/10.3390/rs15194690 - 25 Sep 2023
Cited by 4 | Viewed by 1662
Abstract
Oriented object detection is a challenging task in scene text detection and remote sensing image analysis, and it has attracted extensive attention with the development of deep learning in recent years. Currently, mainstream oriented object detectors are anchor-based; these methods increase the computational load of the network and introduce a large amount of anchor-box redundancy. To address this issue, we propose an anchor-free oriented object detection method based on Gaussian centerness (AOGC), a single-stage anchor-free detector. Our method uses a contextual attention FPN (CAFPN) to obtain the contextual information of the target. We then design a label assignment method for oriented objects that selects higher-quality positive samples and is suitable for targets with large aspect ratios. Finally, we develop a Gaussian-kernel-based centerness branch that can effectively determine the significance of different anchors. AOGC achieves an mAP of 74.30% on the DOTA-1.0 dataset and 89.80% on the HRSC2016 dataset. Our experimental results show that AOGC outperforms other single-stage oriented object detectors and achieves performance similar to that of two-stage methods. Full article
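The centerness branch weights training locations with a two-dimensional Gaussian kernel fitted to the oriented box, so that points near the box center count more than points near its edge (see the heat map in Figure 7 below). The NumPy sketch below computes such a Gaussian centerness target for a set of points; the covariance scaling (sigma equal to a quarter of each side) is an assumption, not the paper's exact setting.

```python
import numpy as np

def gaussian_centerness(points: np.ndarray, obb) -> np.ndarray:
    """Centerness weight of each point under a 2-D Gaussian fitted to an oriented
    box (cx, cy, w, h, angle_rad): 1 at the center, decaying towards the box edge.
    Illustrative only; sigma = side / 4 is an assumed choice."""
    cx, cy, w, h, theta = obb
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    sigma = np.diag([(w / 4.0) ** 2, (h / 4.0) ** 2])
    cov = R @ sigma @ R.T                        # rotate the covariance with the box
    diff = points - np.array([cx, cy])
    m = np.einsum("ni,ij,nj->n", diff, np.linalg.inv(cov), diff)  # squared Mahalanobis
    return np.exp(-0.5 * m)

if __name__ == "__main__":
    pts = np.array([[32.0, 32.0], [40.0, 34.0], [60.0, 60.0]])
    print(gaussian_centerness(pts, obb=(32, 32, 40, 10, np.pi / 6)))
```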
Figure 1. Problems when using horizontal bounding boxes for remote sensing object detection. (a) Overlapping anchor boxes when horizontal bounding boxes are used to detect objects in aerial images; the green boxes are the detected horizontal bounding boxes. (b) The small proportion of target pixels when horizontal bounding boxes are used for objects with large aspect ratios; the red area is the target pixels, and the green area is the background pixels.
Figure 2. Preset anchor-box redundancy of anchor-based detectors. The green boxes are the preset rotated boxes of an anchor-based detector.
Figure 3. The overall architecture of the AOGC network, where C3, C4, and C5 are the output features of the backbone and CAM is the contextual attention module. H and W are the width and height of the feature map, and C is the number of categories. OLA is the proposed oriented label assignment method, and Gaussian centerness is a centerness branch based on a two-dimensional Gaussian kernel.
Figure 4. CAFPN structure diagram, where DConv denotes dilated convolution, IP denotes the interpolation operation, and Conv (1 × 1) denotes 1 × 1 convolution for channel conversion. (a)-(c) show the implementations of the three layers of the CAFPN structure.
Figure 5. The FCOS positive/negative sample division does not match OBB detection. The blue area is the FCOS positive-sample region, and the red box is the ground truth.
Figure 6. Oriented label assignment and its modification for objects with large aspect ratios. (a) After reducing the positive sampling area, the number of positive sample points for a target with a large aspect ratio becomes too small; the red box is the original sampling area, and the green box is the sampling area reduced by half. (b) The corrected sampling area; the blue area is the positive sampling region.
Figure 7. Heat map representation of the Gaussian centerness.
Figure 8. Partial visualization results of our method on the DOTA-1.0 dataset.
Figure 9. Partial visualization results of FCOS-R and our method on the DOTA dataset: (a) FCOS-R; (b) our method.
Figure 10. Partial visualization results of our method on the HRSC2016 dataset.
20 pages, 185958 KiB  
Article
High-Resolution Network with Transformer Embedding Parallel Detection for Small Object Detection in Optical Remote Sensing Images
by Xiaowen Zhang, Qiaoyuan Liu, Hongliang Chang and Haijiang Sun
Remote Sens. 2023, 15(18), 4497; https://doi.org/10.3390/rs15184497 - 13 Sep 2023
Cited by 3 | Viewed by 1839
Abstract
Small object detection in remote sensing enables the identification and analysis of inconspicuous but important information and plays a crucial role in various ground monitoring tasks. Because of their small size, the feature information available in small objects is very limited, making them easily buried in complex backgrounds. Although many breakthroughs have been made in this research hotspot, existing approaches still have two significant shortcomings: first, the down-sampling operations commonly used for feature extraction can barely preserve the weak features of tiny objects; second, convolutional neural network methods are limited in modeling the global context needed to handle cluttered backgrounds. To tackle these issues, a high-resolution network with transformer embedding parallel detection (HRTP-Net) is proposed in this paper. A high-resolution feature fusion network (HR-FFN) addresses the first problem by maintaining high-spatial-resolution features with enhanced semantic information. Furthermore, a Swin-transformer-based mixed attention module (STMA) augments the object information in the transformer block by establishing pixel-level correlations, enabling global background-object modeling and addressing the second shortcoming. Finally, a parallel detection structure for remote sensing is constructed by integrating the attentional outputs of STMA with standard convolutional features. The proposed method effectively mitigates the impact of intricate backgrounds on small objects. Comprehensive experiments on three representative remote sensing datasets with small objects (MASATI, VEDAI, and DOTA) demonstrate that the proposed HRTP-Net achieves promising and competitive performance. Full article
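Figure 3 below describes an E-SPP block inside HR-FFN that combines an Efficient Channel Attention (ECA) module with spatial pyramid pooling. Only the ECA part is sketched here in PyTorch, as a reminder of how that published module works (global average pooling followed by a cheap 1-D convolution across channels); the kernel size and the exact placement inside E-SPP are assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention: global average pooling followed by a 1-D
    convolution across channels and a sigmoid gate. Sketch of the ECA part of
    E-SPP only; kernel size is an assumed value."""
    def __init__(self, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        y = x.mean(dim=(2, 3))                # (B, C) channel descriptor
        y = self.conv(y.unsqueeze(1))         # 1-D conv across the channel axis
        w = torch.sigmoid(y).view(b, c, 1, 1)
        return x * w                          # re-weight the input channels

if __name__ == "__main__":
    print(ECA()(torch.randn(2, 128, 16, 16)).shape)  # torch.Size([2, 128, 16, 16])
```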
Figure 1. Examples of challenges in object detection in remote sensing images. (a) An extremely small object in the bottom-left corner of the image. (b) A ship surrounded by complex water waves. (c) Objects disturbed by illumination variation and shadows. (d) Similar interference around objects.
Figure 2. Overview of the proposed HRTP-Net, which contains four main components: (1) the backbone for extracting basic high-resolution feature information; (2) HR-FFN for retaining the semantic information of high-resolution features (Figure 3); (3) the auxiliary detection structure built around the attention module STMA for enhancing feature effectiveness (Figure 4); (4) the main detection structure for aggregating different features and performing the final detection.
Figure 3. (a) Proposed structure of HR-FFN, consisting of a basic block, a fuse layer, and E-SPP. Black arrows represent forward propagation; red arrows indicate the residual connections that combine shallow and deep features. (b) Structure of ECA Spatial Pyramid Pooling (E-SPP), which consists of an ECA module and an SPP module.
Figure 4. (a) Standard Swin transformer block [17]. (b) Structure of STMA, consisting of a window-based transformer block and a mixed attention module (MAM). Unlike the standard Swin transformer block, the shifted-window transformer block is replaced with the proposed MAM to enhance the interaction between windows.
Figure 5. The proportion of images with different object-to-image ratios in the MASATI, VEDAI, and DOTA datasets.
Figure 6. Comparison of detection results before and after using HR-FFN in HRTP-Net: (a) baseline; (b) baseline + HR-FFN.
Figure 7. Visualization of feature maps and detection results: (a) feature maps produced by the model without STMA; (b) feature maps produced by the model with STMA, showing that STMA suppresses most of the cluttered background; (c) detection results of the baseline with the auxiliary detection structure.
Figure 8. Examples of detection results on the MASATI dataset using YOLOv5, HRNet, PAG-YOLO, TPH-YOLOv5, and the proposed method.
Figure 9. Selected detection results on the VEDAI dataset obtained by HRTP-Net.
Figure 10. Selected detection results on the DOTA dataset obtained by HRTP-Net.
Figure 11. Test results of the algorithm under different challenges: (a) an extremely small object in the bottom-left corner of the image; (b) a ship surrounded by complex water waves; (c) objects disturbed by illumination variation and shadows; (d) similar interference around objects.
20 pages, 43092 KiB  
Article
RTV-SIFT: Harnessing Structure Information for Robust Optical and SAR Image Registration
by Siqi Pang, Junyao Ge, Lei Hu, Kaitai Guo, Yang Zheng, Changli Zheng, Wei Zhang and Jimin Liang
Remote Sens. 2023, 15(18), 4476; https://doi.org/10.3390/rs15184476 - 12 Sep 2023
Cited by 4 | Viewed by 1695
Abstract
Registration of optical and synthetic aperture radar (SAR) images is challenging because it is difficult to extract identically located, distinctive features from both images. This paper proposes a novel optical and SAR image registration method based on relative total variation (RTV) and the scale-invariant feature transform (SIFT), named RTV-SIFT, which extracts feature points on the edges of structures and constructs structural edge descriptors to improve registration accuracy. First, a novel RTV-Harris feature point detection method, combining RTV with the multiscale Harris algorithm, is proposed to extract feature points on the significant structures of both images, ensuring a high repetition rate of the feature points. Second, the feature point descriptors are constructed on an enhanced phase congruency edge (EPCE) map, which combines the Sobel operator with the maximum moment of phase congruency (PC) to extract edges from the structure images, enhancing robustness to nonlinear intensity differences and speckle noise. Finally, after coarse registration, the position and orientation Euclidean distance (POED) between feature points is used for fine feature point matching to further improve registration accuracy. The experimental results demonstrate the superiority of the proposed RTV-SIFT method across different scenes and image capture conditions, indicating its robustness and effectiveness in optical and SAR image registration. Full article
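The fine-matching step ranks candidate correspondences by a position and orientation Euclidean distance (POED) that combines how far apart two keypoints are spatially with how much their dominant orientations differ. The paper defines its own POED formula; the small sketch below only illustrates one simple way such a combined distance could be computed, with the weighting factor `alpha` as an assumption.

```python
import numpy as np

def poed(p1, p2, theta1, theta2, alpha: float = 1.0) -> float:
    """Illustrative position-and-orientation distance between two keypoints:
    spatial Euclidean distance plus a weighted orientation difference.
    Not the paper's exact POED definition."""
    spatial = np.hypot(p1[0] - p2[0], p1[1] - p2[1])
    # Wrap the orientation difference into [-pi, pi] before taking its magnitude.
    d_theta = np.abs((theta1 - theta2 + np.pi) % (2 * np.pi) - np.pi)
    return float(spatial + alpha * d_theta)

if __name__ == "__main__":
    print(poed((10.0, 12.0), (11.5, 12.5), np.deg2rad(30), np.deg2rad(35)))
```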
Figure 1. Feature points extracted and matched in an optical image (left) and a SAR image (right) by (a) PSO-SIFT, (b) SAR-SIFT, and (c) our RTV-SIFT method. The points shown are the detected feature points, with green points indicating correct matches. mCMR is the mean correct matching ratio (CMR, defined in Section 3).
Figure 2. Diagram of the RTV-SIFT method. Abbreviations: RTV, relative total variation; EPCE, enhanced phase congruency edge; NNDR, nearest neighbor distance ratio; FSC, fast sample consensus; POED, position and orientation Euclidean distance. "&", "x", and "+" denote the "and", "multiply", and "add" operations on the corresponding pixel values, respectively.
Figure 3. Example edge maps of (a) an optical image extracted by (b) the Sobel operator, (c) phase congruency, and (d) the proposed enhanced phase congruency edge detector.
Figure 4. Eight randomly selected pairs of aligned optical (top) and SAR (bottom) images from the OS dataset [38], where (a,c,h) are urban areas, (b,f,g) are farmland areas, (e) is a suburban area, and (d) is an airport area.
Figure 5. Eleven pairs of unaligned large-scene optical and SAR images; (a)-(k) are image pairs 1-11, respectively.
Figure 6. The average CMN (a) and the computation time (b) of the RTV-SIFT method as the number of layers N in the RTV iteration space varies from 1 to 11.
Figure 7. (a)-(h) Feature point repeatability rates for the image pairs (a)-(h) in Figure 4, respectively, plotted against the localization error.
Figure 8. CMN of feature descriptors based on different edge detectors.
Figure 9. Qualitative comparison of different registration methods (from left to right: Harris-PIIFD, OS-SIFT, PSO-SIFT, and RTV-SIFT (ours)); (a)-(d) are the results on test image pairs 1, 2, 7, and 11, respectively.
Figure 10. Qualitative comparison of the registration results of PSO-SIFT and RTV-SIFT (ours) on test image pair 1, shown as checkerboard images. The left patches of the zoomed-in views in (b) are from PSO-SIFT, and the right patches are from RTV-SIFT.
Figure 11. Simulated optical images under different imaging conditions and their registration results with RTV-SIFT: (a,d,g) simulated optical images with illumination variation, noise, and cloud cover, respectively; (b,e,h) corresponding points in the image pairs; (c,f,i) checkerboard images of the registration results.
Figure 12. Average CMN (top row) and average S_cat (bottom row) of PSO-SIFT, OS-SIFT, and RTV-SIFT (ours) on the test images under different simulated optical imaging conditions: (a,d) simulated illumination variation; (b,e) simulated noise interference; (c,f) simulated cloud obscuration.
Figure 13. Registration effects in locally weakly structured regions.