SS R-CNN: Self-Supervised Learning Improving Mask R-CNN for Ship Detection in Remote Sensing Images
"> Figure 1
<p>Illustration of the designed self-supervised-based object detection network SS R-CNN. (<b>a</b>) Self-supervised learning module. The network is trained by a more-way CutPaste classification task on images with no target objects. (<b>b</b>) Object detection module. The representation network migrates from the self-supervised learning module to a modified Mask R-CNN network that generates dense candidate anchor frames through the RPN network and determines the final detection box by the ROI align technique.</p> "> Figure 2
<p>The CutPaste self-supervised tasks. After color jittering, the input image is subjected to various cut and paste operations, and the feature representation network is trained by auxiliary classification tasks. The more-way CutPaste operation is specially designed in this work.</p> "> Figure 3
<p>The framework of FPN. C1-C5 indicate ResNet convolution layers; 1 × 1 and 3 × 3 denote 1-dimensional convolution and 3-dimensional convolution operations, respectively; 2× denotes 2-fold upsampling, and 0.5× denotes 0.5-fold downsampling, i.e., max-pooling; ⊕ stands for the element-wise summing operation; P2–P6 denote the output fused feature maps; 200, 100, 50, 25, and 13 are the corresponding dimensions of feature maps of P2–P5.</p> "> Figure 4
<p>Illustration of the Airbus dataset. The captured ships take up various sizes in the image area, with (<b>a</b>) large ships occupying areas more than <math display="inline"><semantics> <mrow> <mn>96</mn> <mo>×</mo> <mn>96</mn> </mrow> </semantics></math> pixels, (<b>b</b>) medium-sized ships occupy areas between that of the large and small ships, (<b>c</b>) and small ships occupy areas less than <math display="inline"><semantics> <mrow> <mn>32</mn> <mo>×</mo> <mn>32</mn> </mrow> </semantics></math> pixels; (<b>d</b>) images of marine surface with no ships.</p> "> Figure 5
<p>Detected ships of the tested methods. The leftmost column displays several typical remote sensing images, including scenarios with multiple ships (the first two rows), middle-sized ships (the third row), small ships (the fourth row), and large ships accompanied by small boats (fifth row). The following columns display the ships detected by SS R-CNN, SSD, and Mask RCNN, respectively, in different images. The detected ships are marked with rectangles, while the red rectangles in the first column are the ground-truth boxes of the dataset.</p> "> Figure 6
<p>Variation of the accuracies of SS R-CNN with different CutPaste tasks.</p> "> Figure 7
<p>Detected objects by SS R-CNN with various CutPaste tasks in different images. Detection frames of different methods are depicted with different colors.</p> ">
Abstract
1. Introduction
- A self-supervised task is specially designed for small object detection in remote sensing images. It exploits the characteristic that multiple ships of different sizes often coexist in a single image of interest, improving the representational capability of the backbone network.
- A self-supervised marine ship detection method is proposed for small remote sensing datasets. Semantic representation learning is accomplished by making full use of unannotated images that contain no detection objects. The method therefore greatly reduces the number of annotated images required, improves detection accuracy, and broadens its scope of application.
2. Method
2.1. The CutPaste-Based Self-Supervised Learning Module
- (i) The block operation cuts a rectangular area, applies color jittering, and then pastes it onto a random position in the image. In this work, we deliberately set larger aspect ratios to generate elongated rectangles and vary the rectangle areas; in particular, some small rectangles are generated.
- (ii) The scar operation elongates and then rotates the clipped area after a block operation.
- (iii) The 3-way operation combines the two operations above: it augments an image randomly with either a block or a scar operation. The corresponding classifier identifies the image as a normal marine image, an image produced by block, or an image produced by scar.
- (iv) The more-way operation proceeds as follows:
  - (a) Select and cut 0–20 rectangles of various areas at random;
  - (b) Apply rotation, color jittering, and scaling to the selected rectangle areas;
  - (c) Paste the rectangles onto the original image at random positions. The auxiliary learning task is to predict the number of cut-and-paste rectangles in the image.
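The more-way operation described above can be sketched as a simple image augmentation. The following is a minimal NumPy illustration, not the authors' implementation; the rectangle-size bounds, jitter strength, and scaling range are assumptions, while the 0–20 rectangle count and the count-prediction label follow the description above.

```python
import numpy as np

def more_way_cutpaste(image, rng, max_rects=20):
    """Apply a more-way CutPaste operation to an H x W x 3 uint8 image.

    Cuts up to `max_rects` random rectangles, perturbs each one (rotation
    by a multiple of 90 degrees, color jitter, scaling), and pastes it back
    at a random position.  Returns the augmented image and the number of
    pasted rectangles, which serves as the self-supervised class label.
    """
    out = image.copy()
    h, w = image.shape[:2]
    n_rects = int(rng.integers(0, max_rects + 1))  # step (a): 0-20 rectangles
    for _ in range(n_rects):
        # Cut a rectangle of random size and position (elongated shapes allowed).
        rh = int(rng.integers(2, max(3, h // 4)))
        rw = int(rng.integers(2, max(3, w // 4)))
        y0 = int(rng.integers(0, h - rh))
        x0 = int(rng.integers(0, w - rw))
        patch = image[y0:y0 + rh, x0:x0 + rw].astype(np.float32)
        # Step (b): rotate by a multiple of 90 degrees and jitter the colors.
        patch = np.rot90(patch, k=int(rng.integers(0, 4)))
        patch = np.clip(patch + rng.uniform(-25, 25, size=3), 0, 255)
        # Step (b): scale by nearest-neighbor resampling (assumed factor in [0.5, 2]).
        scale = rng.uniform(0.5, 2.0)
        ph = max(1, min(h, int(patch.shape[0] * scale)))
        pw = max(1, min(w, int(patch.shape[1] * scale)))
        ys = np.arange(ph) * patch.shape[0] // ph
        xs = np.arange(pw) * patch.shape[1] // pw
        patch = patch[ys][:, xs]
        # Step (c): paste the perturbed rectangle at a random position.
        py = int(rng.integers(0, h - ph + 1))
        px = int(rng.integers(0, w - pw + 1))
        out[py:py + ph, px:px + pw] = patch.astype(np.uint8)
    return out, n_rects
```

The returned count makes this a 21-way classification task: the auxiliary head is trained to predict how many rectangles were pasted.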
2.2. The Object Detection Module
3. Results
3.1. Dataset and Platforms
3.2. Experimental Setup
3.3. Comparison with the Baseline Methods
- (i) As can be seen from Table 2, the detection accuracies of SS R-CNN exceed those of the other tested supervised methods, Mask R-CNN and SSD, in terms of mAP, AP50, AP75, APs, and APm. Since one main difference between SS R-CNN and Mask R-CNN is the designed self-supervised learning module, the results indicate that this module extracts helpful semantic feature information from the unlabeled images.
- (ii) Comparing SS R-CNN with Mask R-CNN pretrained by MOCO and SimCLR (the last two rows of Table 2) shows that the designed more-way CutPaste module captures feature representations more effectively for the downstream ship detection task.
- (iii) For large target objects, SS R-CNN and Mask R-CNN lag considerably behind SSD in terms of APl. One main reason is that SS R-CNN also employs the Mask R-CNN module, whose detection capacity for large objects is restricted by the number of labeled images. Another contributing factor is the size of the candidate anchor boxes. We discuss these two factors in detail in this subsection.
- (iv) The detection of small target objects is the main weakness of SSD.
- (i) For the images with multiple ships (the first two rows), SS R-CNN correctly detects multiple objects, alleviating the issue of missed vessels;
- (ii) SS R-CNN performs better than the supervised methods, SSD and Mask R-CNN, on small and medium-sized ships, as its predicted boxes are more accurate;
- (iii) The boxes detected by SS R-CNN and Mask R-CNN are less accurate than those of SSD for large ships. In the last row of Figure 5, there is a large ship with a small boat beside it; SSD correctly detects both objects, while the boxes predicted by SS R-CNN and Mask R-CNN are inaccurate.
Exploring the Reason for Relatively Low Detection Accuracies for Large Ships
3.4. Employing Different CutPaste Tasks in the Self-Supervised Learning Module
3.5. Effect of the Number of Labeled Training Images
3.6. Effects of the Number of Rectangles in More-Way CutPaste Operation for SS R-CNN
4. Discussion
4.1. Effects of the Input Image Resolution
4.2. Deficiencies of SS R-CNN
5. Conclusions
- (i) The proposed self-supervised learning module can extract semantic features from unlabeled clean images of marine surfaces. With a limited number of labeled images, SS R-CNN evidently improved the detection accuracy over the baseline supervised learning methods: compared with the best of the baselines, it raised the mAP from 0.528 to 0.622 and improved the detection accuracy for small target objects by 22.8%.
- (ii) Compared with the typical CutPaste tasks, the proposed self-supervised learning module equipped with the designed more-way CutPaste task further reduces the numbers of undetected and incorrectly detected objects and hence improves the accuracy.
- (iii) The proposed self-supervised detection framework greatly reduces the number of labeled images required. For instance, with only 200 labeled images, SS R-CNN achieved an mAP of 0.476.
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), Munich, Germany, 5–9 October 2015; pp. 234–241.
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; pp. 91–99.
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2961–2969.
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37.
- Kalantidis, Y.; Sariyildiz, M.B.; Pion, N.; Weinzaepfel, P.; Larlus, D. Hard negative mixing for contrastive learning. In Proceedings of the Annual Conference on Neural Information Processing Systems, Virtual, 6–11 December 2020; pp. 21798–21809.
- He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 9726–9735.
- Grill, J.B.; Strub, F.; Altché, F.; Tallec, C.; Richemond, P.; Buchatskaya, E.; Doersch, C.; Pires, B.A.; Guo, Z.; Azar, M.G.; et al. Bootstrap your own latent—A new approach to self-supervised learning. In Proceedings of the Annual Conference on Neural Information Processing Systems, Virtual, 6–11 December 2020; pp. 21271–21284.
- Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the 37th International Conference on Machine Learning (ICML), Virtual, 13–18 July 2020; pp. 1597–1607.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008.
- Pathak, D.; Krahenbuhl, P.; Donahue, J. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2536–2544.
- Larsson, G.; Maire, M.; Shakhnarovich, G. Colorization as a proxy task for visual understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
- Bachman, P.; Hjelm, R.D.; Buchwalter, W. Learning representations by maximizing mutual information across views. In Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019; pp. 15509–15519.
- Shorten, C.; Khoshgoftaar, T.M. A survey on image data augmentation for deep learning. J. Big Data 2019, 6, 1–48.
- Li, C.L.; Sohn, K.; Yoon, J.; Pfister, T. Cutpaste: Self-supervised learning for anomaly detection and localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 9664–9674.
- Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
- Kaggle. Airbus Ship Detection Challenge. Available online: https://www.kaggle.com/c/airbus-ship-detection/data (accessed on 31 July 2018).
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–12 September 2014; pp. 740–755.
Metric | Characteristic
---|---
mAP | the average of the AP values calculated with IoU thresholds in the interval [0.5, 0.95] with a step size of 0.05
AP50 | the AP value calculated with an IoU threshold of 0.5
AP75 | the AP value calculated with an IoU threshold of 0.75
APs | the AP value for small objects (occupied area smaller than 32 × 32 pixels)
APm | the AP value for medium objects (occupied area between 32 × 32 and 96 × 96 pixels)
APl | the AP value for large objects (occupied area larger than 96 × 96 pixels)
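As a concrete illustration of how these metrics partition the detections, the sketch below computes the IoU between two boxes and assigns an object to the small/medium/large buckets. This is an illustrative helper, not evaluation code from the paper; the 32 × 32 and 96 × 96 area thresholds follow the standard COCO convention assumed here.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def size_bucket(area):
    """COCO-style size bucket used by APs / APm / APl."""
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"

# A detection counts toward AP50 when IoU >= 0.5 and toward AP75 when
# IoU >= 0.75; mAP averages AP over IoU thresholds 0.50, 0.55, ..., 0.95.
```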
Method | mAP | AP50 | AP75 | APs | APm | APl
---|---|---|---|---|---|---
SS R-CNN | 0.622 | 0.758 | 0.658 | 0.620 | 0.723 | 0.158
Mask R-CNN | 0.528 | 0.688 | 0.559 | 0.505 | 0.649 | 0.199
SSD | 0.257 | 0.513 | 0.230 | 0.127 | 0.536 | 0.541
MOCO + Mask R-CNN | 0.520 | 0.698 | 0.548 | 0.557 | 0.589 | 0.105
SimCLR + Mask R-CNN | 0.484 | 0.657 | 0.502 | 0.540 | 0.550 | 0.108
Method | mAP | AP50 | AP75 | APs | APm | APl
---|---|---|---|---|---|---
SS R-CNN (6:3:1) | 0.622 | 0.758 | 0.658 | 0.620 | 0.723 | 0.158
SS R-CNN (1:1:8) | 0.374 | 0.573 | 0.382 | 0.395 | 0.453 | 0.246
SS R-CNN (3:3:4) | 0.492 | 0.671 | 0.524 | 0.491 | 0.617 | 0.251
Mask R-CNN (6:3:1) | 0.528 | 0.688 | 0.559 | 0.505 | 0.649 | 0.199
Mask R-CNN (1:1:8) | 0.366 | 0.543 | 0.384 | 0.370 | 0.435 | 0.260
Mask R-CNN (3:3:4) | 0.481 | 0.664 | 0.507 | 0.482 | 0.596 | 0.269
Method | mAP | AP50 | AP75 | APs | APm | APl
---|---|---|---|---|---|---
SS R-CNN | 0.622 | 0.758 | 0.658 | 0.620 | 0.723 | 0.158
SS R-CNN (adjusted) | 0.588 | 0.754 | 0.618 | 0.621 | 0.657 | 0.267
CutPaste Task | mAP | AP50 | AP75 | APs | APm | APl
---|---|---|---|---|---|---
block | 0.566 | 0.733 | 0.602 | 0.569 | 0.694 | 0.127
scar | 0.553 | 0.697 | 0.588 | 0.521 | 0.698 | 0.227
3-way | 0.560 | 0.717 | 0.584 | 0.537 | 0.686 | 0.196
more-way | 0.622 | 0.758 | 0.658 | 0.620 | 0.723 | 0.158
Method | Training Size | mAP | AP50 | AP75 | APs | APm | APl
---|---|---|---|---|---|---|---
SS R-CNN | 200 | 0.476 | 0.650 | 0.494 | 0.515 | 0.536 | 0.146
SS R-CNN | 400 | 0.514 | 0.691 | 0.543 | 0.546 | 0.606 | 0.215
SS R-CNN | 1000 | 0.622 | 0.758 | 0.658 | 0.620 | 0.723 | 0.158
SS R-CNN | 2000 | 0.594 | 0.754 | 0.630 | 0.642 | 0.657 | 0.158
SS R-CNN | 5000 | 0.598 | 0.748 | 0.627 | 0.661 | 0.656 | 0.158
SSD | 200 | 0.179 | 0.407 | 0.142 | 0.105 | 0.364 | 0.215
SSD | 400 | 0.222 | 0.438 | 0.201 | 0.121 | 0.478 | 0.245
SSD | 1000 | 0.257 | 0.513 | 0.230 | 0.127 | 0.536 | 0.541
SSD | 2000 | 0.249 | 0.492 | 0.234 | 0.129 | 0.512 | 0.396
SSD | 5000 | 0.297 | 0.561 | 0.288 | 0.156 | 0.595 | 0.524
MOCO + Mask R-CNN | 200 | 0.466 | 0.652 | 0.476 | 0.506 | 0.531 | 0.112
MOCO + Mask R-CNN | 400 | 0.507 | 0.681 | 0.537 | 0.550 | 0.580 | 0.107
MOCO + Mask R-CNN | 1000 | 0.520 | 0.698 | 0.548 | 0.557 | 0.589 | 0.105
MOCO + Mask R-CNN | 2000 | 0.557 | 0.720 | 0.583 | 0.604 | 0.595 | 0.106
MOCO + Mask R-CNN | 5000 | 0.554 | 0.736 | 0.584 | 0.608 | 0.619 | 0.116
Method | mAP | AP50 | AP75 | APs | APm | APl
---|---|---|---|---|---|---
SS R-CNN (#rectangles 0–20) | 0.622 | 0.758 | 0.658 | 0.620 | 0.723 | 0.158
SS R-CNN (#rectangles) | 0.519 | 0.682 | 0.555 | 0.537 | 0.591 | 0.153
SS R-CNN (#rectangles) | 0.543 | 0.714 | 0.560 | 0.594 | 0.586 | 0.098
Jian, L.; Pu, Z.; Zhu, L.; Yao, T.; Liang, X. SS R-CNN: Self-Supervised Learning Improving Mask R-CNN for Ship Detection in Remote Sensing Images. Remote Sens. 2022, 14, 4383. https://doi.org/10.3390/rs14174383