FASSD: A Feature Fusion and Spatial Attention-Based Single Shot Detector for Small Object Detection
<p><b>Figure 1.</b> (<b>a</b>) Extracting feature pyramids from image pyramids, which is inefficient. (<b>b</b>) Performing detection on a pyramidal feature hierarchy generated by a convolutional neural network (ConvNet), as in the single shot multibox detector (SSD) [<a href="#B4-electronics-09-01536" class="html-bibr">4</a>]. (<b>c</b>) Top-down feature fusion methods adopted by [<a href="#B5-electronics-09-01536" class="html-bibr">5</a>,<a href="#B6-electronics-09-01536" class="html-bibr">6</a>]. (<b>d</b>) Fusing features from all scales at every scale, as in adaptively spatial feature fusion (ASFF) [<a href="#B7-electronics-09-01536" class="html-bibr">7</a>] and the rainbow single shot detector (R-SSD) [<a href="#B8-electronics-09-01536" class="html-bibr">8</a>]. (<b>e</b>) Fusing multi-scale features for further extraction. (<b>f</b>) Our proposed feature fusion method: only the adjacent-scale features up to the current layer are fused.</p>
<p><b>Figure 2.</b> Schematic of the feature fusion and spatial attention-based single shot detector (FASSD) architecture.</p>
<p><b>Figure 3.</b> Schematic of the simplified FASSD architecture for small object detection.</p>
<p><b>Figure 4.</b> Feature fusion block and spatial attention block. The "+1" operation is equivalent to a residual connection.</p>
<p><b>Figure 5.</b> Offline data augmentation.</p>
<p><b>Figure 6.</b> Curves of detection rate and false alarm rate.</p>
<p><b>Figure 7.</b> Visualization of feature maps. From left to right: the original image (column 1), the previous feature maps (column 2), the attention mask extracted by the branch with 1 × 1 kernels (column 3), the attention mask extracted by the branch with 3 × 3 kernels (column 4), the fusion of the attention masks (column 5), and the new feature maps output by the attention block (column 6).</p>
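The attention block sketched in the captions above fuses the masks from the 1 × 1 and 3 × 3 branches and applies the result to the features with a "+1" residual. The following NumPy sketch illustrates that flow; the single-channel layout, elementwise-sum mask fusion, and sigmoid gating are simplifying assumptions for illustration, not the paper's exact implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv2d_same(x, w):
    """Naive 'same'-padded 2D convolution on a single-channel map."""
    kh, kw = w.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.empty_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * w)
    return out

def spatial_attention(x, w1, w3):
    """Two branches produce spatial masks; their fusion gates the input."""
    m1 = conv2d_same(x, w1)     # branch with a 1 x 1 kernel
    m3 = conv2d_same(x, w3)     # branch with a 3 x 3 kernel
    mask = sigmoid(m1 + m3)     # fuse the two attention masks
    return x * (mask + 1.0)     # "+1" acts as a residual: x + x * mask
```

Because of the "+1", regions the mask suppresses still pass the original features through, which matches the caption's description of the operation as a residual connection.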
<p><b>Figure 8.</b> Detection results of FASSD compared with SSD. The second row of each subfigure shows the detection results of FASSD.</p>
<p><b>Figure 9.</b> Visualization of results on LAKE-BOAT. The second row of each subfigure shows the detection results of FASSD.</p>
Abstract
1. Introduction
2. Related Work
2.1. Feature Fusion Methods Based on SSD
2.1.1. Using a Feature Pyramid Network to Enhance Feature Extraction
2.1.2. Multi-Scale Feature Fusion
2.1.3. Enhanced Modules
2.1.4. Dense Connection
2.2. Attention Mechanism
2.3. Small Object Detection
2.3.1. Data Augmentation: Increasing the Number of Small Objects
2.3.2. Detection in High-Resolution Maps
2.3.3. Increasing the Number of Matching Anchors for Small Objects
3. Methods
3.1. FASSD Architecture
3.1.1. FASSD Architecture
3.1.2. Simplified FASSD Version for Small Object Detection
3.2. Feature Fusion Block
3.3. Spatial Attention Block
3.4. LAKE-BOAT Dataset for Small Object Detection
3.5. Training
3.5.1. Data Augmentation
3.5.2. Transfer Learning
3.5.3. Anchor Setting
3.5.4. Loss Function
4. Experiments
4.1. Ablation Study of VOC
4.1.1. Extra Scale of 75 × 75
4.1.2. Feature Fusion
4.1.3. Two Versions of SAB
4.1.4. Group Convolution
4.2. Results on PASCAL VOC2007
4.3. Inference Speed on PASCAL VOC2007 Test
4.4. Small Object Detection on LAKE-BOAT
4.4.1. Results and Inference Speed
4.4.2. Transfer Learning and Data Augmentation
4.4.3. Detection Rate and False Alarm Rate
4.5. Visualization Analysis
4.6. Visualization of Results
5. Conclusions and Future Work
Author Contributions
Funding
Conflicts of Interest
References
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. In NIPS 2012; Neural Information Processing Systems Foundation, Inc. (NIPS): Lake Tahoe, NV, USA, 2012; pp. 1097–1105.
- Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision 2004, 60, 91–110.
- Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), San Diego, CA, USA, 20–25 June 2005; pp. 886–893.
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37.
- Fu, C.-Y.; Liu, W.; Ranga, A.; Tyagi, A.; Berg, A.C. DSSD: Deconvolutional single shot detector. arXiv 2017, arXiv:1701.06659.
- Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
- Liu, S.; Huang, D.; Wang, Y. Learning spatial fusion for single-shot object detection. arXiv 2019, arXiv:1911.09516.
- Jeong, J.; Park, H.; Kwak, N. Enhancement of SSD by concatenating feature maps for object detection. arXiv 2017, arXiv:1705.09587.
- Zou, Z.; Shi, Z.; Guo, Y.; Ye, J. Object detection in 20 years: A survey. arXiv 2019, arXiv:1905.05055.
- Zhang, S.; Wen, L.; Bian, X.; Lei, Z.; Li, S.Z. Single-shot refinement neural network for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4203–4212.
- Zhao, Q.; Sheng, T.; Wang, Y.; Tang, Z.; Chen, Y.; Cai, L.; Ling, H. M2Det: A single-shot object detector based on multi-level feature pyramid network. arXiv 2018, arXiv:1811.04533.
- Li, Z.; Zhou, F. FSSD: Feature fusion single shot multibox detector. arXiv 2017, arXiv:1712.00960.
- Cui, L.; Ma, R.; Lv, P.; Jiang, X.; Gao, Z.; Zhou, B.; Xu, M. MDSSD: Multi-scale deconvolutional single shot detector for small objects. Sci. China Inf. Sci. 2020, 63, 120113.
- Zhao, Q.; Sheng, T.; Wang, Y.; Ni, F.; Cai, L. CFENet: An accurate and efficient single-shot object detector for autonomous driving. arXiv 2018, arXiv:1806.09790.
- Liu, S.; Huang, D. Receptive field block net for accurate and fast object detection. arXiv 2017, arXiv:1711.07767.
- Shen, Z.; Liu, Z.; Li, J.; Jiang, Y.-G.; Chen, Y.; Xue, X. DSOD: Learning deeply supervised object detectors from scratch. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1919–1927.
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
- Yu, C.; Wang, J.; Peng, C.; Gao, C.; Yu, G.; Sang, N. Learning a discriminative feature network for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1857–1866.
- Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7794–7803.
- Cao, Y.; Xu, J.; Lin, S.; Wei, F.; Hu, H. GCNet: Non-local networks meet squeeze-excitation networks and beyond. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Seoul, Korea, 27–28 October 2019; pp. 1971–1980.
- Woo, S.; Park, J.; Lee, J.-Y.; So Kweon, I. CBAM: Convolutional block attention module. arXiv 2018, arXiv:1807.06521.
- Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3146–3154.
- Chen, L.; Zhang, H.; Xiao, J.; Nie, L.; Shao, J.; Liu, W.; Chua, T.-S. SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5659–5667.
- Kisantal, M.; Wojna, Z.; Murawski, J.; Naruniec, J.; Cho, K. Augmentation for small object detection. arXiv 2019, arXiv:1902.07296.
- Chen, X.; Kundu, K.; Zhu, Y.; Berneshawi, A.G.; Ma, H.; Fidler, S.; Urtasun, R. 3D object proposals for accurate object class detection. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, December 2015; pp. 424–432. Available online: http://papers.nips.cc/paper/5644-3d-object-proposals-for-accurate-object-class-detection (accessed on 13 July 2019).
- Hu, P.; Ramanan, D. Finding tiny faces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 951–959.
- Li, J.; Liang, X.; Wei, Y.; Xu, T.; Feng, J.; Yan, S. Perceptual generative adversarial networks for small object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1222–1230.
- Krishna, H.; Jawahar, C. Improving small object detection. In Proceedings of the 4th IAPR Asian Conference on Pattern Recognition (ACPR), Nanjing, China, 26–29 November 2017; pp. 340–345.
- Mao, X.-J.; Shen, C.; Yang, Y.-B. Image restoration using convolutional auto-encoders with symmetric skip connections. arXiv 2016, arXiv:1606.08921.
- Zhang, S.; Zhu, X.; Lei, Z.; Shi, H.; Wang, X.; Li, S.Z. FaceBoxes: A CPU real-time face detector with high accuracy. In Proceedings of the 2017 IEEE International Joint Conference on Biometrics (IJCB), Denver, CO, USA, 1–4 October 2017; pp. 1–9.
- Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6154–6162.
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
- Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Semantic image segmentation with deep convolutional nets and fully connected CRFs. arXiv 2014, arXiv:1412.7062.
- Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; He, K. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1492–1500.
- Degroot, M.; Brown, E. SSD: Single Shot MultiBox Object Detector, in PyTorch. Available online: https://github.com/amdegroot/ssd.pytorch (accessed on 20 July 2019).
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Xiang, W.; Zhang, D.-Q.; Yu, H.; Athitsos, V. Context-aware single-shot detector. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; pp. 1784–1793.
| | Train Set | Aug Set | Training Set | Test Set |
|---|---|---|---|---|
| Images | 250 | 250 | 500 | 100 |
| Objects | 954 | 3816 | 4770 | 625 |
| Extra-small objects | 465 | 3084 | 3549 | 426 |
| Small objects | 314 | 580 | 894 | 134 |
| Medium objects | 138 | 132 | 270 | 48 |
| Large objects | 32 | 20 | 52 | 13 |
| Extra-large objects | 5 | 0 | 5 | 4 |
| Proportion of extra-small and small objects | 81.7% | 96.0% | 93.1% | 89.6% |
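As a sanity check on the LAKE-BOAT statistics above, the "proportion of extra-small and small objects" row follows directly from the object counts. A quick computation for the train-set column:

```python
# Object counts for the LAKE-BOAT train set, taken from the table above.
train = {"extra-small": 465, "small": 314, "medium": 138, "large": 32, "extra-large": 5}

total = sum(train.values())                                 # 954 objects
small_share = (train["extra-small"] + train["small"]) / total
print(f"{100 * small_share:.1f}%")                          # 81.7%, matching the table
```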
| Level | Attribute | 75 × 75 Maps | 300 × 300 Images |
|---|---|---|---|
| 1 | Extra-small | 0–3² | 0–12² |
| 2 | Small | 3²–6² | 12²–24² |
| 3 | Medium | 6²–12² | 24²–48² |
| 4 | Large | 12²–24² | 48²–96² |
| 5 | Extra-large | 24²– | 96²– |
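A small helper makes the level assignment above concrete. The function name `size_level` is hypothetical, and it assumes the boundaries on 300 × 300 input images are 12², 24², 48², and 96² pixels of bounding-box area, as in the table:

```python
# Hypothetical helper: assign an object to one of the five size levels,
# assuming area thresholds of 12^2, 24^2, 48^2, and 96^2 pixels on a
# 300 x 300 input image.
LEVELS = ["extra-small", "small", "medium", "large", "extra-large"]
BOUNDS = [12 ** 2, 24 ** 2, 48 ** 2, 96 ** 2]  # upper bounds for levels 1-4

def size_level(area):
    for name, upper in zip(LEVELS, BOUNDS):
        if area < upper:
            return name
    return LEVELS[-1]  # everything at or above 96^2 is extra-large

print(size_level(10 * 10))   # extra-small
print(size_level(50 * 50))   # large
```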
| Method | 75 × 75 | Attention v1 | Attention v2 | Fusion | Anchors | mAP | FPS |
|---|---|---|---|---|---|---|---|
| SSD300 | ✘ | ✘ | ✘ | ✘ | 8732 | 77.7 | 69.7 |
| FASSD300 | ✘ | ✔ | ✘ | ✘ | 8732 | 78.1 | 67.8 |
| FASSD300 | ✘ | ✘ | ✔ | ✘ | 8732 | 78.0 | 58.4 |
| FASSD300 | ✘ | ✘ | ✔ | ✔ | 8732 | 78.2 | 47.8 |
| FASSD300 | ✔ | ✘ | ✘ | ✔ | 31,232 | 78.2 | 54.2 |
| FASSD300 | ✔ | ✔ | ✘ | ✔ | 31,232 | 78.9 | 52.4 |
| FASSD300 | ✔ | ✘ | ✔ | ✔ | 31,232 | 79.3 | 45.3 |
Method | Backbone | mAP | Aero | Bike | Bird | Boat | Bottle | Bus | Car | Cat | Chair | Cow | Table | Dog | Horse | mBike | Person | Plant | Sheep | Sofa | Train | TV |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Faster [36] | VGG | 73.2 | 76.5 | 79.0 | 70.9 | 65.5 | 52.1 | 83.1 | 84.7 | 86.4 | 52.0 | 81.9 | 65.7 | 84.8 | 84.6 | 77.5 | 76.7 | 38.8 | 73.6 | 73.9 | 83.0 | 72.6 |
Faster [37] | Residual-101 | 76.4 | 79.8 | 80.7 | 76.2 | 68.3 | 55.9 | 85.1 | 85.3 | 89.8 | 56.7 | 87.8 | 69.4 | 88.3 | 88.9 | 80.9 | 78.4 | 41.7 | 78.6 | 79.8 | 85.3 | 72.0 |
CSSD300 [38] | VGG | 78.1 | 82.2 | 85.4 | 76.5 | 69.8 | 51.1 | 86.4 | 86.4 | 88.0 | 61.6 | 82.7 | 76.4 | 86.5 | 87.9 | 85.7 | 78.8 | 54.2 | 76.9 | 77.6 | 88.9 | 78.2 |
DSSD321 [5] | Residual-101 | 78.6 | 81.9 | 84.9 | 80.5 | 68.4 | 53.9 | 85.6 | 86.2 | 88.9 | 61.1 | 83.5 | 78.7 | 86.7 | 88.7 | 86.7 | 79.7 | 51.7 | 78.0 | 80.9 | 87.2 | 79.4 |
MDSSD300 [13] | VGG | 78.6 | 86.5 | 87.6 | 78.9 | 70.6 | 55.0 | 86.9 | 87.0 | 88.1 | 58.5 | 84.8 | 73.4 | 84.8 | 89.2 | 88.1 | 78.0 | 52.3 | 78.6 | 74.5 | 86.8 | 80.7 |
SSD300* [4] | VGG | 77.5 | 79.5 | 83.9 | 76.0 | 69.6 | 50.5 | 87.0 | 85.7 | 88.1 | 60.3 | 81.5 | 77.0 | 86.1 | 87.5 | 84.0 | 79.4 | 52.3 | 77.9 | 79.5 | 87.6 | 76.8 |
SSD512* [4] | VGG | 79.5 | 84.8 | 85.1 | 81.5 | 73.0 | 57.8 | 87.8 | 88.3 | 87.4 | 63.5 | 85.4 | 73.2 | 86.2 | 86.7 | 83.9 | 82.5 | 55.6 | 81.7 | 79.0 | 86.6 | 80.0 |
SSD300 | VGG | 77.7 | 82.5 | 83.5 | 75.9 | 70.8 | 49.5 | 85.4 | 86.4 | 88.7 | 61.2 | 82.0 | 78.9 | 85.3 | 86.8 | 84.7 | 79.3 | 54.0 | 75.3 | 79.1 | 86.8 | 78.3 |
FASSD300 | VGG | 79.3 | 86.2 | 84.8 | 77.1 | 75.8 | 54.1 | 85.7 | 87.5 | 89.1 | 61.7 | 85.4 | 77.2 | 86.6 | 88.7 | 86.4 | 79.9 | 54.4 | 79.5 | 80.4 | 88.4 | 76.2 |
Method | Backbone | mAP | FPS | # Proposals | GPU | Input Size |
---|---|---|---|---|---|---|
Faster [36] | VGG | 73.2 | 7 | 6000 | Titan X | ~1000 × 600 |
Faster [37] | Residual-101 | 76.4 | 2.4 | 300 | K40 | ~1000 × 600 |
CSSD300 [38] | VGG | 78.1 | 40.8 | - | Titan X | 300 × 300 |
DSSD321 [5] | Residual-101 | 78.6 | 9.5 | 17,080 | Titan X | 321 × 321 |
DSOD [16] | DS/64-192-48-1 | 77.7 | 17.4 | - | Titan X | 300 × 300 |
FSSD300 [12] | VGG | 78.8 | 65.8 | 8732 | 1080Ti | 300 × 300 |
MDSSD300 [13] | VGG | 78.6 | 38.5 | 31,232 | 1080Ti | 300 × 300 |
RSSD300 [8] | VGG | 78.5 | 35 | 8732 | Titan X | 300 × 300 |
SSD300* [4] | VGG | 77.5 | 46 | 8732 | Titan X | 300 × 300 |
SSD512* [4] | VGG | 79.5 | 19 | 24,564 | Titan X | 512 × 512 |
SSD300 | VGG | 77.7 | 69.7 | 8732 | TITAN RTX | 300 × 300 |
FASSD300 | VGG | 79.3 | 45.3 | 31,232 | TITAN RTX | 300 × 300 |
Method | mAP | FPS |
---|---|---|
SSD300 | 67.3 | 67.6 |
SSD300# | 71.8 | 86.6 |
FASSDv1 | 75.3 | 64.4 |
FASSDv2 | 74.1 | 57.8 |
Method | Transfer Learning | Data Augmentation | mAP |
---|---|---|---|
FASSD300 | ✔ | ✔ | 75.3 |
FASSD300 | ✔ | ✘ | 73.9 |
FASSD300 | ✘ | ✔ | 71.1 |
| FAR | Method | E-Small | Small | Medium | Large | E-Large | All |
|---|---|---|---|---|---|---|---|
| 0.05 | SSD300 | 0.178 | 0.910 | 0.979 | 1 | 1 | 0.419 |
| 0.05 | SSD300# | 0.244 | 0.970 | 0.979 | 1 | 1 | 0.477 |
| 0.05 | FASSDv1 | 0.378 | 0.978 | 0.979 | 1 | 1 | 0.570 |
| 0.05 | FASSDv2 | 0.380 | 0.978 | 0.979 | 1 | 1 | 0.571 |
| 0.10 | SSD300 | 0.289 | 0.955 | 0.979 | 1 | 1 | 0.504 |
| 0.10 | SSD300# | 0.430 | 0.978 | 0.979 | 1 | 1 | 0.605 |
| 0.10 | FASSDv1 | 0.521 | 0.978 | 0.979 | 1 | 1 | 0.667 |
| 0.10 | FASSDv2 | 0.486 | 0.978 | 0.979 | 1 | 1 | 0.643 |
| 0.20 | SSD300 | 0.418 | 0.970 | 0.979 | 1 | 1 | 0.595 |
| 0.20 | SSD300# | 0.538 | 0.985 | 0.979 | 1 | 1 | 0.680 |
| 0.20 | FASSDv1 | 0.608 | 0.978 | 0.979 | 1 | 1 | 0.726 |
| 0.20 | FASSDv2 | 0.592 | 0.978 | 0.979 | 1 | 1 | 0.715 |
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Jiang, D.; Sun, B.; Su, S.; Zuo, Z.; Wu, P.; Tan, X. FASSD: A Feature Fusion and Spatial Attention-Based Single Shot Detector for Small Object Detection. Electronics 2020, 9, 1536. https://doi.org/10.3390/electronics9091536