Scene Complexity: A New Perspective on Understanding the Scene Semantics of Remote Sensing and Designing Image-Adaptive Convolutional Neural Networks
"> Figure 1
Figure 1. Two scenes that are similar in content but differ in semantic meaning. Left: samples of the commercial center class; right: samples of the dense residential class.
Figure 2. Visual memorability evaluation pipeline for sequence testing.
Figure 3. Samples with different scene complexity scores.
Figure 4. Scene complexity scores and clustering results for each class of AID-22. The dataset is divided into three super-classes according to scene complexity: low complexity (6 scene classes), middle complexity (9 scene classes), and high complexity (7 scene classes). Each class contains 360 images of size 256 × 256.
Figure 5. The three modified versions of Inception V1. The receptive fields (RFs) of the last layer before the output are 11 × 11, 47 × 47, and 83 × 83, respectively, whereas the original Inception V1 combines RFs ranging from 11 × 11 to 83 × 83.
Figure 6. The pipeline for predicting the scene complexity of images. The training process consists of three phases: first, a ResNet-18 pretrained on ImageNet is fine-tuned on the constructed scene complexity dataset; second, it is used to extract features; third, the extracted features are used to train an SVM that predicts the scene complexity of images.
Figure 7. The pipeline and CAM calculation. The CAMs represent the contribution of the objects in the input image to the prediction, i.e., to the semantic representation. Warmer regions contribute more to the semantics of an image.
Figure 8. Performance of the different Inception models on scenes of various complexity. Top: simple scenes; middle: moderate-complexity scenes; bottom: high-complexity scenes. The horizontal axis indicates the 1 × 1, 3 × 3, and 5 × 5 variants and the original multi-scale Inception network, and the vertical axis shows the average accuracy of the model on a particular complexity class. To compare accuracy variation across complexity levels on a common scale, the y-axis range is fixed at 0.25.
Figure 9. Performance of the different Inception models on the various scene complexity categories. Warmer colors indicate higher test accuracy; the x-axis represents the scene complexity category, the y-axis the model scale, and the z-axis the test accuracy. The multi-scale Inception model outperforms the other models on high-complexity scenes, but its accuracy on some low-complexity scenes is worse.
Figure 10. Recognition performance of VGG networks of different depths on the various scene categories. The blue dots indicate the results of the VGG-19 model on the dataset, and the red dots indicate the VGG-16 results. The arrow direction shows the change in accuracy for a particular category.
Figure 11. Recognition results on scenes of different complexity for various network depths. The horizontal axis represents the network depth of the models tested on the sets of each complexity class; the vertical axis represents the overall average accuracy.
Figure 12. CAMs of four classes in AID-22. The warmer the color, the greater the contribution of the region to recognition; cooler tones indicate a minimal effect on the prediction result. For example, the port category is determined by water and its surrounding areas.
Figure 13. Top-5 CAMs of a sample. From left to right are the five CAMs with the highest prediction scores. The ground truth (GT) is the beach category.
Figure 14. CAMs of the marina, city building, crossroad, and dam categories, all sampled from the RSI-CB dataset.
Figure 15. CAMs of scenes with different complexity. The top row shows low-complexity scenes, the middle row moderate-complexity scenes, and the bottom row high-complexity scenes. For the farmland scene, the activation distribution covers the whole region, whereas for the baseball field scene the activated areas are concentrated on a few elements.
Figure 16. Proportion of samples whose CAMs exhibit a joint multi-object distribution. The horizontal axis represents the scene type, and the vertical axis the proportion of samples in each category whose CAMs respond to multiple objects.
Figure 17. Occluding an object in a scene. Left: the GT is avenue, which the model identifies correctly; the CAM responds in the forest and road areas. Right: with the forest area occluded, the model incorrectly identifies the scene as river-bridge, and the CAM response is concentrated mainly at the road junction.
Abstract
1. Introduction
- how to measure the inherent properties of images.
- how to analyze the relationship between image properties and features learned from different structures.
- how to make the features learned within the network correspond to the semantic concepts of images for straightforward interpretation.
- We introduce scene complexity and analyze the relationship between remote sensing scenes of different complexity and the scale and hierarchy of feature learning in CNNs.
- We propose a scene complexity measure that integrates scene search difficulty and scene memorability. In addition, we construct the first scene complexity dataset for remote sensing.
- We design a scene complexity prediction framework to adapt different complexity data to the network depth and scale, which effectively improves the downstream model’s recognition accuracy and reduces the number of parameters.
- We visualize and analyze the relationship between semantic concept representation and model feature learning for scenes of different complexity, showing that complex scenes rely on jointly learned multi-object features to support semantic representation.
2. Materials and Methods
2.1. The Scene Complexity Dataset Construction
- Given an image, they were required to answer “yes” or “no” to a question about whether a particular object class was present in the image, or to point to a location when asked to locate a randomly selected object in the image, for example, “Is there an airplane?” or “Where is the house?”;
- The response time to correctly answer the questions about the image was recorded;
- For each image, the average response time for the two types of question, across all the volunteers, was calculated;
- The sum of the search difficulty score and the memorability score is used as the scene complexity score of an image (Figure 3); a minimal sketch of this combination is given after this list.
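The sketch below illustrates the score combination in Python. The min-max normalization and all variable names are assumptions for readability; the text above only specifies that the two components are summed.

```python
import numpy as np

def scene_complexity_scores(search_times, memorability_scores):
    """Combine search difficulty and memorability into a complexity score.

    search_times: average correct-response times per image (a proxy for
                  search difficulty, from the annotation procedure above).
    memorability_scores: per-image memorability component from the
                  sequence-testing game (Figure 2).
    Both components are min-max normalized to [0, 1] before summing;
    the normalization step is an assumption, the text only states the sum.
    """
    search_times = np.asarray(search_times, dtype=float)
    memorability_scores = np.asarray(memorability_scores, dtype=float)

    def minmax(x):
        return (x - x.min()) / (x.max() - x.min() + 1e-12)

    return minmax(search_times) + minmax(memorability_scores)

# Example with three hypothetical images
scores = scene_complexity_scores([1.2, 3.5, 0.8], [0.9, 0.4, 0.7])
```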
2.2. Methods
2.2.1. How to Control the Scale of Learning Feature
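The scale of the learned features is controlled by reducing the Inception module to a single convolution branch (the Inception 1 × 1, 3 × 3, and 5 × 5 variants in Figure 5). The following is a minimal PyTorch sketch of such a single-scale block; the channel sizes and the class name are illustrative assumptions rather than the exact original configuration.

```python
import torch
import torch.nn as nn

class SingleScaleInceptionBlock(nn.Module):
    """Inception-style block reduced to one convolution branch.

    Keeping only the 1x1, 3x3, or 5x5 branch (instead of concatenating
    all of them, as the original Inception V1 does) fixes the scale of
    the features the block can learn. Channel sizes are illustrative.
    """

    def __init__(self, in_channels, out_channels, kernel_size=3):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size,
                      padding=kernel_size // 2),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.branch(x)

# Example: a 3x3-only block applied to a 256 x 256 RGB scene
block = SingleScaleInceptionBlock(3, 64, kernel_size=3)
features = block(torch.randn(1, 3, 256, 256))  # -> (1, 64, 256, 256)
```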
2.2.2. How to Control the Hierarchy of Learning Feature
2.2.3. Designing Image-Adaptive Networks with Scene Complexity
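Following the pipeline summarized in the Figure 6 caption (an ImageNet-pretrained ResNet-18 fine-tuned on the scene complexity dataset, used as a feature extractor for an SVM), the predicted complexity can then select a recognition network of matching depth and scale. The sketch below assumes PyTorch/torchvision and scikit-learn; the omitted fine-tuning step, the dummy training data, and the 224 × 224 input size are illustrative assumptions.

```python
import numpy as np
import torch
import torch.nn as nn
from torchvision import models
from sklearn.svm import SVC

# ResNet-18 feature extractor (the fine-tuning on the scene complexity
# dataset described in Figure 6 is assumed to have been done already).
backbone = models.resnet18(weights="IMAGENET1K_V1")
backbone.fc = nn.Identity()   # expose the 512-d global features
backbone.eval()

def extract_features(images):
    """images: float tensor (N, 3, 224, 224), ImageNet-normalized."""
    with torch.no_grad():
        return backbone(images).numpy()

# Dummy stand-ins for the labeled complexity data (illustrative only):
# complexity super-classes 0 = low, 1 = middle, 2 = high.
train_images = torch.randn(12, 3, 224, 224)
train_labels = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])

# An SVM trained on the extracted features predicts scene complexity.
svm = SVC(kernel="rbf")
svm.fit(extract_features(train_images), train_labels)

# At inference time, the predicted complexity would be used to route the
# image to a recognition network of matching depth and scale.
predicted = svm.predict(extract_features(torch.randn(2, 3, 224, 224)))
```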
2.2.4. Class Activation Mapping and Semantic Representation
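The class activation maps discussed in the captions of Figures 7 and 12–17 follow the standard CAM formulation: for a network whose last convolutional layer is followed by global average pooling and a fully connected classifier, the map for class c is the sum of the last-layer feature maps weighted by that class's classifier weights. A minimal NumPy sketch is given below; the array shapes in the example are illustrative.

```python
import numpy as np

def class_activation_map(feature_maps, fc_weights, class_idx):
    """Standard CAM for a GAP + fully connected classifier.

    feature_maps: (K, H, W) activations of the last conv layer.
    fc_weights:   (num_classes, K) weights of the final FC layer.
    class_idx:    class whose activation map is wanted.

    Returns an (H, W) map; warmer (larger) values mark regions that
    contribute more to the prediction for class_idx.
    """
    cam = np.tensordot(fc_weights[class_idx], feature_maps, axes=([0], [0]))
    cam = np.maximum(cam, 0)             # keep positive evidence only
    return cam / (cam.max() + 1e-12)     # normalize to [0, 1]

# Example with random activations: 512 maps of size 7x7, 22 classes
cam = class_activation_map(np.random.rand(512, 7, 7),
                           np.random.rand(22, 512), class_idx=3)
```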
2.3. Training Details
3. Results
3.1. How the Scale of Feature Learning Influences the Recognition of Scenes with Different Complexity
3.2. How the Hierarchy of Feature Learning Influences the Recognition of Scenes with Different Complexity
3.3. How Adaptive Networks Based on Scene Complexity Improve Model Performance
4. Discussion
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
ConvNet Configuration
VGG-16 | VGG-16*-A | VGG-16*-B | VGG-16*-C | VGG-16*-D |
---|---|---|---|---|
Conv3-64, Conv3-64 (all models) | | | | |
Maxpool (all models) | | | | |
Conv3-128, Conv3-128 (all models) | | | | |
Maxpool (all models) | | | | |
Conv3-256, Conv3-256, Conv3-256 (all models) | | | | |
Maxpool | Maxpool | Maxpool / GAP | Maxpool / GAP | Maxpool / GAP |
\ | FC-1024, FC-6 | FC-1024, FC-6 | FC-1024, FC-6 | FC-1024, FC-6 |
Conv3-512, Conv3-512, Conv3-512 (all models) | | | | |
Maxpool | Maxpool | Maxpool | Maxpool / GAP | Maxpool / GAP |
\ | FC-1024, FC-9 | FC-1024, FC-9 | FC-1024, FC-9 | FC-1024, FC-9 |
Conv3-512, Conv3-512, Conv3-512 (all models) | | | | |
Maxpool | Maxpool | Maxpool | Maxpool | Maxpool / GAP |
FC-4096, FC-4096, FC-22, soft-max | FC-1024, FC-7, soft-max | FC-1024, FC-7, soft-max | FC-1024, FC-7, soft-max | FC-1024, FC-7, soft-max |
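To make the configuration above concrete, the sketch below builds a VGG-16 backbone with classifier branches at different depths: a 6-way head (the low-complexity super-class) after the third convolution block, a 9-way head (middle complexity) after the fourth, and a 7-way head (high complexity) after the fifth, each using global average pooling followed by FC-1024 and a class layer. This is one possible reading of the table, written in PyTorch; the torchvision layer indices and the routing interface are assumptions for illustration.

```python
import torch
import torch.nn as nn
from torchvision import models

class ComplexityBranchedVGG(nn.Module):
    """VGG-16 with classifier branches at different depths.

    A 6-way head is attached after the third conv block, a 9-way head
    after the fourth, and a 7-way head after the fifth; each head uses
    global average pooling, FC-1024, and a class FC layer, following
    the configuration table. Cut points are illustrative.
    """

    def __init__(self):
        super().__init__()
        vgg = models.vgg16(weights=None).features
        self.block1_3 = vgg[:17]    # up to the 3rd maxpool (256 channels)
        self.block4 = vgg[17:24]    # 4th conv block (512 channels)
        self.block5 = vgg[24:31]    # 5th conv block (512 channels)
        self.head_low = self._head(256, 6)
        self.head_mid = self._head(512, 9)
        self.head_high = self._head(512, 7)

    @staticmethod
    def _head(channels, num_classes):
        return nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                             nn.Linear(channels, 1024), nn.ReLU(inplace=True),
                             nn.Linear(1024, num_classes))

    def forward(self, x, complexity):
        x = self.block1_3(x)
        if complexity == "low":
            return self.head_low(x)
        x = self.block4(x)
        if complexity == "middle":
            return self.head_mid(x)
        return self.head_high(self.block5(x))

# Example: route a 256 x 256 image through the middle-complexity branch
logits = ComplexityBranchedVGG()(torch.randn(1, 3, 256, 256), "middle")
```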
Model | GoogLeNet | Inception 1 × 1 | Inception 3 × 3 | Inception 5 × 5 |
---|---|---|---|---|
OA | 0.8329 | 0.6863 | 0.7761 | 0.7815 |
Kappa | 0.8269 | 0.6708 | 0.7651 | 0.7706 |
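For reference, the overall accuracy (OA) and kappa coefficient reported above can be computed from a confusion matrix as in the following sketch; equivalent values are available from sklearn.metrics.accuracy_score and cohen_kappa_score. The toy labels in the example are illustrative.

```python
import numpy as np

def overall_accuracy_and_kappa(y_true, y_pred, num_classes):
    """Compute OA and Cohen's kappa from predicted and true labels.

    Kappa compares the observed agreement p_o with the agreement p_e
    expected by chance from the marginal class frequencies:
        kappa = (p_o - p_e) / (1 - p_e)
    """
    cm = np.zeros((num_classes, num_classes), dtype=float)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    n = cm.sum()
    p_o = np.trace(cm) / n                                  # overall accuracy
    p_e = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n**2    # chance agreement
    return p_o, (p_o - p_e) / (1 - p_e)

oa, kappa = overall_accuracy_and_kappa([0, 1, 2, 1], [0, 1, 1, 1], num_classes=3)
```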
Model | Train Accuracy (%) | Test Accuracy (%) |
---|---|---|
VGG-16 | 96.15 | 94.1 |
VGG-19 | 96.89 | 94.91 |
Model | Train Accuracy (%) | Test Accuracy (%) | Iteration | Parameters |
---|---|---|---|---|
VGG-16 | 96.15 | 94.10 | 200,000 | 134M |
VGG-16*-A | 97.56 | 94.51 | 240,000 | 343M |
VGG-16*-B | 97.89 | 95.42 | 150,000 | 138M |
VGG-16*-C | 98.11 | 95.80 | 100,000 | 37M |
VGG-16*-D | 97.20 | 94.86 | 80,000 | 12M |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Share and Cite
Peng, J.; Mei, X.; Li, W.; Hong, L.; Sun, B.; Li, H. Scene Complexity: A New Perspective on Understanding the Scene Semantics of Remote Sensing and Designing Image-Adaptive Convolutional Neural Networks. Remote Sens. 2021, 13, 742. https://doi.org/10.3390/rs13040742