Dense Connectivity Based Two-Stream Deep Feature Fusion Framework for Aerial Scene Classification
"> Figure 1
<p>A categorization of the methods for aerial scene classification.</p> "> Figure 2
<p>The texture coded two-stream deep architecture. The raw RGB network stream takes RGB images as input. The mapped LBP coded network stream takes mapped LBP images as input. Two differnet sets of features are fused using the two-stream deep feature fusion model.</p> "> Figure 3
<p>The GoogLeNet architecture used in this paper. In the inception module, the convolutions with different sizes can make the network process the input features at different scales.</p> "> Figure 4
<p>The saliency coded two-stream deep architecture. The raw RGB network stream takes RGB images as input. The saliency coded network stream takes the processed images through saliency detection as input. Two different sets of features are fused using the two-stream deep feature fusion model.</p> "> Figure 5
<p>The overall architecture of the two-stream deep feature fusion model.</p> "> Figure 6
<p>The architecture of the basic module in the two-stream deep feature fusion model.</p> "> Figure 7
<p>The detailed construction of the two-stream deep feature fusion model.</p> "> Figure 8
<p>Class representatives of the UC-Merced dataset: (1) Agricultural; (2) Airplane; (3) Baseball diamond; (4) Beach; (5) Buildings; (6) Chaparral; (7) Dense residential; (8) Forest; (9) Freeway; (10) Golf course; (11) Harbor; (12) Intersection; (13) Medium residential; (14) Mobile home park; (15) Overpass; (16) Parking lot; (17) River; (18) Runway; (19) Sparse residential; (20) Storage tanks; (21) Tennis court.</p> "> Figure 9
<p>Class representatives of the Aerial Image Dataset (AID) dataset: (1) Airport; (2) Bare land; (3) Baseball field; (4) Beach; (5) Bridge; (6) Center; (7) Church; (8) Commercial; (9) Dense residential; (10) Desert; (11) Farmland; (12) Forest; (13) Industrial; (14) Meadow; (15) Medium residential; (16) Mountain; (17) Park; (18) Parking; (19) Playground; (20) Pond; (21) Port; (22) Railway station; (23) Resort; (24) River; (25) School; (26) Sparse residential; (27) Square; (28) Stadium; (29) Storage tanks; (30) Viaduct.</p> "> Figure 10
<p>Class representatives of the NWPU-RESISC45 dataset: (1) Airplane; (2) Airport; (3) Baseball diamond; (4) Basketball court; (5) Beach; (6) Bridge; (7) Chaparral; (8) Church; (9) Circular farmland; (10) Cloud; (11) Commercial area; (12) Dense residential; (13) Desert; (14) Forest; (15) Freeway; (16) Golf course; (17) Ground track field; (18) Harbor; (19) Industrial area; (20) Intersection; (21) Island; (22) Lake; (23) Meadow; (24) Medium residential; (25) Mobile home park; (26) Mountain; (27) Overpass; (28) Palace; (29) Parking lot; (30) Railway; (31) Railway station; (32) Rectangular farmland; (33) River; (34) Roundabout; (35) Runway; (36) Sea ice; (37) Ship; (38) Snowberg; (39) Sparse residential; (40) Stadium; (41) Storage tank; (42) Tennis court; (43) Terrace; (44) Thermal power station; (45) Wetland.</p> "> Figure 11
<p>Confusion matrixes of the UC-Merced dataset under the training ratio of 80% using the following methods. (<b>a</b>) texture coded two-stream architecture with our proposed fusion model; (<b>b</b>) saliency coded two-stream architecture with our proposed fusion model.</p> "> Figure 12
<p>Confusion matrixes of the AID dataset under the training ratio of 50% using the following methods. (<b>a</b>) texture coded two-stream architecture with our proposed fusion model; (<b>b</b>) saliency coded two-stream architecture with our proposed fusion model.</p> "> Figure 13
<p>Confusion matrixes of the NWPU-RESISC45 dataset under the training ratio of 20% using the following methods. (<b>a</b>) texture coded two-stream architecture with our proposed fusion model; (<b>b</b>) saliency coded two-stream architecture with our proposed fusion model.</p> ">
Abstract
1. Introduction
2. Related Works
2.1. Methods for Aerial Scene Classification
- (1) The approaches in the first group use pre-trained CNNs to extract features from the input aerial scene images. In this group, all CNN models are pre-trained on the ImageNet 2012 dataset [49]. Penatti et al. [45] demonstrated the generalization ability of CNN models, i.e., OverFeat [50] and CaffeNet [20], in the scenario of aerial scene classification. Recently, two large-scale aerial scene image datasets, the Aerial Image Dataset (AID) [51] and Northwestern Polytechnical University-REmote Sensing Image Scene Classification 45 (NWPU-RESISC45) [52], have been constructed, and on them pre-trained CNN models achieved higher accuracies than the methods based on low- and mid-level features.
- (2) The methods in the second group train CNN models from scratch: the filter weights are randomly initialized and then trained on the target remote sensing scene datasets. In [53,54], the authors investigated the performance of trained-from-scratch CNN models for aerial scene classification, and the results showed that such models suffer a drop in classification accuracy compared with pre-trained CNN models. The authors attributed this relatively low performance mainly to the limited training data.
- (3) The methods in the third group fine-tune pre-trained CNN models on the target aerial scene datasets and use the fine-tuned architectures to extract features for classification. In [52,53,54], the authors fine-tuned several widely used CNN models, and the experimental results indicated that the fine-tuning strategy yields much higher classification accuracies than both the full-training strategy and the strategy of using pre-trained CNN models as feature extractors. In addition, Liu et al. [55] proposed a scene classification method based on triplet networks, which were pre-trained on ImageNet [49] and then fine-tuned on the target datasets. Their triplet networks reported state-of-the-art performance in aerial scene classification tasks.
- (4) The methods in the fourth group generate the final image representation by reprocessing the features extracted from pre-trained CNN models. In [56], multiscale dense convolutional features extracted from pre-trained CNNs were fed into four main stages, i.e., visual word encoding, correlogram extraction, correlation encoding, and classification, and state-of-the-art results were achieved. Cheng et al. [57] proposed a feature representation method for remote sensing image scene classification, termed bag of convolutional features (BoCF), which encodes the convolutional feature descriptors. Moreover, Yuan et al. [58] encoded the extracted deep features with the locality-constrained affine subspace coding (LASC) method, which yields more discriminative representations than features taken directly from CNN models. In addition, Liu et al. [59] concatenated the extracted convolutional features to generate deep local descriptors and then applied a feature encoding method, i.e., Fisher encoding with Gaussian mixture model (GMM) clustering, to these descriptors. In another work, Liu et al. [60] proposed a linear transformation of deep features, in which the discriminative convolution filter (DCF) learning approach was applied to local patches obtained from the raw deep features.
- (5) The methods in the fifth group use fusion technologies for aerial scene classification. Anwer et al. [61] constructed a texture coded two-stream deep architecture that fuses raw RGB features and texture coded features; however, its fusion is based on the conventional concatenation strategy. Chaib et al. [62] proposed using discriminant correlation analysis (DCA) to process the two extracted sets of features and then combining the processed features through conventional fusion strategies, i.e., concatenation and addition. Li et al. [63] used a PCA/spectral regression kernel discriminant analysis (SRKDA) method to fuse multiscale convolutional features, encoded by a multiscale improved Fisher kernel coding method, with the features of fully connected layers. Ye et al. [64] utilized features from intermediate layers and built a parallel multi-stage (PMS) architecture formed by three sub-models, i.e., the low CNN, the middle CNN, and the high CNN. The study in [65] adaptively combined features from lower layers and fully connected layers, where the feature fusion was performed via a linear combination instead of concatenation. Their classification results all showed significant advantages over the "stand-alone" approaches. At the same time, score-level fusion approaches [11,66] have been proposed for aerial scene classification, which also achieve impressive performance on the publicly available remote sensing scene datasets. (A minimal sketch of two-stream feature extraction with pre-trained CNNs and of the two conventional fusion strategies follows this list.)
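As a concrete illustration of groups (1) and (5), the sketch below extracts one feature vector per image from each of two network streams with a pre-trained backbone and fuses them by concatenation or element-wise addition before a linear classifier. It is a minimal sketch under stated assumptions, not the implementation used in this paper: InceptionV3 serves only as a readily available stand-in for the GoogLeNet backbone, and the preprocessed image arrays and labels (x_rgb_train, x_coded_train, y_train) are assumed to be prepared elsewhere.

```python
# Hypothetical sketch: two-stream deep feature extraction with a pre-trained
# backbone, followed by conventional feature-level fusion (concatenation or
# element-wise addition) and a linear SVM classifier.
import numpy as np
import tensorflow as tf
from sklearn.svm import LinearSVC

def build_stream():
    # Global-average-pooled activations serve as the per-image feature vector.
    return tf.keras.applications.InceptionV3(
        weights="imagenet", include_top=False, pooling="avg")

rgb_stream = build_stream()    # fed with raw RGB images
coded_stream = build_stream()  # fed with mapped-LBP or saliency-coded images

def extract(stream, images):
    # images: float32 array of shape (N, 299, 299, 3), already preprocessed.
    return stream.predict(images, verbose=0)

def fuse(f_rgb, f_coded, mode="concat"):
    if mode == "concat":                    # concatenation fusion
        return np.concatenate([f_rgb, f_coded], axis=1)
    return f_rgb + f_coded                  # addition fusion (same dimensionality)

# Assumed inputs prepared elsewhere: x_rgb_train, x_coded_train, y_train.
# fused_train = fuse(extract(rgb_stream, x_rgb_train),
#                    extract(coded_stream, x_coded_train))
# clf = LinearSVC().fit(fused_train, y_train)
```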
2.2. Feature Fusion
3. Proposed Architecture
3.1. Mapping LBP Codes
3.2. Saliency Detection
3.3. Two-Stream Deep Architecture
3.3.1. Texture Coded Two-Stream Deep Architecture
3.3.2. Saliency Coded Two-Stream Deep Architecture
3.4. Two-Stream Deep Feature Fusion Model
4. Experimental Design and Results
4.1. Description of the Utilized Datasets
4.1.1. UC-Merced Dataset
4.1.2. AID Dataset
4.1.3. NWPU-RESISC45 Dataset
4.2. Experimental Setup
4.3. Experimental Results and Analysis
4.3.1. Comparison with the Baseline Methods
- (1) For the concatenation fusion strategy, the classification accuracies of the saliency coded two-stream architecture reach 94.46%, 96.17%, 89.15%, 91.25%, 79.69%, and 81.46% on the UC-Merced dataset (50% and 80% training samples), the AID dataset (20% and 50% training samples), and the NWPU-RESISC45 dataset (10% and 20% training samples), respectively, which are higher than the corresponding results of the texture coded two-stream architecture, i.e., 94.31%, 95.71%, 88.46%, 90.15%, 79.58%, and 81.13%. The saliency coded two-stream architecture also outperforms the texture coded two-stream architecture under the other two fusion strategies. The main reason is that saliency detection focuses attention on the image regions that are most informative and dominate the class; therefore, the features extracted from the saliency coded network stream are more informative and discriminative than those from the mapped LBP coded network stream.
- (2) For the texture coded two-stream architecture, the classification accuracies obtained with our proposed feature fusion strategy reach 97.55%, 98.40%, 93.31%, 95.17%, 84.77%, and 86.36% on the aforementioned datasets, respectively, which are higher than the 95.25%, 96.62%, 88.56%, 90.29%, 79.63%, and 81.22% obtained with the addition fusion strategy. The concatenation fusion strategy gives the lowest classification accuracies, i.e., 94.31%, 95.71%, 88.46%, 90.15%, 79.58%, and 81.13%. The same ordering holds for the saliency coded two-stream architecture: the concatenation fusion strategy ranks third, the addition fusion strategy ranks second, and our proposed fusion model achieves the highest accuracies. These comparisons show that our proposed deep feature fusion model enhances the representational ability of the extracted features and improves the classification performance (an illustrative dense-connectivity-style fusion sketch, contrasted with the two conventional strategies, follows this list).
- (3) The saliency coded two-stream deep architecture with our proposed feature fusion model achieves the highest accuracies, outperforming all the other baseline approaches on the three datasets.
- (4) The overall accuracies on the UC-Merced dataset are almost saturated because this dataset is not very rich in image variations. The AID and NWPU-RESISC45 datasets have higher intra-class variation and smaller inter-class dissimilarity; therefore, they remain more challenging.
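For contrast with the concatenation and addition baselines above, the following is an illustrative sketch of a densely connected fusion head in the spirit of DenseNet-style dense connectivity [81]. It is an assumption for illustration only, not the exact fusion model proposed in this paper (whose construction is depicted in Figures 5–7); the layer widths, block count, and use of batch normalization and ReLU are illustrative choices.

```python
# Illustrative only: a small fusion head in which every fully connected block
# receives the concatenation of both stream features and all preceding block
# outputs (DenseNet-style dense connectivity [81]). Not the authors' exact model.
import tensorflow as tf

def dense_fusion_head(feat_dim_rgb, feat_dim_coded, num_classes,
                      num_blocks=3, units=512):
    f_rgb = tf.keras.Input(shape=(feat_dim_rgb,), name="rgb_features")
    f_coded = tf.keras.Input(shape=(feat_dim_coded,), name="coded_features")
    collected = [f_rgb, f_coded]
    for _ in range(num_blocks):
        x = tf.keras.layers.Concatenate()(collected)  # dense connectivity
        x = tf.keras.layers.Dense(units)(x)
        x = tf.keras.layers.BatchNormalization()(x)
        x = tf.keras.layers.ReLU()(x)
        collected.append(x)                           # visible to all later blocks
    fused = tf.keras.layers.Concatenate()(collected)
    logits = tf.keras.layers.Dense(num_classes)(fused)
    return tf.keras.Model([f_rgb, f_coded], logits)

# Example: a fusion head for 1024-dimensional features from each stream on
# the 21-class UC-Merced dataset.
# model = dense_fusion_head(1024, 1024, num_classes=21)
```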
4.3.2. Confusion Analysis
- (1) On the UC-Merced dataset (see Figure 11), most scene categories achieve a classification accuracy close to or equal to 1. In the confusion matrix of the texture coded two-stream architecture with our feature fusion model, the categories with accuracy below 1 are agriculture (0.95), dense residential (0.95), intersection (0.95), medium residential (0.95), river (0.9), and tennis court (0.95). In the confusion matrix of the saliency coded two-stream architecture with our feature fusion model, the categories with accuracy below 1 are buildings (0.95), dense residential (0.9), golf course (0.95), and tennis court (0.95). The saliency coded two-stream architecture with our proposed fusion model therefore performs better overall: for example, the agriculture, intersection, medium residential, and river scenes, which are confused by the texture coded two-stream architecture, are fully recognized by the saliency coded two-stream architecture. (A minimal sketch of how these row-normalized per-class rates are computed follows this list.)
- (2) On the AID dataset (see Figure 12), our methods achieve a classification accuracy above 0.95 for most scene categories, including bare land (a: 0.96, b: 0.96), baseball field (a: 1, b: 0.99), beach (a: 0.98, b: 1), bridge (a: 0.97, b: 0.98), desert (a: 0.97, b: 0.97), farmland (a: 0.96, b: 0.98), forest (a: 0.99, b: 0.99), meadow (a: 0.99, b: 0.99), mountain (a: 0.99, b: 0.99), parking (a: 0.99, b: 1), playground (a: 0.98, b: 0.98), pond (a: 0.98, b: 0.99), sparse residential (a: 1, b: 1), port (a: 0.97, b: 0.98), river (a: 0.96, b: 0.99), stadium (a: 0.97, b: 0.99), storage tanks (a: 0.98, b: 0.96), and viaduct (a: 0.99, b: 1). Our methods also perform relatively well on scene categories that are difficult to recognize. For instance, the multilevel fusion method [66] reports low accuracies for school (0.77), square (0.80), resort (0.74), and center (0.80). In Figure 12a, our method improves these four categories to school (0.83), square (0.82), resort (0.81), and center (0.85); in Figure 12b, the corresponding results are school (0.83), square (0.85), resort (0.80), and center (0.89).
- (3) On the NWPU-RESISC45 dataset (see Figure 13), compared with the confusion matrix of BoCF (VGGNet-16) [57], our approaches provide consistent improvements on most scene categories, e.g., commercial area (BoCF: 0.68, a: 0.76, b: 0.76), freeway (BoCF: 0.69, a: 0.73, b: 0.81), tennis court (BoCF: 0.57, a: 0.72, b: 0.74), and palace (BoCF: 0.46, a: 0.61, b: 0.53).
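The per-class rates quoted above are the diagonal entries of row-normalized confusion matrices. A minimal sketch of how such a matrix is computed from true and predicted labels is shown below; the function and variable names are illustrative, not taken from this paper.

```python
# Minimal sketch: build a row-normalized confusion matrix so that entry
# C[i, i] is the per-class accuracy quoted in the analysis above.
import numpy as np

def normalized_confusion_matrix(y_true, y_pred, num_classes):
    cm = np.zeros((num_classes, num_classes), dtype=np.float64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1.0
    row_sums = cm.sum(axis=1, keepdims=True)
    return cm / np.maximum(row_sums, 1.0)  # each row sums to 1 (or stays 0)

# Example with 3 classes: class 0 is recognized 2 times out of 3.
cm = normalized_confusion_matrix([0, 0, 0, 1, 2], [0, 0, 1, 1, 2], 3)
print(np.diag(cm))  # per-class accuracies: [0.667, 1.0, 1.0]
```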
5. Discussion
- (1) On the UC-Merced dataset (see Table 5), our best architecture outperforms all the other aerial scene classification approaches, with overall accuracy gains of 1.54% and 0.57% over the second best methods, Aggregate strategy 2 [59] and SHHTFM [95], under the 50% and 80% training ratios, respectively.
- (2) On the AID dataset (see Table 6), our best architecture performs better than the state-of-the-art approaches, generally by rather large margins. Our accuracies exceed the second best results, i.e., TEX-Net-LF [61] and Multilevel fusion [66], by 3.22% and 1.82% under the training ratios of 20% and 50%, respectively.
- (3) On the NWPU-RESISC45 dataset (see Table 7), our best architecture remarkably improves on the state-of-the-art results. Specifically, it gains overall accuracy improvements of 2.37% and 2.69% over the second best method, BoCF (VGGNet-16) [57], using 10% and 20% labeled samples per class for training, respectively.
6. Conclusions
Author Contributions
Funding
Conflicts of Interest
References
- Faisal, A.; Kafy, A.; Roy, S. Integration of Remote Sensing and GIS Techniques for Flood Monitoring and Damage Assessment: A Case Study of Naogaon District, Bangladesh. J. Remote Sens. GIS 2018, 7, 2. [Google Scholar] [CrossRef]
- Bi, S.; Lin, X.; Wu, Z.; Yang, S. Development technology of principle prototype of high-resolution quantum remote sensing imaging. In Quantum Sensing and Nano Electronics and Photonics XV; International Society for Optics and Photonics: Bellingham, WA, USA, 2018; Volume 10540, p. 105400Q. [Google Scholar]
- Weng, Q.; Quattrochi, D.; Gamba, P.E. Urban Remote Sensing; CRC Press: Boca Raton, FL, USA, 2018. [Google Scholar]
- Mukherjee, A.B.; Krishna, A.P.; Patel, N. Application of Remote Sensing Technology, GIS and AHP-TOPSIS Model to Quantify Urban Landscape Vulnerability to Land Use Transformation. In Information and Communication Technology for Sustainable Development; Springer: Singapore, 2018; pp. 31–40. [Google Scholar]
- Yang, Y.; Newsam, S. Geographic image retrieval using local invariant features. IEEE Trans. Geosci. Remote Sens. 2013, 51, 818–832. [Google Scholar] [CrossRef]
- Zheng, X.; Sun, X.; Fu, K.; Wang, H. Automatic annotation of satellite images via multifeature joint sparse coding with spatial relation constraint. IEEE Geosci. Remote Sens. Lett. 2013, 10, 652–656. [Google Scholar] [CrossRef]
- Hu, J.; Xia, G.S.; Hu, F.; Zhang, L. A comparative study of sampling analysis in the scene classification of optical high-spatial resolution remote sensing imagery. Remote Sens. 2015, 7, 14988–15013. [Google Scholar] [CrossRef]
- Ammour, N.; Bashmal, L.; Bazi, Y.; Al Rahhal, M.; Zuair, M. Asymmetric Adaptation of Deep Features for Cross-Domain Classification in Remote Sensing Imagery. IEEE Geosci. Remote Sens. Lett. 2018, 15, 597–601. [Google Scholar] [CrossRef]
- Alhichri, H.; Othman, E.; Zuair, M.; Ammour, N.; Bazi, Y. Tile-Based Semisupervised Classification of Large-Scale VHR Remote Sensing Images. J. Sens. 2018, 2018. [Google Scholar] [CrossRef]
- Banerjee, B.; Chaudhuri, S. Scene Recognition From Optical Remote Sensing Images Using Mid-Level Deep Feature Mining. IEEE Geosci. Remote Sens. Lett. 2018, 15. [Google Scholar] [CrossRef]
- Minetto, R.; Segundo, M.P.; Sarkar, S. Hydra: An Ensemble of Convolutional Neural Networks for Geospatial Land Classification. arXiv, 2018; arXiv:1802.03518. [Google Scholar]
- Yang, Y.; Newsam, S. Comparing SIFT descriptors and Gabor texture features for classification of remote sensed imagery. In Proceedings of the 15th IEEE International Conference on Image Processing (ICIP 2008), San Diego, CA, USA, 12–15 October 2008; pp. 1852–1855. [Google Scholar]
- Dos Santos, J.A.; Penatti, O.A.B.; da Silva Torres, R. Evaluating the Potential of Texture and Color Descriptors for Remote Sensing Image Retrieval and Classification. In Proceedings of the Fifth International Conference on Computer Vision Theory and Applications, Angers, France, 17–21 May 2010; Volume 2, pp. 203–208. [Google Scholar]
- Zhao, L.; Tang, P.; Huo, L. A 2-D wavelet decomposition-based bag-of-visual-words model for land-use scene classification. Int. J. Remote Sens. 2014, 35, 2296–2310. [Google Scholar]
- Chen, S.; Tian, Y. Pyramid of spatial relatons for scene-level land use classification. IEEE Trans. Geosci. Remote Sens. 2015, 53, 1947–1957. [Google Scholar] [CrossRef]
- Hu, F.; Xia, G.S.; Hu, J.; Zhang, L. Transferring deep convolutional neural networks for the scene classification of high-resolution remote sensing imagery. Remote Sens. 2015, 7, 14680–14707. [Google Scholar] [CrossRef]
- Luus, F.P.; Salmon, B.P.; Van den Bergh, F.; Maharaj, B.T.J. Multiview deep learning for land-use classification. IEEE Geosci. Remote Sens. Lett. 2015, 12, 2448–2452. [Google Scholar] [CrossRef]
- Chen, J.; Wang, C.; Ma, Z.; Chen, J.; He, D.; Ackland, S. Remote Sensing Scene Classification Based on Convolutional Neural Networks Pre-Trained Using Attention-Guided Sparse Filters. Remote Sens. 2018, 10, 290. [Google Scholar] [CrossRef]
- Chew, R.F.; Amer, S.; Jones, K.; Unangst, J.; Cajka, J.; Allpress, J.; Bruhn, M. Residential scene classification for gridded population sampling in developing countries using deep convolutional neural networks on satellite imagery. Int. J. Health Geogr. 2018, 17, 12. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Jia, Y.; Shelhamer, E.; Donahue, J.; Karayev, S.; Long, J.; Girshick, R.; Guadarrama, S.; Darrell, T. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia, Orlando, FL, USA, 3–7 November 2014; pp. 675–678. [Google Scholar]
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet Classification with Deep Convolutional Neural Networks; Neural Information Processing Systems Foundation, Inc.: La Jolla, CA, USA, 2012; pp. 1097–1105. [Google Scholar]
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv, 2014; arXiv:1409.1556. [Google Scholar]
- Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’15), Boston, MA, USA, 7–12 June 2015. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’16), Las Vegas Valley, NV, USA, 26 June–1 July 2016; pp. 770–778. [Google Scholar]
- Zhu, X.X.; Tuia, D.; Mou, L.; Xia, G.S.; Zhang, L.; Xu, F.; Fraundorfer, F. Deep Learning in Remote Sensing: A Comprehensive Review and List of Resources. IEEE Geosci. Remote Sens. Mag. 2017, 5, 8–36. [Google Scholar] [CrossRef] [Green Version]
- Ball, J.E.; Anderson, D.T.; Chan, C.S. Comprehensive survey of deep learning in remote sensing: Theories, tools, and challenges for the community. J. Appl. Remote Sens. 2017, 11, 042609. [Google Scholar] [CrossRef]
- Li, Y.; Zhang, H.; Xue, X.; Jiang, Y.; Shen, Q. Deep learning for remote sensing image classification: A survey. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2018, e1264. [Google Scholar]
- Ojala, T.; Pietikainen, M.; Maenpaa, T. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 971–987. [Google Scholar] [CrossRef] [Green Version]
- Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
- Swain, M.J.; Ballard, D.H. Color indexing. Int. J. Comput. Vis. 1991, 7, 11–32. [Google Scholar] [CrossRef]
- Oliva, A.; Torralba, A. Modeling the shape of the scene: A holistic representation of the spatial envelope. Int. J. Comput. Vis. 2001, 42, 145–175. [Google Scholar] [CrossRef]
- Luo, B.; Jiang, S.; Zhang, L. Indexing of remote sensing images with different resolutions by multiple features. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2013, 6, 1899–1912. [Google Scholar] [CrossRef]
- Yang, Y.; Newsam, S. Bag-of-visual-words and spatial extensions for land-use classification. In Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, San Jose, CA, USA, 2–5 November 2010; pp. 270–279. [Google Scholar]
- Yang, Y.; Newsam, S. Spatial pyramid co-occurrence for image classification. In Proceedings of the 2011 IEEE International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 1465–1472. [Google Scholar]
- Shao, W.; Yang, W.; Xia, G.S.; Liu, G. A hierarchical scheme of multiple feature fusion for high-resolution satellite scene categorization. In Computer Vision Systems, Proceedings of the 9th International Conference, ICVS 2013, St. Petersburg, Russia, 16–18 July 2013; Springer: Berlin/Heidelberg, Germany, 2013; pp. 324–333. [Google Scholar]
- Zhao, L.J.; Tang, P.; Huo, L.Z. Land-use scene classification using a concentric circle-structured multiscale bag-of-visual-words model. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2014, 7, 4620–4631. [Google Scholar] [CrossRef]
- Sridharan, H.; Cheriyadat, A. Bag of lines (BoL) for improved aerial scene representation. IEEE Geosci. Remote Sens. Lett. 2015, 12, 676–680. [Google Scholar] [CrossRef]
- Hu, J.; Jiang, T.; Tong, X.; Xia, G.S.; Zhang, L. A benchmark for scene classification of high spatial resolution remote sensing imagery. In Proceedings of the 2015 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Milan, Italy, 26–31 July 2015; pp. 5003–5006. [Google Scholar]
- Lazebnik, S.; Schmid, C.; Ponce, J. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), New York, NY, USA, 17–22 June 2006; Volume 2, pp. 2169–2178. [Google Scholar]
- Wang, J.; Yang, J.; Yu, K.; Lv, F.; Huang, T.; Gong, Y. Locality-constrained linear coding for image classification. In Proceedings of the 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR’10), San Francisco, CA, USA, 13–18 June 2010; pp. 3360–3367. [Google Scholar]
- Bosch, A.; Zisserman, A.; Muñoz, X. Scene classification via pLSA. In Computer Vision—ECCV 2006, Proceedings of the 9th European Conference on Computer Vision, Graz, Austria, 7–13 May 2006; Springer: Berlin/Heidelberg, Germany, 2006; pp. 517–530. [Google Scholar]
- Blei, D.M.; Ng, A.Y.; Jordan, M.I. Latent dirichlet allocation. J. Mach. Learn. Res. 2003, 3, 993–1022. [Google Scholar]
- Perronnin, F.; Sánchez, J.; Mensink, T. Improving the fisher kernel for large-scale image classification. In Computer Vision—ECCV 2010, Proceedings of the 11th European Conference on Computer Vision, Heraklion, Crete, Greece, 5–11 September 2010; Springer: Berlin/Heidelberg, Germany, 2010; pp. 143–156. [Google Scholar]
- Jegou, H.; Perronnin, F.; Douze, M.; Sánchez, J.; Perez, P.; Schmid, C. Aggregating local image descriptors into compact codes. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 34, 1704–1716. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Penatti, O.A.; Nogueira, K.; dos Santos, J.A. Do deep features generalize from everyday objects to remote sensing and aerial scenes domains? In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Boston, MA, USA, 7 June 2015; pp. 44–81. [Google Scholar]
- Zhang, F.; Du, B.; Zhang, L. Scene classification via a gradient boosting random convolutional network framework. IEEE Trans. Geosci. Remote Sens. 2016, 54, 1793–1802. [Google Scholar] [CrossRef]
- Aryal, J.; Dutta, R. Smart city and geospatiality: Hobart deeply learned. In Proceedings of the 2015 31st IEEE International Conference on Data Engineering Workshops (ICDEW), Seoul, Korea, 13–17 April 2015; pp. 108–109. [Google Scholar]
- Dutta, R.; Aryal, J.; Das, A.; Kirkpatrick, J.B. Deep cognitive imaging systems enable estimation of continental-scale fire incidence from climate data. Sci. Rep. 2013, 3, 3188. [Google Scholar] [CrossRef] [PubMed]
- Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. 2015, 115, 211–252. [Google Scholar] [CrossRef]
- Sermanet, P.; Eigen, D.; Zhang, X.; Mathieu, M.; Fergus, R.; LeCun, Y. Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv, 2013; arXiv:1312.6229. [Google Scholar]
- Xia, G.S.; Hu, J.; Hu, F.; Shi, B.; Bai, X.; Zhong, Y.; Zhang, L.; Lu, X. AID: A benchmark data set for performance evaluation of aerial scene classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 3965–3981. [Google Scholar] [CrossRef]
- Cheng, G.; Han, J.; Lu, X. Remote sensing image scene classification: benchmark and state of the art. Proc. IEEE 2017, 105, 1865–1883. [Google Scholar] [CrossRef]
- Castelluccio, M.; Poggi, G.; Sansone, C.; Verdoliva, L. Land use classification in remote sensing images by convolutional neural networks. arXiv, 2015; arXiv:1508.00092. [Google Scholar]
- Nogueira, K.; Penatti, O.A.; dos Santos, J.A. Towards better exploiting convolutional neural networks for remote sensing scene classification. Pattern Recognit. 2017, 61, 539–556. [Google Scholar] [CrossRef] [Green Version]
- Liu, Y.; Huang, C. Scene classification via triplet networks. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2018, 11, 220–237. [Google Scholar] [CrossRef]
- Qi, K.; Yang, C.; Guan, Q.; Wu, H.; Gong, J. A Multiscale Deeply Described Correlatons-Based Model for Land-Use Scene Classification. Remote Sens. 2017, 9, 917. [Google Scholar] [CrossRef]
- Cheng, G.; Li, Z.; Yao, X.; Guo, L.; Wei, Z. Remote sensing image scene classification using bag of convolutional features. IEEE Geosci. Remote Sens. Lett. 2017, 14, 1735–1739. [Google Scholar] [CrossRef]
- Yuan, B.; Li, S.; Li, N. Multiscale deep features learning for land-use scene recognition. J. Appl. Remote Sens. 2018, 12, 015010. [Google Scholar] [CrossRef]
- Liu, N.; Wan, L.; Zhang, Y.; Zhou, T.; Huo, H.; Fang, T. Exploiting Convolutional Neural Networks with Deeply Local Description for Remote Sensing Image Classification. IEEE Access 2018, 6, 11215–11228. [Google Scholar] [CrossRef]
- Liu, N.; Lu, X.; Wan, L.; Huo, H.; Fang, T. Improving the Separability of Deep Features with Discriminative Convolution Filters for RSI Classification. ISPRS Int. J. Geo-Inf. 2018, 7, 95. [Google Scholar] [CrossRef]
- Anwer, R.M.; Khan, F.S.; van de Weijer, J.; Molinier, M.; Laaksonen, J. Binary patterns encoded convolutional neural networks for texture recognition and remote sensing scene classification. arXiv, 2017; arXiv:1706.01171. [Google Scholar] [CrossRef]
- Chaib, S.; Liu, H.; Gu, Y.; Yao, H. Deep feature fusion for VHR remote sensing scene classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 4775–4784. [Google Scholar] [CrossRef]
- Li, E.; Xia, J.; Du, P.; Lin, C.; Samat, A. Integrating Multilayer Features of Convolutional Neural Networks for Remote Sensing Scene Classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 5653–5665. [Google Scholar] [CrossRef]
- Ye, L.; Wang, L.; Sun, Y.; Zhao, L.; Wei, Y. Parallel multi-stage features fusion of deep convolutional neural networks for aerial scene classification. Remote Sens. Lett. 2018, 9, 295–304. [Google Scholar] [CrossRef]
- Liu, Y.; Liu, Y.; Ding, L. Scene Classification Based on Two-Stage Deep Feature Fusion. IEEE Geosci. Remote Sens. Lett. 2018, 15, 183–186. [Google Scholar] [CrossRef]
- Yu, Y.; Liu, F. Aerial Scene Classification via Multilevel Fusion Based on Deep Convolutional Neural Networks. IEEE Geosci. Remote Sens. Lett. 2018, 15, 287–291. [Google Scholar] [CrossRef]
- Chowdhury, A.R.; Lin, T.Y.; Maji, S.; Learned-Miller, E. One-to-many face recognition with bilinear cnns. arXiv, 2015; arXiv:1506.01342. [Google Scholar]
- Lin, T.Y.; RoyChowdhury, A.; Maji, S. Bilinear cnn models for fine-grained visual recognition. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1449–1457. [Google Scholar]
- Feichtenhofer, C.; Pinz, A.; Zisserman, A. Convolutional two-stream network fusion for video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’16), Las Vegas Valley, NV, USA, 26 June–1 July 2016. [Google Scholar]
- Park, E.; Han, X.; Berg, T.L.; Berg, A.C. Combining multiple sources of knowledge in deep cnns for action recognition. In Proceedings of the 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Placid, NY, USA, 7–9 March 2016; pp. 1–8. [Google Scholar]
- Wu, Z.; Wang, X.; Jiang, Y.G.; Ye, H.; Xue, X. Modeling spatial-temporal clues in a hybrid deep learning framework for video classification. In Proceedings of the 23rd ACM international conference on Multimedia, Brisbane, Australia, 26–30 October 2015; pp. 461–470. [Google Scholar]
- Bodla, N.; Zheng, J.; Xu, H.; Chen, J.C.; Castillo, C.; Chellappa, R. Deep heterogeneous feature fusion for template-based face recognition. In Proceedings of the 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), Santa Rosa, CA, USA, 24–31 March 2017; pp. 586–595. [Google Scholar]
- Levi, G.; Hassner, T. Emotion recognition in the wild via convolutional neural networks and mapped binary patterns. In Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, Seattle, WA, USA, 9–13 November 2015; pp. 503–510. [Google Scholar]
- Borg, I.; Groenen, P.J. Modern Multidimensional Scaling: Theory and Applications; Springer Science & Business Media: Berlin, Germany, 2005. [Google Scholar]
- Seber, G.A. Multivariate Observations; John Wiley & Sons: Hoboken, NJ, USA, 2009; Volume 252. [Google Scholar]
- Rubner, Y.; Tomasi, C.; Guibas, L.J. The earth mover’s distance as a metric for image retrieval. Int. J. Comput. Vis. 2000, 40, 99–121. [Google Scholar] [CrossRef]
- Moore, T.; Zirnsak, M. Neural mechanisms of selective visual attention. Annual Rev. Psychol. 2017, 68, 47–72. [Google Scholar] [CrossRef] [PubMed]
- Bauer, A.; Schneider, S.; Waldorf, M.; Braks, K.; Huber, T.J.; Adolph, D.; Vocks, S. Selective visual attention towards oneself and associated state body satisfaction: An eye-tracking study in adolescents with different types of eating disorders. J. Abnormal Child. Psychol. 2017, 45, 1647–1661. [Google Scholar] [CrossRef] [PubMed]
- Zheng, Z.; Zhang, T.; Yan, L. Saliency model for object detection: Searching for novel items in the scene. Opt. Lett. 2012, 37, 1580–1582. [Google Scholar] [CrossRef] [PubMed]
- Ranganath, C.; Rainer, G. Cognitive neuroscience: Neural mechanisms for detecting and remembering novel events. Nat. Rev. Neurosci. 2003, 4, 193. [Google Scholar] [CrossRef] [PubMed]
- Huang, G.; Liu, Z.; Weinberger, K.Q.; van der Maaten, L. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’17), Honolulu, HI, USA, 21–26 July 2017; Volume 1, p. 3. [Google Scholar]
- Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv, 2015; arXiv:1502.03167. [Google Scholar]
- Glorot, X.; Bordes, A.; Bengio, Y. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (ICAIS), Klagenfurt, Austria, 6–8 September 2011; pp. 315–323. [Google Scholar]
- Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G.S.; Davis, A.; Dean, J.; Devin, M.; et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv, 2016; arXiv:1603.04467. [Google Scholar]
- Steinwart, I.; Christmann, A. Support Vector Machines; Springer Science & Business Media: Berlin, Germany, 2008. [Google Scholar]
- Weng, Q.; Mao, Z.; Lin, J.; Guo, W. Land-use classification via extreme learning classifier based on deep convolutional features. IEEE Geosci. Remote Sens. Lett. 2017, 14, 704–708. [Google Scholar] [CrossRef]
- Cheriyadat, A.M. Unsupervised feature learning for aerial scene classification. IEEE Trans. Geosci. Remote Sens. 2014, 52, 439–451. [Google Scholar] [CrossRef]
- Zhang, F.; Du, B.; Zhang, L. Saliency-guided unsupervised feature learning for scene classification. IEEE Trans. Geosci. Remote Sens. 2015, 53, 2175–2184. [Google Scholar] [CrossRef]
- Ren, J.; Jiang, X.; Yuan, J. Learning LBP structure by maximizing the conditional mutual information. Pattern Recognit. 2015, 48, 3180–3190. [Google Scholar] [CrossRef]
- Negrel, R.; Picard, D.; Gosselin, P.H. Evaluation of second-order visual features for land-use classification. In Proceedings of the 2014 12th International Workshop on Content-Based Multimedia Indexing (CBMI 2014), Klagenfurt, Austria, 18–20 June 2014; pp. 1–5. [Google Scholar]
- Huang, L.; Chen, C.; Li, W.; Du, Q. Remote sensing image scene classification using multi-scale completed local binary patterns and fisher vectors. Remote Sens. 2016, 8, 483. [Google Scholar] [CrossRef]
- Ji, W.; Li, X.; Lu, X. Bidirectional Adaptive Feature Fusion for Remote Sensing Scene Classification. In Computer Vision, Proceedings of the Second CCF Chinese Conference, CCCV 2017, Tianjin, China, 11–14 October 2017; Springer: Berlin/Heidelberg, Germany, 2017; pp. 486–497. [Google Scholar]
- Bian, X.; Chen, C.; Tian, L.; Du, Q. Fusing local and global features for high-resolution scene classification. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2017, 10, 2889–2901. [Google Scholar] [CrossRef]
- Gu, X.; Angelov, P.P.; Zhang, C.; Atkinson, P.M. A massively parallel deep rule-based ensemble classifier for remote sensing scenes. IEEE Geosci. Remote Sens. Lett. 2018, 15, 345–349. [Google Scholar] [CrossRef]
- Zhu, Q.; Zhong, Y.; Wu, S.; Zhang, L.; Li, D. Scene Classification Based on the Sparse Homogeneous–Heterogeneous Topic Feature Model. IEEE Trans. Geosci. Remote Sens. 2018, 56, 2689–2703. [Google Scholar] [CrossRef]
Name | # Images | Name | # Images | Name | # Images |
---|---|---|---|---|---|
airport | 360 | farmland | 370 | port | 380 |
bare land | 310 | forest | 250 | railway station | 260 |
baseball field | 220 | industrial | 390 | resort | 290 |
beach | 400 | meadow | 280 | river | 410 |
bridge | 360 | medium residential | 290 | school | 300 |
center | 260 | mountain | 340 | sparse residential | 300 |
church | 240 | park | 350 | square | 330 |
commercial | 350 | parking | 390 | stadium | 290 |
dense residential | 410 | playground | 370 | storage tanks | 360 |
desert | 300 | pond | 420 | viaduct | 420 |
Method | 50% Training Ratio | 80% Training Ratio |
---|---|---|
TEX-TS-Net (concatenation) | 94.31 | 95.71 |
SAL-TS-Net (concatenation) | 94.46 | 96.17 |
TEX-TS-Net (addition) | 95.25 | 96.62 |
SAL-TS-Net (addition) | | |
TEX-TS-Net (our feature fusion model) | 97.55 | 98.40 |
SAL-TS-Net (our feature fusion model) | | |
Method | 20% Training Ratio | 50% Training Ratio |
---|---|---|
TEX-TS-Net (concatenation) | 88.46 | 90.15 |
SAL-TS-Net (concatenation) | 89.15 | 91.25 |
TEX-TS-Net (addition) | 88.56 | 90.29 |
SAL-TS-Net (addition) | | |
TEX-TS-Net (our feature fusion model) | 93.31 | 95.17 |
SAL-TS-Net (our feature fusion model) | | |
Method | 10% Training Ratio | 20% Training Ratio |
---|---|---|
TEX-TS-Net (concatenation) | 79.58 | 81.13 |
SAL-TS-Net (concatenation) | 79.69 | 81.46 |
TEX-TS-Net (addition) | 79.63 | 81.22 |
SAL-TS-Net (addition) | | |
TEX-TS-Net (our feature fusion model) | 84.77 | 86.36 |
SAL-TS-Net (our feature fusion model) | | |
Method | 50% Training Ratio | 80% Training Ratio |
---|---|---|
SCK [33] | - | |
SPCK [34] | - | |
BoVW [53] | - | |
BoVW + SCK [33] | - | |
SIFT + SC [87] | - | |
SSEA [88] | - | |
MCMI [89] | - | |
OverFeat [54] | - | |
VLAD [90] | - | |
VLAT [90] | - | |
MS-CLBP + FV [91] | ||
GoogLeNet [51] | ||
CaffeNet [51] | ||
VGG-VD-16 [51] | ||
Bidirectional adaptive feature fusion [92] | - | |
CNN-ELM [86] | - | |
[93] | ||
Appearance-based [56] | - | |
TEX-Net-LF [61] | ||
CaffeNet with DCF [60] | ||
MDDC [56] | - | |
VGG-VD16 with DCF [60] | ||
DRB Ensemble [94] | - | |
LASC-CNN (single-scale) [58] | - | |
Aggregate strategy 1 [59] | ||
Aggregate strategy 2 [59] | ||
Fusion by addition [62] | - | |
LASC-CNN (multiscale) [58] | - | |
SHHTFM [95] | - | |
Ours |
Method | 20% Training Ratio | 50% Training Ratio |
---|---|---|
BoVW [93] | - | |
MS-CLBP+FV [93] | - | |
GoogLeNet [51] | ||
VGG-VD-16 [51] | ||
CaffeNet [51] | ||
DCA with concatenation [62] | - | |
[93] | ||
Fusion by concatenation [62] | - | |
Fusion by addition [62] | - | |
TEX-Net-LF [61] | ||
Bidirectional adaptive feature fusion [92] | - | |
Multilevel fusion [66] | - | |
Ours |
© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Yu, Y.; Liu, F. Dense Connectivity Based Two-Stream Deep Feature Fusion Framework for Aerial Scene Classification. Remote Sens. 2018, 10, 1158. https://doi.org/10.3390/rs10071158