1. Introduction
Land cover classification has been used in change monitoring [1], construction surveying [2], agricultural management [3], digital terrain model (DTM) generation [4], and identifying emergency landing sites for UAVs during engine failures [5,6]. Other uses of land cover classification include biodiversity conservation [7], land-use [8], and urban planning [9].
Trees, shrubs, and grass are three vegetation-type land covers, and classifying them using remote sensing data has several important applications. For example, shrub information has been used to assess grassland condition and determine whether a grassland has become unusable because of shrub encroachment [3]. In emergency landings of unmanned air vehicles (UAVs), it is critical to land on grass rather than on trees or shrubs [5,6]. Removing tall vegetation such as trees and shrubs from the digital surface model (DSM) is an important step in developing an accurate digital terrain model (DTM) [2]. Traditionally, the normalized difference vegetation index (NDVI) has been used for vegetation detection. However, NDVI cannot differentiate trees, shrubs, and grass because of their similar spectral characteristics. Moreover, NDVI requires a near infrared (NIR) band, which may not always be available.
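The NDVI computation referred to above is the standard normalized ratio of the NIR and red reflectances. A minimal sketch (the band values below are made-up reflectances, used only for illustration):

```python
import numpy as np

def ndvi(nir, red, eps=1e-9):
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red).

    Values near +1 indicate dense green vegetation; values near 0 or below
    indicate bare soil, water, or built surfaces. Trees, shrubs, and grass
    all yield similarly high NDVI values, which is why NDVI alone cannot
    separate them.
    """
    nir = np.asarray(nir, dtype=np.float64)
    red = np.asarray(red, dtype=np.float64)
    return (nir - red) / (nir + red + eps)  # eps avoids division by zero

# Toy example: a healthy-vegetation pixel vs. a bare-soil pixel.
nir_band = np.array([0.50, 0.30])
red_band = np.array([0.08, 0.25])
print(ndvi(nir_band, red_band))  # ~[0.72, 0.09]
```

Both trees and grass would sit near the high end of this index, illustrating why height or spatial context is needed to tell them apart.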
For accurate classification of these three vegetation land covers, light detection and ranging (LiDAR) data, which provide height information via the extracted digital terrain model (DTM), are highly beneficial to the classification process [3], since the three vegetation types differ in height. Nonetheless, while LiDAR helps detect tall trees, it is still challenging to distinguish some shrubs from grass [10]. Moreover, NIR and LiDAR data may be expensive to acquire.
Besides using LiDAR to extract height information in the form of a DTM, there is also considerable interest in the remote sensing community in estimating DTMs from stereo images [11,12,13,14]. DTM estimates from stereo images can, however, be noisy at lower heights. Auxiliary methodologies that utilize the spatial information of land covers together with their spectral information could help make these DTM estimates more accurate. In contrast to NIR and LiDAR data, RGB images can be easily obtained with low-cost color cameras. Cost is especially important for farmers, who may have limited budgets; in many agricultural monitoring applications, farmers simply fly a low-cost drone with an onboard low-cost color camera over farmland to monitor agricultural conditions.
There is increasing interest in adapting deep learning methods for land cover classification, following breakthroughs in a variety of computer vision tasks, including image classification, object detection and tracking, and semantic segmentation. In [7], convolutional neural network (CNN)-based methods are compared with state-of-the-art object-based image analysis methods for detecting a protected plant of the shrub family, Ziziphus lotus, using high-resolution Google Earth™ images. The authors reported higher accuracies with the CNN detectors than with the other investigated object-based image analysis methods. In [15], progressive cascaded convolutional neural networks are used for single tree detection with Google Earth imagery. In [16], Basu et al. investigated deep belief networks, basic CNNs, and stacked denoising autoencoders on the SAT-6 remote sensing dataset, which includes barren land, trees, grassland, roads, buildings, and water bodies as land cover types. In [17], color descriptors and deep CNNs are evaluated on the University of California Merced Land Use (UCM) dataset with 21 classes. In [18], a comprehensive review of land cover classification and object detection approaches using high resolution imagery is provided. The authors evaluated the performance of deep learning models against traditional approaches and concluded that deep learning-based methods provide an end-to-end solution and outperform traditional pixel-based methods by utilizing both spatial and spectral information. A number of other works have also shown that pixel-level semantic segmentation with deep learning methods is quite promising for land cover classification [19,20,21,22].
In this paper, we focus on classifying three vegetation land covers (tree, shrub, and grass) using only RGB images. We use a semantic segmentation deep learning method, DeepLabV3+ [23], which has been shown to perform better than earlier deep learning methods such as SegNet [24], the Pyramid Scene Parsing Network (PSPNet) [25], and Fully Convolutional Networks (FCN) [26]. DeepLabV3+ uses the color image as its only input and does not need any separate feature extraction process such as texture. In our experiments, we used the Slovenia dataset [27], which is a low resolution dataset (10 m per pixel), and a custom dataset from an area in Oregon, USA. The land cover map of this area, which has 1 m per pixel resolution, is in the public domain [28], and we obtained the corresponding color image (~0.25 m/pixel) from Google Maps. Both the Slovenia and Oregon datasets include these three vegetation types in addition to some other land cover types.
DeepLabV3+ is first applied to both the low and high resolution datasets using all land covers. In both datasets, the number of pixels representing some land covers is much smaller than that of others, making the two datasets heavily imbalanced. Following suggestions from the developers of DeepLabV3+ posted on their GitHub page [29], we extracted the pixel counts for each land cover, computed the median frequency weights [30], and assigned these weights to the land cover classes when training the DeepLabV3+ models. For comparison purposes, we trained with both uniform weights and median frequency weights. With uniform weights, underrepresented classes such as shrub had quite low classification accuracies. With median frequency weights [30], the classification accuracies of the underrepresented classes improved considerably; the trade-off was a degradation in the accuracies of overrepresented classes such as tree. We then repeated the same classification investigation on the two datasets, this time including only the three vegetation classes (tree, shrub, and grass) and excluding all other land cover classes. In doing so, it is assumed that the three vegetation classes can be separated from the other land covers. The objective of this investigation was to create a pure classification scenario focusing only on these three vegetation classes, eliminating the impact of misclassifications of all other land covers on their accuracy, and thus better assess DeepLabV3+'s classification performance. This investigation showed similar trends with respect to median frequency weights: with uniform weights, shrub detection was very poor, and it improved significantly with median frequency weights. Moreover, when the vegetation-only classification results are compared with the all-land-cover classification results, a considerable accuracy improvement is observed for all three vegetation types. This analysis also indicated that the highest classification accuracy corresponded to tree, whereas shrub was the most difficult class to classify correctly.
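Median frequency balancing [30] weights each class by the ratio of the median class frequency to that class's own frequency, so rare classes such as shrub are up-weighted. A minimal sketch of the idea (the pixel counts below are hypothetical, and this simplified version computes class frequency over the whole training set rather than only over images in which the class appears, as done in [30]):

```python
import numpy as np

def median_frequency_weights(pixel_counts):
    """Median frequency balancing from per-class pixel counts.

    Classes rarer than the median frequency receive weights > 1;
    common classes receive weights < 1.
    """
    counts = np.asarray(pixel_counts, dtype=np.float64)
    freqs = counts / counts.sum()          # per-class pixel frequency
    return np.median(freqs) / freqs       # weight_c = median_freq / freq_c

# Hypothetical pixel counts for (tree, shrub, grass): shrub is rare.
weights = median_frequency_weights([900_000, 50_000, 400_000])
print(weights)  # shrub receives the largest weight
```

These per-class weights are then used to scale the cross-entropy loss terms during training, which is what boosts the accuracy of underrepresented classes at some cost to overrepresented ones.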
A comparison of DeepLabV3+ with two other pixel-based machine learning classification methods, support vector machines [31] and random forests [32], in terms of classification performance and computation time showed that DeepLabV3+ generates more accurate classification results at the cost of longer model training time.
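For context, a pixel-based baseline of the kind compared against treats each pixel's band values as an independent feature vector, ignoring spatial context. A minimal sketch of such a random forest baseline, assuming scikit-learn is available (the image and label arrays are synthetic stand-ins for real training tiles, and real pixel-based pipelines would typically append GLCM or Gabor texture features to each row of X):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
h, w = 32, 32
image = rng.random((h, w, 3))             # synthetic RGB tile in [0, 1]
labels = rng.integers(0, 3, size=(h, w))  # 0=tree, 1=shrub, 2=grass (synthetic)

X = image.reshape(-1, 3)                  # one feature row per pixel (R, G, B)
y = labels.reshape(-1)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X, y)
pred = clf.predict(X).reshape(h, w)       # per-pixel class map
```

Because each pixel is classified in isolation, such baselines cannot exploit the spatial patterns that a semantic segmentation network like DeepLabV3+ learns, which is consistent with the accuracy gap reported above.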
It should be emphasized that we only used the RGB bands, without any help from LiDAR, NIR bands, or stereo images, and we still achieved 78% average classification accuracy on the Slovenia dataset and 79% on the Oregon dataset for trees, shrubs, and grass (vegetation-only classification). For comparison, the method in [3] (even though the dataset used in that work was different from ours) attained only 53% for the combined class of trees and shrubs when Red+Green+NIR bands were used. This clearly shows that the standalone use of DeepLabV3+ with only RGB images is effective, to some extent, for classifying trees, shrubs, and grass. It is also low cost, since low-resolution color cameras can be used. Moreover, it can serve as an auxiliary methodology to help make LiDAR-extracted or stereo-image-extracted DTM estimates more accurate. The contributions of this paper are:
Provided a comprehensive evaluation of a deep learning-based semantic segmentation method, DeepLabV3+, for the classification of three similar-looking vegetation types (tree, shrub, and grass) using only color images, at both low and high resolution, and outlined classification performance and computation time comparisons of DeepLabV3+ with two pixel-based classifiers.
Discussed the data imbalance issue with DeepLabV3+ and demonstrated that its average class accuracy can be increased considerably by using median frequency weights rather than uniform weights during model training.
Demonstrated that a higher classification accuracy can be achieved for each of the three vegetation types (tree, shrub, and grass) with DeepLabV3+ if the classification is limited to the three green vegetation classes only, rather than including all land covers present in the image datasets.
Provided insights into which of these three vegetation types are more challenging to classify.
Our paper is organized as follows. Section 2 provides technical information about DeepLabV3+ and the datasets used in our experiments. Section 3 contains two case studies (8-class and 3-vegetation-only) for the Slovenia dataset, another two case studies (6-class and 3-vegetation-only) for the Oregon dataset, and a performance and computation time comparison of DeepLabV3+ with two pixel-based classifiers. Finally, Section 4 concludes the paper with some remarks.
4. Conclusions
Without using NIR and LiDAR data, it is challenging to correctly classify trees, shrubs, and grass. In some cases, even the use of NIR and LiDAR may not provide highly accurate results, and it is important to have auxiliary methods whose outputs can serve as supporting information to increase confidence in classification decisions made with LiDAR data. In this paper, we report new results using a semantic segmentation-based deep learning method to tackle this challenging problem using only RGB images.
We provided a comprehensive evaluation of DeepLabV3+ for classifying three similar-looking vegetation types (tree, shrub, and grass) using only color images, on both low resolution and high resolution datasets. The data imbalance issue with DeepLabV3+ was discussed, and it was demonstrated that the average class accuracy can be increased considerably by using median frequency weights instead of uniform weights during model training. On both datasets, higher tree, grass, and shrub classification accuracies were achieved when the classification was limited to these three vegetation classes only, rather than including all other land cover types present in the color image datasets. In both the Slovenia and Oregon datasets, the highest classification accuracy corresponded to the "tree" class, whereas the "shrub" class was found to be the most challenging to classify accurately. In addition, the performance of DeepLabV3+ was compared with two state-of-the-art machine learning classification algorithms (SVM and random forests) that use RGB pixel values, GLCM and Gabor texture features, and combinations of the two texture feature sets. DeepLabV3+ outperformed both SVM and random forests. Being a semantic segmentation-based method, DeepLabV3+ has an advantage over pixel-based classifiers in that it utilizes both spectral (via the RGB bands only) and spatial information.
Future research directions include customizing the DeepLabV3+ framework to accept more than three channels (adding the NIR band to the three color bands) and utilizing a digital terrain model (DTM), obtained either from LiDAR sensor data or by estimating the DTM from stereo satellite images, to further improve the classification accuracy of tree, grass, and shrub.