Article

Few-Shot Hyperspectral Remote Sensing Image Classification via an Ensemble of Meta-Optimizers with Update Integration

1 Interdisciplinary Data Mining Group, School of Mathematics, Shandong University, Jinan 250100, China
2 Wolfson College, University of Oxford, Oxford OX2 6UD, UK
* Author to whom correspondence should be addressed.
Remote Sens. 2024, 16(16), 2988; https://doi.org/10.3390/rs16162988
Submission received: 7 May 2024 / Revised: 29 July 2024 / Accepted: 13 August 2024 / Published: 14 August 2024

Abstract

Hyperspectral images (HSIs) with abundant spectra and high spatial resolution can satisfy the demand for the classification of adjacent homogeneous regions and accurately determine their specific land-cover classes. Due to the potentially large variance within the same class in hyperspectral images, classifying HSIs with limited training samples (i.e., few-shot HSI classification) is especially difficult. To solve this issue without adding training costs, we propose an ensemble of meta-optimizers that are generated one by one by periodically annealing the learning rate during the meta-training process. Such a combination of meta-learning and ensemble learning demonstrates a powerful ability to optimize deep networks with only a few HSI training samples. To further improve the classification performance, we introduce a novel update integration process that determines the most appropriate update for the network parameters during model training. Compared with popular human-designed optimizers (Adam, AdaGrad, RMSprop, SGD, etc.), our proposed model performed better in convergence speed, final loss value, overall accuracy, average accuracy, and Kappa coefficient on five HSI benchmarks in a few-shot learning setting.

1. Introduction

Hyperspectral remote sensing is a valuable geoinformation monitoring technique that typically captures hundreds of narrow spectral bands over the same region, revealing unique physical features of ground targets. The high spectral dimensionality and spatial resolution of hyperspectral images (HSIs) make accurate discrimination of different land-cover classes possible [1,2,3]. Since an HSI is formed pixel by pixel, its classification assigns to each pixel a unique label representing a specific land-cover class. At present, HSI classification has become an indispensable data-driven technique for achieving the United Nations Sustainable Development Goals (SDGs). For example, in agricultural settings, HSI classification is used to accurately capture the status of vegetation health, soil type, and moisture and thereby support reasonable decisions on fertilization, irrigation, and management [4,5]; in geological exploration, HSI classification is used to identify geological features, such as mineral and rock types, which aids in finding potential mineral deposits [6]; in environmental protection, HSI classification is used to monitor desertification processes and changes in land use [7,8,9,10,11]. HSI classification can also monitor natural disasters such as fires, floods, and earthquakes, making disaster response and mitigation efforts simpler and quicker [12,13].
Traditional HSI classifiers rely heavily on hand-crafted features [14,15,16,17]. Benediktsson et al. [18] constructed morphological profile features with the help of mathematical morphology. Jia et al. [19] used 3D Gabor filtering to extract spatial structure information. These carefully designed features were then fed into classifiers, such as support vector machines (SVMs) [20] and random forests [21]. Feature extraction techniques, such as principal component analysis (PCA) [22], independent component analysis (ICA) [23], and linear discriminant analysis (LDA) [24], were also introduced to reduce redundant spectral information and produced more distinguishable features. By incorporating spatial contextual information, spectral–spatial features can further improve the classification performance [17,25]. Compared with traditional techniques, deep learning can automatically learn and capture high-level features from training data and achieve better classification accuracy [26,27,28,29,30]. Chen et al. [26] inputted spectral vectors and spatial features extracted by PCA into a stacked autoencoder to generate high-level joint spectral–spatial features and then implemented the classification by a logistic regression layer and achieved more accurate results than by traditional machine learning on KSC and Pavia HSI datasets. Mei et al. [30] used a 3D convolutional network to extract spatial and spectral features simultaneously and then achieved better results on Indian Pine, Salinas, and Pavia HSI datasets.
Manually labeling HSI samples is not only time-consuming and laborious but also requires much professional expertise. Due to the high annotation cost, only a limited number of annotated samples may be available, leading to a small training dataset size. Moreover, in hyperspectral images, the number of pixels for different land-cover classes can be significantly imbalanced, with some classes having far fewer samples than others. Severe overfitting occurs when training neural networks with insufficient samples. Current HSI classification faces the challenge of few-shot learning [31]. As an advance of deep learning, meta-learning [32,33] has become the most promising approach to deal with the issue of few-shot learning.
The idea of meta-learning is to enable models to not only acquire task-specific knowledge but also learn how to learn new tasks more effectively, so it allows models to make better use of limited training samples and demonstrate robust generalization ability on new tasks [34,35,36]. Although meta-learning has begun to be used in few-shot HSI classification [2,37], the impacts of certain optimizers on the final classification performance are often ignored. Since the popular human-designed optimizers, such as stochastic gradient descent (SGD) [38], RMSprop [39], AdaGrad [40], and Adam [41], are designed for general tasks, these optimizers cannot improve their performance through mining unique features of a specific HSI task, and they can possibly lead to serious overfitting, especially when dealing with insufficient training samples [42]. Incorporating prior knowledge of the training task into an optimizer can significantly reduce the size of the required training data and avoid overfitting [43,44]. Such an optimizer is called a meta-optimizer or meta-learning-based optimizer [45].
This study aimed to improve meta-optimizers and apply them in few-shot HSI classification. Since ensemble learning can improve the generalization ability by combining individual learners and then reduce overfitting risks and enhance the model robustness, we propose an ensemble of meta-optimizers that were generated one by one through utilizing periodic annealing on the learning rate during the meta-training process. Such a combination of meta-learning and ensemble learning not only has no significant added training costs but also demonstrates a powerful ability to optimize the deep network on few-shot HSI training. In each optimization iteration, a meta-optimizer ensemble can generate several candidates to update the parameters of the trained model. In order to integrate these candidate updates to effectively improve the model performance, we further developed an update integration technique to measure and incorporate the potential of each candidate. The real-world experiments on few-shot HSI classification consistently verified the effectiveness of our meta-optimizer ensemble with update integration over the widely used human-designed optimizers across five benchmark HSI datasets.

2. Related Works

During the HSI classification process, when sufficient labeled samples are provided, accurate classification can be easily implemented under a supervised classification framework [46]. Let a learning task T be specified by a dataset D = {(x_i, y_i)}_{i=1}^{n} and a loss function L. The dataset D is split into a training set D_train = {(x_i, y_i)}_{i=1}^{k} and a test set D_test = {(x_i, y_i)}_{i=k+1}^{n}. The learning target is to obtain a predictive function f with parameters θ by solving the following optimization equation:

$$\theta = \arg\min_{\theta} \sum_{(x, y) \in D_{train}} \mathcal{L}\left(f_{\theta}(x), y\right) \tag{1}$$

where the sample subscript i is omitted for simplicity. The performance of f is finally evaluated on D_test. When only limited labeled HSI samples are provided, Equation (1) is prone to becoming stuck in local optima.
Meta-learning aims to make the model learn how to learn so that it can more flexibly adapt to various learning scenarios. It consists of the learner, which is responsible for specific task learning, and the meta-learner, which guides the learner's learning process. The training of the meta-learner typically involves two stages: the first stage is to learn across a set of tasks, and the second stage is to evaluate and adjust the meta-learner on new tasks [42]. The optimal meta-knowledge ω is found by

$$\omega = \arg\min_{\omega} \mathbb{E}_{T \sim p(T)}\, \mathcal{L}(\omega; D) \tag{2}$$

where p(T) is a distribution of tasks, and a task specifies a dataset and loss function T = {D, L}. In an actual implementation, only M source tasks D_source = {(D_source^train, D_source^test)^{(i)}}_{i=1}^{M} are randomly sampled from the task distribution p(T), so the expectation in the meta-training phase (2) is replaced simply by summation:

$$\omega = \arg\min_{\omega} \sum_{i=1}^{M} \mathcal{L}\left(D_{source}^{(i)}; \omega\right). \tag{3}$$

This meta-knowledge ω is further used to train the predictive model f with parameters θ on Q target tasks D_target = {(D_target^train, D_target^test)^{(i)}}_{i=1}^{Q}. For each target task i, the meta-testing stage is written as follows:

$$\theta^{(i)} = \arg\min_{\theta} \mathcal{L}\left(D_{target}^{train\,(i)}; \theta; \omega\right). \tag{4}$$
Compared with Equation (1) in typical supervised learning, the learning phase of Equation (4) benefits from the learned meta-knowledge ω .
The meta-knowledge ω can be in the form of the initial parameters [32] or optimization strategy [44], etc. By treating meta-knowledge ω as the optimization strategy, learning to optimize is proposed to optimize the learning process itself [47]. Specifically, Andrychowicz et al. [43] trained a two-layer LSTM network (LSTM optimizer) to generate dynamic updates. The idea is to replace the regular update with an update generated by an LSTM network. Guided by the meta-knowledge stored in the network parameters, the LSTM optimizer can incorporate historical gradient information to generate updates. Andrychowicz et al. [43] demonstrated that the LSTM optimizer outperforms the human-designed optimizers in a variety of tasks, including training neural networks, convex problems, and styling images with neural art. Ravi et al. [44] used LSTM networks to learn update rules for few-shot learning and set the cell state of LSTM networks as the learner’s parameters, and the candidate cell state determined the gradient information for parameter updates. Li et al. [48] proposed a conceptually simple meta-optimizer, Meta-SGD, for few-shot learning. Meta-SGD learns the initialization and learning rate instead of the whole update rules. Compared with the LSTM optimizer, Meta-SGD is more straightforward to implement, but its generalization capacity needs to be improved. Chen et al. [49] trained an RNN-based meta-optimizer for global optimization of black-box functions. Their meta-optimizer, trained on synthetic functions, can optimize a broad class of black-box functions. Wang et al. [45] designed a meta-optimizer called HyperAdam, the parameter update generated by which is an adaptive combination of multiple candidate updates produced by Adam using different decay rates.
Hyperspectral images have high spectral dimensionality and high spatial complexity, which may slow down the convergence speed of classification during training. A meta-optimizer can learn some knowledge from an HSI dataset and then use the learned knowledge to optimize a model on a new HSI dataset. To the best of our knowledge, no study has focused on learning a meta-optimizer for few-shot HSI classification.

3. Proposed Method

Hyperspectral images (HSIs) with abundant spectra and high spatial resolution can satisfy the demand for accurately determining specific land-cover classes. Due to the significant variance within the same class in hyperspectral images, few-shot HSI classification becomes especially difficult. To solve this issue without adding training costs, we propose a new approach: an ensemble of meta-optimizers with update integration for few-shot HSI classification.

3.1. Ensembles of Meta-Optimizers

The training in conventional machine learning can be expressed as the optimization in Equation (1). Its solution is typically based on stochastic gradient descent and its variants through the following iterative process:

$$\theta_{t+1} = \theta_t - \alpha_t \cdot \nabla_{\theta_t} \mathcal{L} \tag{5}$$

where α_t is the learning rate at time step t. Since Equation (5) usually requires thousands of iterations to find the optimum or a local optimum, the performance of these optimizers becomes very weak if the training samples are insufficient. To overcome this drawback, the meta-optimizer (i.e., an optimizer trained by meta-learning) is used to replace the human-designed stochastic gradient descent algorithms. It may take the form of a two-layer long short-term memory (LSTM) network, which can integrate information from the history of gradients to determine the parameter update. Such a meta-optimizer is also called an LSTM optimizer. Denote the hidden state of the LSTM by h and its output by g; by using the LSTM optimizer G_φ to optimize the parameters θ of a model f, the sequence of updates becomes

$$(g_t, h_t) = G_{\phi}\left(\nabla_{\theta_t} \mathcal{L}, h_{t-1}\right) \tag{6}$$

$$\theta_{t+1} = \theta_t + g_t. \tag{7}$$
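As a concrete illustration of Equations (6) and (7), the following PyTorch sketch shows one possible coordinate-wise LSTM optimizer. It is a minimal sketch under our assumptions (a two-layer LSTM with 20 hidden units, as in the experimental setup of Section 4.2, and a two-channel preprocessed gradient input as described below), not the exact released implementation.

```python
import torch
import torch.nn as nn

class LSTMOptimizer(nn.Module):
    """Coordinate-wise LSTM meta-optimizer G_phi (Equations (6) and (7))."""

    def __init__(self, input_size: int = 2, hidden_size: int = 20):
        super().__init__()
        # The same small LSTM is shared across all coordinates of theta.
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers=2)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, grads: torch.Tensor, state=None):
        # grads: (num_params, input_size) preprocessed gradients of the model f.
        x = grads.unsqueeze(0)                        # (seq_len=1, num_params, input_size)
        out, state = self.lstm(x, state)              # hidden state h_t carried in `state`
        g_t = self.head(out).squeeze(0).squeeze(-1)   # additive update, shape (num_params,)
        return g_t, state

# Usage: theta_next = theta + g_t, i.e., Equation (7).
```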
In order to solve the challenge that different input gradients have very different magnitudes, making the meta-optimizers difficult to train, we adopted the standard normalization process in [43] to rescale the input gradients:
$$x \rightarrow \begin{cases} \left(\dfrac{\log|x|}{p},\ \operatorname{sgn}(x)\right) & \text{if } |x| \ge e^{-p} \\ \left(-1,\ e^{p} x\right) & \text{otherwise} \end{cases} \tag{8}$$
where p > 0 is the parameter controlling how small gradients are disregarded.
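A minimal sketch of the preprocessing in Equation (8), assuming it is applied element-wise to the flattened gradient; the two output channels are the input fed to the meta-optimizer.

```python
import math
import torch

def preprocess_gradient(g: torch.Tensor, p: float = 10.0) -> torch.Tensor:
    # Equation (8): encode each gradient component as (log|x| / p, sgn(x)) when
    # |x| >= e^{-p}, and as (-1, e^p * x) otherwise, so that gradients of very
    # different magnitudes land on a comparable scale.
    large = g.abs() >= math.exp(-p)
    first = torch.where(large, g.abs().clamp_min(1e-20).log() / p,
                        torch.full_like(g, -1.0))
    second = torch.where(large, torch.sign(g), math.exp(p) * g)
    return torch.stack([first, second], dim=-1)  # shape (*g.shape, 2)
```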
Minimizing the loss function
$$\mathcal{L}_{meta}(\phi) = \mathbb{E}_{f}\left[\sum_{t=1}^{T} \mathcal{L}(\theta_t; D)\right] \tag{9}$$
can obtain the optimal parameters ϕ of meta-optimizer G , which can reduce the objective function L ( θ t ; D ) as much as possible.
To build an ensemble of meta-optimizers without adding training costs, we performed periodic annealing on the learning rate. When the learning rate anneals to the minimum value, the meta-optimizer parameters are saved and added to the ensemble. In this way, we can obtain multiple meta-optimizers with different parameters without increasing the training costs. The learning rate α changes in each annealing cycle according to the following formula:
$$\alpha = \alpha_{min} + \frac{1}{2}\left(\alpha_{max} - \alpha_{min}\right)\left(1 + \cos\left(\frac{\pi\, t_{cur}}{T_p}\right)\right) \tag{10}$$

where α_max and α_min indicate the learning rate range, and t_cur is the number of iterations since the last restart. A good α_max can accelerate the training process and help the model escape from local minima, while α_min is simply a sufficiently small number. T_p is a hyper-parameter representing the period of cosine annealing. When the training process starts, t_cur = 0 and α = α_max. After T_p iterations, the learning rate decreases to its minimum α_min, and one annealing cycle is completed. At the beginning of the next annealing cycle, t_cur becomes 0 again, and the learning rate abruptly returns to α_max.
When the learning rate is small, the trained model tends to converge into the closest local minimum [50]. Once the learning rate reaches its minimum α_min, the corresponding meta-optimizer will be added to the meta-optimizer ensemble. Then, a large enough learning rate α_max is used to escape the current local minimum and restart a new annealing cycle. After N annealing cycles, we obtain an ensemble of N meta-optimizers, denoted as MOE_N = {G_{φ_i}}_{i=1}^{N}.
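The following sketch illustrates the schedule of Equation (10) together with the snapshot step; `meta_train_step`, `lr_min`, and `lr_max` are placeholders for one meta-training iteration of Algorithm 1 and for the actual learning-rate range, which are assumptions for illustration rather than the paper's exact settings.

```python
import copy
import math

def annealed_lr(t_cur: int, T_p: int, lr_min: float, lr_max: float) -> float:
    # Equation (10): cosine annealing from lr_max down to lr_min over one period T_p.
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t_cur / T_p))

def build_meta_optimizer_ensemble(meta_optimizer, meta_train_step,
                                  n_cycles: int, T_p: int,
                                  lr_min: float = 1e-4, lr_max: float = 1e-2):
    # meta_train_step(meta_optimizer, lr): hypothetical callback running one
    # outer meta-training iteration of Algorithm 1 at learning rate lr.
    ensemble = []
    for _ in range(n_cycles):
        for t_cur in range(T_p + 1):
            lr = annealed_lr(t_cur, T_p, lr_min, lr_max)
            meta_train_step(meta_optimizer, lr)
        # The learning rate has reached lr_min: snapshot this meta-optimizer.
        ensemble.append(copy.deepcopy(meta_optimizer))
    return ensemble  # MOE_N, one snapshot per annealing cycle
```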
The ensemble process of meta-optimizers with learning rate annealing is shown in Algorithm 1. The whole meta-training phase is a double-loop structure, which differs from the traditional training process.
Algorithm 1 The proposed ensemble of meta-optimizers.
  • Require:
  • 1: The predictive model f with initial parameters θ_0.
  • 2: Source HSI dataset D_source = {(x_i, y_i)}_{i=1}^{N}; batch size B.
  • 3: Meta-optimizer G with initial parameters φ_0; MOE_N = ∅.
  • 4: Learning rate α; learning rate range α_max and α_min; annealing period T_p.
  • 5: Loss function L; meta-loss function L_meta.
  • Begin:
  • 6: while t_cur = 0, 1, 2, … do
  • 7:     L_meta = 0
  • 8:     α = α_min + (1/2)(α_max − α_min)(1 + cos(π t_cur / T_p))
  • 9:     for t = 1, 2, …, T do
  • 10:        Randomly draw B training samples D_B = {(x_i, y_i)}_{i=1}^{B} from D_source
  • 11:        Calculate the current loss value L_t on D_B
  • 12:        Calculate the gradients ∇_{θ_t} L_t of f
  • 13:        Normalize ∇_{θ_t} L_t → [∇_{θ_t} L_t]_norm according to Equation (8)
  • 14:        (g_t, h_t) ← G_φ([∇_{θ_t} L_t]_norm, h_{t−1})
  • 15:        θ_{t+1} ← θ_t + g_t
  • 16:        L_meta ← L_meta + L_t
  • 17:     end for
  • 18:     Calculate ∇_φ L_meta
  • 19:     φ ← φ − α · ∇_φ L_meta
  • 20:     if α = α_min then
  • 21:        Add G_φ to MOE_N
  • 22:        t_cur = 0
  •         end if
  • 23: end while
  • 24: return MOE_N = {G_{φ_i}}_{i=1}^{N}
  • end

3.2. Update Integration

When optimizing a neural network, the meta-optimizer ensemble M O E N = { G ϕ i } i = 1 N can generate N candidate updates in each optimization iteration. To optimally integrate the candidate updates   { g i } i = 1 N into the final update g and achieve faster convergence, we propose an update integration algorithm that can select a candidate update as the final update by estimating the quality of each update. The algorithm’s core lies in how to measure the quality of each update. Typically, a better update at step t results in lower loss values for the predictive model from step t + 1 to the final step T . However, considering too many subsequent loss values is computationally expensive. As a compromise, only the loss value at step t + 1 is used to quantitatively measure the update at step t . At the same time, historical loss values are also used as a reference for measurement, since the meta-optimizer that produces better updates earlier will often have a higher probability of producing a better update later, and historical loss information can help reduce the stochasticity in calculating loss values using batch data. The details of the proposed update integration algorithm are as follows:
Suppose that the predictive model to be updated is f_θ. At time step t, we sample a batch of samples from the target dataset and calculate the normalized gradients [∇_{θ_t} L]_norm. The normalized gradients are then input into MOE_N to obtain N candidate updates {g_t^i}_{i=1}^{N}. Then, we sample another mini-batch of training data D_mini = {(x_i, y_i)}_{i=1}^{b} (b is a small number) to calculate N new loss values L(θ_t + g_t^i; D_mini) (i = 1, 2, …, N) and evaluate the following equations:

$$m_t^i = \beta\, m_{t-1}^i + (1 - \beta)\left[\mathcal{L}\left(\theta_t + g_t^i; D_{mini}\right) - \mathcal{L}(\theta_t)\right], \quad i = 1, 2, \ldots, N \tag{11}$$

$$i^{*} = \arg\min_{i} m_t^i, \qquad g_t = g_t^{i^{*}} \tag{12}$$

where β ∈ [0, 1) is a hyper-parameter. Based on the newly sampled D_mini, L(θ_t + g_t^i; D_mini) can approximate the loss value L(θ_{t+1}) at step t + 1 by assuming g_t^i is the actual update. Consequently, L(θ_t + g_t^i; D_mini) − L(θ_t) measures which candidate update can drop the loss value the most. Each m_t^i is a real number iterated over the time step t, and its initial value is set to 0. The update integration algorithm determines the optimal update g_t at time step t according to the minimum among m_t^i (i = 1, 2, …, N), as suggested in Equation (12). When β = 0, Equation (11) becomes m_t^i = L(θ_t + g_t^i; D_mini) − L(θ_t), meaning that the optimal update is determined based only on the current loss reduction. However, loss values can vary widely from one mini-batch of data to another, so we propose calculating the loss reductions with previous loss information to eliminate this stochasticity, which corresponds to the cases of β ≠ 0 in Equation (11).
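A minimal sketch of Equations (11) and (12), assuming the model parameters are kept as a single flattened tensor and that `loss_fn(params, batch)` is a hypothetical helper evaluating the predictive model's loss at the given parameters.

```python
import torch

def integrate_updates(theta: torch.Tensor, candidates, loss_fn, mini_batch,
                      m_prev, current_loss: float, beta: float = 0.1):
    # candidates: list of N candidate updates g_t^i produced by the ensemble.
    # m_prev:     list of N running scores m_{t-1}^i (initialized to 0).
    m_new = []
    for i, g in enumerate(candidates):
        with torch.no_grad():
            trial_loss = float(loss_fn(theta + g, mini_batch))  # L(theta_t + g_t^i; D_mini)
        # Equation (11): exponentially smoothed loss reduction.
        m_new.append(beta * m_prev[i] + (1.0 - beta) * (trial_loss - current_loss))
    # Equation (12): pick the candidate with the smallest (most negative) score.
    best = min(range(len(m_new)), key=lambda i: m_new[i])
    return candidates[best], m_new
```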
Averaging the outputs of an ensemble is a widely used combination strategy. Compared with the averaging method, our proposed update integration algorithm can flexibly choose an update direction with a lower loss value by looking one step ahead at the loss changes.

3.3. Incorporating Update Integration into Meta-Optimizer Ensembles

The framework of the proposed method is shown in Figure 1. In each optimization iteration, we first input a batch of training samples to the neural network to be optimized and perform backpropagation to obtain the gradient information ∇_θ L of the neural network. The meta-optimizer ensemble {G_{φ_i}}_{i=1}^{N} then takes the normalized gradients [∇_θ L]_norm as inputs, and each meta-optimizer G_{φ_i} outputs its suggested update g_i independently, directed by its parameters φ_i. Finally, the candidate updates {g_i}_{i=1}^{N} are integrated into the final update g by using the proposed update integration algorithm.
Since the meta-optimizer ensemble has learned how to optimize from related tasks, it requires fewer training samples, leading to low computational costs and memory requirements and faster convergence in fewer iterations. Algorithm 2 illustrates the total pseudo-code for using the proposed meta-optimizer ensemble with the update integration algorithm to optimize a model f .
Algorithm 2 The proposed update integration algorithm.
  • Require:
  • 1: The trained model f with initial parameters θ_1.
  • 2: Target dataset D_target = {(x_j, y_j)}_{j=1}^{K}; batch size B; mini-batch size b.
  • 3: MOE_N = {G_{φ_i}}_{i=1}^{N}; hyper-parameter β.
  • Initialize: m_0^i = 0 (i = 1, 2, …, N)
  • 4: for t = 1, 2, …, T do
  • 5:     Draw a random batch of training data D_train = {(x_j, y_j)}_{j=1}^{B} from D_target
  • 6:     Calculate the gradients ∇_{θ_t} L_t on D_train
  • 7:     Normalize ∇_{θ_t} L_t → [∇_{θ_t} L_t]_norm according to Equation (8)
  • 8:     Draw another random mini-batch D_mini = {(x_j, y_j)}_{j=1}^{b} from D_target
  • 9:     for i = 1, 2, …, N do
  • 10:        (g_t^i, h_t^i) ← G_{φ_i}([∇_{θ_t} L_t]_norm, h_{t−1}^i)
  • 11:        Calculate L(θ_t + g_t^i; D_mini)
  • 12:        m_t^i = β m_{t−1}^i + (1 − β)[L(θ_t + g_t^i; D_mini) − L(θ_t)]
  • 13:     end for
  • 14:     i* = arg min_i m_t^i
  • 15:     g_t = g_t^{i*}
  • 16:     θ_{t+1} = θ_t + g_t
  • 17: end for
  • 18: return the final parameters θ_{T+1}
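Putting the pieces together, the sketch below mirrors the structure of Algorithm 2; `sample_batch`, `sample_mini_batch`, `loss_and_grad`, and `loss_at` are hypothetical data and loss helpers introduced only for illustration, and `preprocess_gradient` and `integrate_updates` refer to the sketches given earlier.

```python
def train_with_moe_u(theta, ensemble, T: int, beta: float,
                     sample_batch, sample_mini_batch, loss_and_grad, loss_at):
    # theta: flattened model parameters; ensemble: list of N meta-optimizers G_phi_i.
    states = [None] * len(ensemble)   # hidden states h^i of each LSTM optimizer
    m = [0.0] * len(ensemble)         # running scores m^i, initialized to 0
    for _ in range(T):
        batch = sample_batch()
        loss_t, grad = loss_and_grad(theta, batch)   # L_t and its gradient on D_train
        grad_pp = preprocess_gradient(grad)          # Equation (8)
        candidates = []
        for i, G in enumerate(ensemble):
            g_i, states[i] = G(grad_pp, states[i])   # candidate update g_t^i
            candidates.append(g_i)
        mini = sample_mini_batch()
        g_t, m = integrate_updates(theta, candidates, loss_at, mini,
                                   m, float(loss_t), beta)   # Equations (11)-(12)
        theta = theta + g_t                          # Equation (7)
    return theta
```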

4. Real-World HSI Classification Experiments

We focused on real-world HSI classification under the N-way 10-shot learning setting, i.e., we randomly picked 10 samples from each of N land-cover classes to form the training set. We illustrated the effectiveness of our proposed method on five real-world HSI datasets—PaviaU, PaviaC, Salinas, SalinasA, and KSC. Due to environmental changes and observation conditions, the spectral features of the same land-cover class may vary at different times or locations. Moreover, hyperspectral images of the same region acquired by different satellites may differ in band configuration, spatial resolution, radiometric calibration, etc. Therefore, we designed the following seven source–target tasks to test the generalization ability and robustness of our proposed method. In each task, we trained the meta-optimizer on the source dataset and then used it to optimize a predictive model on the target dataset with few labeled data. The real-world HSI classification experimental tasks consisted of the following source–target combinations:
(1) Trained on PaviaU and tested on PaviaC (PaviaU–PaviaC task);
(2) Trained on PaviaC and tested on PaviaU (PaviaC–PaviaU task);
(3) Trained on SalinasA and tested on Salinas (SalinasA–Salinas task);
(4) Trained on PaviaC and tested on SalinasA (PaviaC–SalinasA task);
(5) Trained on SalinasA and tested on PaviaC (SalinasA–PaviaC task);
(6) Trained on PaviaC and tested on KSC (PaviaC–KSC task);
(7) Trained on KSC and tested on PaviaC (KSC–PaviaC task).
The PaviaU dataset and the PaviaC dataset were acquired by the ROSIS sensor, so tasks (1) and (2) can evaluate the generalization ability in different regions under the same sensor. The SalinasA dataset is a small subset of Salinas, so task (3) can test the generalization ability to deal with a large number of unseen classes. Robustness measures the ability of the model to maintain its functionality and performance under abnormal scenarios; so, to demonstrate the robustness of our model, tasks (4)–(7) adopted the strictest abnormal scenarios, where an HSI dataset from one sensor was used to train the model and an HSI dataset from a different sensor was used to test the robustness of the model. Our code is available at https://github.com/lazyhaotao/MOE-U.

4.1. Description of Datasets

The PaviaU HSI dataset was captured over Pavia using the ROSIS sensor. Its spatial size is 610 × 340 , and the number of spectral bands is 103. The PaviaU HSI dataset contains 42,776 labeled pixels in 9 land-cover classes (Table 1). The PaviaC HSI dataset was also acquired by the ROSIS sensor over the same region as PaviaU. Its spatial size is 1096 × 715 , and the number of spectral bands is 102. PaviaC has 148,152 labeled pixels in 9 land-cover classes (Table 2). The spatial resolution of these two datasets is 1.3 m.
The Salinas HSI dataset was collected by the 224-band AVIRIS sensor over Salinas Valley. Its spatial size is 512 × 217, and the number of bands is 204. It has 54,129 labeled samples classified into 16 land-cover classes (Table 3). The SalinasA HSI dataset is a small sub-scene of the Salinas dataset. It has a size of 83 × 86 and 6 land-cover classes (Table 4). The spatial resolution of these two datasets is 3.7 m.
The KSC HSI dataset was collected by the AVIRIS sensor over the Kennedy Space Center (KSC), Florida. Its spatial size is 512 × 614 , and the number of spectral bands is 176. It has 5211 labeled pixels in 9 land-cover classes (Table 5). The spatial resolution of the KSC dataset is 18 m.
In summary, five HSI datasets (PaviaU, PaviaC, Salinas, SalinasA, and KSC) were used to test the performance of our proposed method. These five datasets from real-world scenarios contain a huge number of land-cover classes and have been widely used as test datasets in research on hyperspectral remote sensing image classification. Table 6 shows a comparison of these five real-world HSI datasets.

4.2. Experimental Setup

The performance of various optimizers was evaluated with the convergence speed and final convergence value of the predictive model's loss function. We also evaluated the target HSI datasets for overall accuracy (OA), average accuracy (AA), and the Kappa coefficient (κ), which indirectly reflect the performance of various optimizers and the degree of overfitting. In detail, the OA is the proportion of correctly classified samples among all of the tested samples. The AA is defined as the average of the per-class classification accuracies. Compared with OA, AA pays more attention to the classes with fewer samples. The Kappa coefficient is a statistical index evaluating the consistency and classification accuracy and is defined as κ = (p_0 − p_e) / (1 − p_e), where p_0 is the OA and p_e is the hypothetical probability of chance agreement.
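For reference, OA, AA, and the Kappa coefficient can be computed from a confusion matrix as in the following sketch; this is a straightforward implementation of the definitions above, not code from the paper.

```python
import numpy as np

def classification_metrics(conf: np.ndarray):
    # conf[i, j]: number of samples of true class i predicted as class j.
    n = conf.sum()
    oa = np.trace(conf) / n                                   # overall accuracy
    per_class = np.diag(conf) / conf.sum(axis=1)              # per-class accuracy
    aa = per_class.mean()                                     # average accuracy
    p_e = (conf.sum(axis=0) * conf.sum(axis=1)).sum() / n**2  # chance agreement
    kappa = (oa - p_e) / (1.0 - p_e)                          # Kappa coefficient
    return oa, aa, kappa
```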
The Adam optimizer with learning rate α = 0.01 was used to train the meta-optimizer ensemble based on Algorithm 1. During the meta-training phase, we set B = 128 and T = 100. Then, we obtained an ensemble MOE_3 = {G_{φ_i}}_{i=1}^{3}. For the update integration algorithm, all of the experimental results were reported with β = 0.1. The LSTM optimizer consisted of a two-layer LSTM network, with each layer containing 20 hidden units. It was trained by meta-learning without using ensemble learning. The LSTM optimizer's parameters were saved after 1000 meta-training episodes on the source dataset. For the normalization method, we set p = 10 as usual.
The predictive model f trained by the above optimizers on the target dataset was a four-layer fully connected network with sigmoid activation functions and 20 nodes in each hidden layer. We input each HSI pixel of size 1 × 1 × (number of channels) to the predictive model for classification. The loss function used in our experiments was cross-entropy loss. Our real-world HSI classification experiments concentrated on the N-way 10-shot learning setting, i.e., we randomly picked 10 samples from each of N land-cover classes to form the training set. We repeated each experiment ten times and report the mean and standard deviation of the various classification performance indicators.
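A sketch of the experimental pieces described above, under the assumption that "four-layer" means four fully connected layers (three sigmoid-activated hidden layers of 20 nodes plus the output layer); the sampling helper builds the N-way 10-shot training set from labeled pixel spectra.

```python
import torch
import torch.nn as nn

def make_predictive_model(n_bands: int, n_classes: int, hidden: int = 20) -> nn.Sequential:
    # Four fully connected layers with sigmoid activations; the input is one
    # HSI pixel, i.e., its spectral vector of length n_bands.
    return nn.Sequential(
        nn.Linear(n_bands, hidden), nn.Sigmoid(),
        nn.Linear(hidden, hidden), nn.Sigmoid(),
        nn.Linear(hidden, hidden), nn.Sigmoid(),
        nn.Linear(hidden, n_classes),
    )

def sample_n_way_k_shot(X: torch.Tensor, y: torch.Tensor, k: int = 10):
    # X: (num_pixels, n_bands) spectral vectors; y: (num_pixels,) integer labels.
    # Returns k randomly chosen labeled pixels per land-cover class.
    idx = []
    for c in y.unique():
        members = (y == c).nonzero(as_tuple=True)[0]
        idx.append(members[torch.randperm(len(members))[:k]])
    idx = torch.cat(idx)
    return X[idx], y[idx]
```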

4.3. Experimental Results

For brevity, we refer to the meta-optimizer ensemble with update integration as MOE-U, and to averaging all of the candidate updates of a meta-optimizer ensemble as MOE-A. We compared the performance of MOE-U against the LSTM optimizer, MOE-A, and a selection of popular human-designed optimizers: Adam, AdaGrad, RMSprop, SGD, and SGD with momentum, whose learning rates are 0.005, 0.1, 0.001, 0.5, and 0.5, respectively.

4.3.1. PaviaU–PaviaC Task

In this task, we first trained the meta-optimizers, including the LSTM optimizer, MOE-A, and MOE-U, on the PaviaU dataset. The meta-optimizers were then used to train the predictive model f on the PaviaC dataset. To avoid overfitting, we set the number of training iterations as 50. The averaged loss change curves are shown in Figure 2. Meta-optimizers had much faster convergence speed than the human-designed optimizers, indicating that the learned prior knowledge from other datasets acquired from the same sensor was used to accelerate the convergence speed and significantly reduce the final convergence loss. The final averaged convergence values and their standard deviations (Std) are shown in Table 7. Our MOE-U also achieved the lowest final convergence value of 0.3786 and the lowest Std of 0.1128, displaying better performance than the other two meta-optimizers. These numerical results confirm that our proposed update integration algorithm can effectively accelerate the convergence process and achieve lower loss simultaneously. The final convergence value of MOE-A was 0.3819, which was higher than that of MOE-U. The reason for this may be that averaging different parameter updates can bias the best update direction away from the correct direction, and our update integration process can determine the best update direction by estimating the change in loss.
The classification performances by different optimizers are shown in Table 8, indicating that, compared with human-designed optimizers, meta-optimizers can achieve lower loss and better classification results while avoiding overfitting. The predictive model optimized by MOE-U achieved 0.8791, 0.8324, and 0.8340 on OA, AA, and Kappa, respectively, outperforming human-designed optimizers. Compared with the LSTM optimizer, our MOE-U increased the OA, AA, and Kappa values by 0.0277, 0.0294 and 0.0357, respectively. Compared with our MOE-A, our MOE-U increased these values by 0.0151, 0.0245, and 0.0249, respectively.
In total, the results demonstrate that MOE-U has good generalization ability. It can optimize a predictive model with limited training samples by using the learned knowledge from other HSI datasets from the same sensor. The best classification maps of various methods are displayed in Figure 3. The number after the optimizer name represents the accuracy on the test set. Our MOE-U also achieved the highest classification accuracy of 0.9408.

4.3.2. PaviaC–PaviaU Task

Following the above experiment, we then trained meta-optimizers on PaviaC and tested their performance on PaviaU. The averaged loss change curves are shown in Figure 4, and the final averaged convergence values and their standard deviations (Stds) are given in Table 9. In this scenario, similar experimental results were achieved. MOE-U achieved the fastest convergence speed and the lowest convergence value of 0.6665.
The performance comparison of the classification results is shown in Table 10. The OA and Kappa values of our MOE-U were 0.6151 and 0.5195, respectively, outperforming all human-designed optimizers and meta-optimizers. The AA value of our MOE-U was just 0.0033 lower than that of our MOE-A. The distribution of samples across land-cover classes of PaviaU was highly unbalanced. For instance, meadows (C2) had 18,649 labeled samples while shadows (C9) had 947 labeled samples. This imbalance made training and testing very difficult, e.g., the Kappa of SGD was just 0.1991. Even in the face of such land-cover class imbalance, the meta-optimizers performed better than the human-designed optimizers, demonstrating that the learned meta-knowledge can effectively help optimize the network. The best classification maps by various optimizers are displayed in Figure 5.

4.3.3. SalinasA–Salinas Task

In this task, we trained the meta-optimizers on SalinasA and tested their classification performance on Salinas. Since the SalinasA dataset contains only 6 categories, while the Salinas dataset contains 16 categories, this task can test the generalization ability of our method in dealing with many unseen land-cover classes. The averaged loss change curves are shown in Figure 6, and the final averaged convergence values and their Stds are given in Table 11. In this experiment, MOE-U converged significantly faster and had lower losses than other optimizers.
The performance comparison of the classification results is shown in Table 12, indicating that the Salinas dataset was so hard to classify with few training samples that all of the human-designed optimizers failed in this experiment. This might be attributed to the considerably large number of classes in the Salinas dataset and to samples from different classes being so similar that they were difficult to distinguish. Although the human-designed optimizers reduced the training loss, they made the predictive model severely overfit, resulting in the OA, AA, and Kappa all being near 0.2, an extremely poor result. However, the OA, AA, and Kappa values of our MOE-U reached 0.5333, 0.5516, and 0.4845, respectively—the best among all of the optimizers. Compared with the single meta-optimizer (LSTM optimizer), our MOE-U increased the OA, AA, and Kappa values by 0.0467, 0.0368, and 0.048, respectively. Compared with our MOE-A, our MOE-U increased these values by 0.0274, 0.0053, and 0.0233, respectively. These results indicate that our improvement of meta-optimizers can effectively enhance the generalization ability in dealing with many unseen classes. The best classification maps are displayed in Figure 7.

4.3.4. PaviaC–SalinasA Task

The PaviaC dataset and the SalinasA dataset were from different sensors, so they were completely different in terms of the number of bands, spectral range, spatial resolution, and land-cover types. Figure 8 shows the averaged loss change curves, and Table 13 shows the final convergence values. All of the meta-optimizers also performed better in convergence speed and the final convergence value. The averaged final convergence value of MOE-U was 0.2822, which was lower than all other meta-optimizers.
The classification results are shown in Table 14. All of the meta-optimizers had better classification results than the human-designed optimizers, and MOE-U achieved the best results in OA, AA, and Kappa again. This demonstrates that the learned knowledge can easily be generalized to HSI datasets from different sensors. Since the training dataset and test dataset in this task were from different sensors, these results effectively demonstrate the robustness of our methods. Finally, the best classification maps are displayed in Figure 9.

4.3.5. SalinasA–PaviaC Task

In this task, we trained meta-optimizers on SalinasA and tested their classification performance on PaviaC. This task can further test the robustness of our method in the scenario where the source dataset and the target dataset are from different sensors. Figure 10 shows the averaged loss change curves, and the final convergence values are shown in Table 15. Based on the experimental results, we drew a similar conclusion that all of the meta-optimizers performed better than the human-designed optimizers in terms of convergence speed and the final loss, and our MOE-U also achieved the lowest loss among the meta-optimizers. The classification performance results are shown in Table 16, demonstrating again that the meta-optimizers can generalize well to other HSI datasets from different sensors. Our proposed MOE-U achieved 0.8445, 0.8112, and 0.7890 in OA, AA, and Kappa, respectively, which also outperformed all other optimizers. The best classification maps on PaviaC are displayed in Figure 11, while the best classification maps by the human-designed optimizers are shown in Figure 3.

4.3.6. PaviaC–KSC Task

The PaviaC dataset and the KSC dataset were also from different sensors. Due to the significant difference in spatial resolution (1.3 m and 18 m, respectively), the meta-knowledge learned from one HSI dataset is very difficult to use in classifying the other. So, this task can further test the robustness of our proposed method. The averaged loss change curves are shown in Figure 12, and the final averaged convergence values and their Stds are reported in Table 17. In this scenario, our MOE-U achieved the fastest convergence speed and the lowest convergence value of 1.5106. The losses of the human-designed optimizers were all significantly higher than those of the meta-optimizers.
The classification results are shown in Table 18. All of the meta-optimizers had better results than the human-designed optimizers, and MOE-U also achieved the best results in OA, AA, and Kappa again, which were 0.5361, 0.4278, and 0.4860, respectively. Among all of the human-designed optimizers, the highest values were 0.2757, 0.2131, and 0.2003, respectively, indicating that human-designed optimizers caused the network to fall into serious overfitting when the training samples were insufficient, thus failing to make the correct classification. The best classification maps are displayed in Figure 13.

4.3.7. KSC–PaviaC Task

For the KSC–PaviaC task, the averaged loss change curves are shown in Figure 14, and the final averaged convergence values and their Stds are reported in Table 19. In this scenario, all of the meta-optimizers had similar convergence speeds and final convergence values, which were also significantly better than those of the human-designed optimizers. The classification results are shown in Table 20. Our MOE-U achieved the highest OA, AA, and Kappa of 0.9024, 0.8388, and 0.8652, respectively. The best classification maps by the meta-optimizers are displayed in Figure 15. The best classification maps by the human-designed optimizers are shown in Figure 3.

5. Discussion

Manually labeling HSI samples is not only time-consuming and laborious but also requires much professional expertise, so HSI classification always faces the challenge of few-shot learning. The main difficulty is to improve the generalization ability of few-shot classifiers and avoid overfitting, which traditional human-designed optimizers (e.g., Adam, AdaGrad, SGD) are widely recognized as being unable to achieve. As an advance of deep learning, meta-learning has become a powerful tool for dealing with the issue of few-shot learning. It typically runs fast, since only very limited samples are used to train the meta-learning model, leading to low computational costs and memory requirements. In this study, we focused on the N-way 10-shot learning setting, i.e., we randomly picked 10 samples from each of N land-cover classes to form the training set. We proposed an improvement of meta-optimizers through an ensemble of meta-optimizers and a novel update integration process. We performed periodic annealing on the learning rate during the meta-training process, which allowed us to build an ensemble of meta-optimizers without adding training costs. Incorporating the update integration process increased the computational cost of each iteration but brought a significant enhancement in classification accuracy.
In seven classification tasks on five real-world HSI datasets, we trained meta-optimizers on the source HSI dataset and tested their performance on different datasets from the same or different sensors. The real-world HSI classification experimental results showed that, with the help of the learned knowledge, the meta-optimizers outperformed the human-designed optimizers in terms of convergence speed, final convergence value, OA, AA, and Kappa. Moreover, our proposed MOE-U performed better than the single meta-optimizer (LSTM optimizer) and MOE-A, proving that an ensemble of meta-optimizers could achieve better results using our proposed update integration algorithm. Multiple meta-optimizers in an ensemble contain more parameters than a single meta-optimizer, thus containing more useful knowledge. When optimizing the predictive model, multiple meta-optimizers in an ensemble have a higher probability of generating good updates. The proposed update integration algorithm is more likely to choose good updates. Finally, our MOE-U achieved the best results in all real-world HSI classification experiments.

6. Conclusions

When lacking training samples, the widely used human-designed optimizers, such as Adam, AdaGrad, RMSprop, SGD, and SGD with momentum, can cause the model to fall into severe overfitting. To solve this issue, we present a meta-optimizer ensemble for HSI classification, which learns prior knowledge from a source HSI dataset and then uses the learned knowledge to train the predictive model on the target HSI dataset with limited training samples. By combining the advantages of ensemble learning and meta-learning, the meta-optimizer ensemble is improved in terms of overall performance and generalization ability. Moreover, we propose an effective update integration algorithm to incorporate the candidate updates generated by the meta-optimizer ensemble into the final update. The experimental results on multiple few-shot HSI classification tasks demonstrate the superiority and effectiveness of our proposed methods. Compared with the widely used human-designed optimizers, our meta-optimizer ensemble makes the predictive model converge faster on target HSI datasets while achieving better OA, AA, and Kappa coefficient results. Beyond HSI classification, our improvement of meta-optimizers has the potential to improve various few-shot learning models in broad fields, including numerical analysis, computation, medical imaging, industrial detection, and other applications.

Author Contributions

T.H. and Z.Z. are co-first authors. Conceptualization, Z.Z.; Methodology, T.H. and Z.Z.; Software, T.H.; Validation, T.H.; Formal analysis, Z.Z., T.H. and M.J.C.C.; Investigation, Z.Z.; Resources, T.H.; Writing—original draft, T.H. and Z.Z.; Writing—review & editing, Z.Z. and M.J.C.C. All authors have read and agreed to the published version of the manuscript.

Funding

The corresponding author was supported by the European Commission Horizon 2020 Framework Program No. 861584 and the Taishan Distinguished Professor Fund No. 20190910.

Data Availability Statement

The data presented in this study are available on request from the corresponding author due to privacy reasons.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Kumar, B.; Dikshit, O.; Gupta, A.; Singh, M.K. Feature extraction for hyperspectral image classification: A review. Int. J. Remote Sens. 2020, 41, 6248–6287. [Google Scholar] [CrossRef]
  2. Deng, B.; Jia, S.; Shi, D.M. Deep Metric Learning-Based Feature Embedding for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2020, 58, 1422–1435. [Google Scholar] [CrossRef]
  3. Li, S.T.; Song, W.W.; Fang, L.Y.; Chen, Y.S.; Ghamisi, P.; Benediktsson, J.A. Deep Learning for Hyperspectral Image Classification: An Overview. IEEE Trans. Geosci. Remote Sens. 2019, 57, 6690–6709. [Google Scholar] [CrossRef]
  4. Ahmad, M.; Shabbir, S.; Roy, S.K.; Hong, D.F.; Wu, X.; Yao, J.; Khan, A.M.; Mazzara, M.; Distefano, S.; Chanussot, J. Hyperspectral Image Classification-Traditional to Deep Models: A Survey for Future Prospects. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2022, 15, 968–999. [Google Scholar] [CrossRef]
  5. Wang, C.Y.; Liu, B.H.; Liu, L.P.; Zhu, Y.J.; Hou, J.L.; Liu, P.; Li, X. A review of deep learning used in the hyperspectral image analysis for agriculture. Artif. Intell. Rev. 2021, 54, 5205–5253. [Google Scholar] [CrossRef]
  6. Govender, M.; Chetty, K.; Bulcock, H. A review of hyperspectral remote sensing and its application in vegetation and water resource studies. Water SA 2007, 33, 145–151. [Google Scholar] [CrossRef]
  7. Zhang, Z.H.; Huisingh, D. Combating desertification in China: Monitoring, control, management and revegetation. J. Clean Prod. 2018, 182, 765–775. [Google Scholar] [CrossRef]
  8. Chutia, D.; Bhattacharyya, D.; Sarma, K.K.; Kalita, R.; Sudhakar, S. Hyperspectral remote sensing classifications: A perspective survey. Trans. GIS 2016, 20, 463–490. [Google Scholar] [CrossRef]
  9. Hong, D.F.; Yokoya, N.; Ge, N.; Chanussot, J.; Zhu, X.X. Learnable manifold alignment (LeMA): A semi-supervised cross-modality learning framework for land cover and land use classification. ISPRS-J. Photogramm. Remote Sens. 2019, 147, 193–205. [Google Scholar] [CrossRef]
  10. Mishra, N.B.; Crews, K.A. Mapping vegetation morphology types in a dry savanna ecosystem: Integrating hierarchical object-based image analysis with Random Forest. Int. J. Remote Sens. 2014, 35, 1175–1198. [Google Scholar] [CrossRef]
  11. Ghamisi, P.; Plaza, J.; Chen, Y.S.; Li, J.; Plaza, A. Advanced Spectral Classifiers for Hyperspectral Images A review. IEEE Geosci. Remote Sens. Mag. 2017, 5, 8–32. [Google Scholar] [CrossRef]
  12. Cheng, G.; Guo, L.; Zhao, T.Y.; Han, J.W.; Li, H.H.; Fang, J. Automatic landslide detection from remote-sensing imagery using a scene classification method based on BoVW and pLSA. Int. J. Remote Sens. 2013, 34, 45–59. [Google Scholar] [CrossRef]
  13. Martha, T.R.; Kerle, N.; van Westen, C.J.; Jetten, V.; Kumar, K.V. Segment Optimization and Data-Driven Thresholding for Knowledge-Based Landslide Detection by Object-Based Image Analysis. IEEE Trans. Geosci. Remote Sens. 2011, 49, 4928–4943. [Google Scholar] [CrossRef]
  14. Yang, J.X.; Zhao, Y.Q.; Chan, J.C.W. Learning and Transferring Deep Joint Spectral-Spatial Features for Hyperspectral Classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 4729–4742. [Google Scholar] [CrossRef]
  15. Zhang, L.F.; Zhang, L.P.; Tao, D.C.; Huang, X. On Combining Multiple Features for Hyperspectral Remote Sensing Image Classification. IEEE Trans. Geosci. Remote Sens. 2012, 50, 879–893. [Google Scholar] [CrossRef]
  16. Chen, Y.; Nasrabadi, N.M.; Tran, T.D. Hyperspectral Image Classification Using Dictionary-Based Sparse Representation. IEEE Trans. Geosci. Remote Sens. 2011, 49, 3973–3985. [Google Scholar] [CrossRef]
  17. Li, J.; Marpu, P.R.; Plaza, A.; Bioucas-Dias, J.M.; Benediktsson, J.A. Generalized Composite Kernel Framework for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2013, 51, 4816–4829. [Google Scholar] [CrossRef]
  18. Benediktsson, J.A.; Palmason, J.A.; Sveinsson, J.R. Classification of hyperspectral data from urban areas based on extended morphological profiles. IEEE Trans. Geosci. Remote Sens. 2005, 43, 480–491. [Google Scholar] [CrossRef]
  19. Jia, S.; Shen, L.L.; Li, Q.Q. Gabor Feature-Based Collaborative Representation for Hyperspectral Imagery Classification. IEEE Trans. Geosci. Remote Sens. 2015, 53, 1118–1129. [Google Scholar]
  20. Camps-Valls, G.; Bruzzone, L. Kernel-based methods for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2005, 43, 1351–1362. [Google Scholar] [CrossRef]
  21. Ham, J.; Chen, Y.C.; Crawford, M.M.; Ghosh, J. Investigation of the random forest framework for classification of hyperspectral data. IEEE Trans. Geosci. Remote Sens. 2005, 43, 492–501. [Google Scholar] [CrossRef]
  22. Licciardi, G.; Marpu, P.R.; Chanussot, J.; Benediktsson, J.A. Linear Versus Nonlinear PCA for the Classification of Hyperspectral Data Based on the Extended Morphological Profiles. IEEE Geosci. Remote Sens. Lett. 2012, 9, 447–451. [Google Scholar] [CrossRef]
  23. Villa, A.; Benediktsson, J.A.; Chanussot, J.; Jutten, C. Hyperspectral Image Classification With Independent Component Discriminant Analysis. IEEE Trans. Geosci. Remote Sens. 2011, 49, 4865–4876. [Google Scholar] [CrossRef]
  24. Bandos, T.V.; Bruzzone, L.; Camps-Valls, G. Classification of Hyperspectral Images with Regularized Linear Discriminant Analysis. IEEE Trans. Geosci. Remote Sens. 2009, 47, 862–873. [Google Scholar] [CrossRef]
Figure 1. The framework of our proposed method.
Figure 2. Loss changes on the PaviaC dataset.
Figure 3. Visual classification results on the PaviaC dataset: (a) False-color image. (b) Ground-truth map. Classification maps obtained by (c) Adam (0.9216), (d) AdaGrad (0.9069), (e) RMSprop (0.8965), (f) SGD (0.8253), (g) SGD with momentum (0.9291), (h) LSTM optimizer (0.9342), (i) MOE-A (0.9222), and (j) MOE-U (0.9408).
Figure 4. Loss changes on the PaviaU dataset.
Figure 5. Visual classification results on the PaviaU dataset: (a) False-color image. (b) Ground-truth map. Classification maps obtained by (c) Adam (0.6918), (d) AdaGrad (0.7050), (e) RMSprop (0.6417), (f) SGD (0.5290), (g) SGD with momentum (0.6475), (h) LSTM optimizer (0.6444), (i) MOE-A (0.6515), and (j) MOE-U (0.6583).
Figure 6. Loss changes on the Salinas dataset.
Figure 7. Visual classification results on the Salinas dataset: (a) False-color image. (b) Ground-truth map. Classification maps obtained by (c) Adam (0.3873), (d) AdaGrad (0.4571), (e) RMSprop (0.3873), (f) SGD (0.2727), (g) SGD with momentum (0.3885), (h) LSTM optimizer (0.5837), (i) MOE-A (0.6765), and (j) MOE-U (0.6236).
Figure 8. Loss changes on the SalinasA dataset.
Figure 9. Visual classification results on the SalinasA dataset: (a) False-color image. (b) Ground-truth map. Classification maps obtained by (c) Adam (0.7556), (d) AdaGrad (0.7467), (e) RMSprop (0.7609), (f) SGD (0.6605), (g) SGD with momentum (0.8708), (h) LSTM optimizer (0.9476), (i) MOE-A (0.9139), and (j) MOE-U (0.9277).
Figure 10. Loss changes on the PaviaC dataset.
Figure 11. Visual classification results on the PaviaC dataset: (a) LSTM optimizer (0.9407), (b) MOE-A (0.9060), and (c) MOE-U (0.9194).
Figure 12. Loss changes on the KSC dataset.
Figure 13. Visual classification results on the KSC dataset: (a) False-color image. (b) Ground-truth map. Classification maps obtained by (c) Adam (0.4481), (d) AdaGrad (0.3373), (e) RMSprop (0.5296), (f) SGD (0.2810), (g) SGD with momentum (0.4123), (h) LSTM optimizer (0.6173), (i) MOE-A (0.5878), and (j) MOE-U (0.6319).
Figure 14. Loss changes on the PaviaC dataset.
Figure 15. Visual classification results on the PaviaC dataset: (a) LSTM optimizer (0.9351), (b) MOE-A (0.9401), and (c) MOE-U (0.9423).
Table 1. Land-cover classes with the number of samples per class in the PaviaU dataset.
No. | Class | Number
C1 | Asphalt | 6631
C2 | Meadows | 18,649
C3 | Gravel | 2099
C4 | Trees | 3064
C5 | Painted metal sheets | 1345
C6 | Bare soil | 5029
C7 | Bitumen | 1330
C8 | Self-blocking bricks | 3682
C9 | Shadows | 947
Total | 42,776
Table 2. Land-cover classes with the number of samples per class in the PaviaC dataset.
No. | Class | Number
C1 | Water | 65,971
C2 | Trees | 7598
C3 | Asphalt | 3090
C4 | Self-blocking bricks | 2685
C5 | Bitumen | 6584
C6 | Tiles | 9248
C7 | Shadows | 7287
C8 | Meadows | 42,826
C9 | Bare soil | 2863
Total | 148,152
Table 3. Land-cover classes with the number of samples per class in the Salinas dataset.
No. | Class | Number
C1 | Brocoli_green_weeds_1 | 2009
C2 | Brocoli_green_weeds_2 | 3726
C3 | Fallow | 1976
C4 | Fallow_rough_plow | 1394
C5 | Fallow_smooth | 2678
C6 | Stubble | 3959
C7 | Celery | 3579
C8 | Grapes_untrained | 11,271
C9 | Soil_vinyard_develop | 6203
C10 | Corn_senesced_green_weeds | 3278
C11 | Lettuce_romaine_4wk | 1068
C12 | Lettuce_romaine_5wk | 1927
C13 | Lettuce_romaine_6wk | 916
C14 | Lettuce_romaine_7wk | 1070
C15 | Vinyard_untrained | 7268
C16 | Vinyard_vertical_trellis | 1807
Total | 54,129
Table 4. Land-cover classes with the number of samples per class in the SalinasA dataset.
No. | Class | Number
C1 | Brocoli_green_weeds_1 | 391
C2 | Corn_senesced_green_weeds | 1343
C3 | Lettuce_romaine_4wk | 616
C4 | Lettuce_romaine_5wk | 1525
C5 | Lettuce_romaine_6wk | 674
C6 | Lettuce_romaine_7wk | 799
Total | 5348
Table 5. Land-cover classes with the number of samples per class in the KSC dataset.
No. | Class | Number
C1 | Scrub | 761
C2 | Willow-swamp | 243
C3 | CP-hammock | 256
C4 | Slash-pine | 252
C5 | Oak-broadleaf | 161
C6 | Hardwood | 229
C7 | Swamp | 105
C8 | Graminoid-marsh | 431
C9 | Spartina-marsh | 520
C10 | Cattail-marsh | 404
C11 | Salt-marsh | 419
C12 | Mud-flats | 503
C13 | Water | 927
Total | 5211
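Tables 1–5 list the full labeled pools of the five scenes; in the few-shot setting only a handful of labeled pixels per class is drawn from these pools for training, and the remainder is reserved for testing. The snippet below is a minimal sketch of such a per-class split; the function name, the value k = 5, and the convention that label 0 marks unlabeled background pixels are illustrative assumptions rather than the exact protocol of our experiments.

```python
import numpy as np

def few_shot_split(labels, k=5, seed=0):
    """Per-class few-shot split: pick k labeled pixels per class for training
    and keep the rest for testing. `labels` is a 1-D array of class indices,
    with 0 conventionally marking unlabeled background pixels."""
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for c in np.unique(labels):
        if c == 0:  # skip unlabeled background
            continue
        idx = rng.permutation(np.flatnonzero(labels == c))
        train_idx.extend(idx[:k])
        test_idx.extend(idx[k:])
    return np.array(train_idx), np.array(test_idx)

# Example with a 2-D ground-truth map `gt`:
# train_idx, test_idx = few_shot_split(gt.ravel(), k=5)
```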
Table 6. Comparison of five real-world HSI datasets.
Property | PaviaU | PaviaC | Salinas | SalinasA | KSC
Pixel resolution | 610 × 340 | 1096 × 715 | 512 × 217 | 83 × 86 | 512 × 614
Labeled pixels | 42,776 | 148,152 | 54,129 | 5348 | 5211
Number of bands | 103 | 102 | 204 | 204 | 176
Spectral range (nm) | 430–860 | 430–860 | 400–2500 | 400–2500 | 400–2500
Sensor | ROSIS | ROSIS | AVIRIS | AVIRIS | AVIRIS
Number of classes | 9 | 9 | 16 | 6 | 13
Spatial resolution (m) | 1.3 | 1.3 | 3.7 | 3.7 | 18
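The scenes compared in Table 6 are commonly distributed as MATLAB .mat files holding the hyperspectral cube and the corresponding ground-truth map. The following sketch shows one plausible way to load and normalize such a scene; the file names and dictionary keys in the commented example follow the widely used public copies of the Pavia University scene and are assumptions, not part of our released code.

```python
import numpy as np
from scipy.io import loadmat

def load_hsi(cube_path, gt_path, cube_key, gt_key):
    """Load a hyperspectral cube (H x W x B) and its ground-truth map (H x W)."""
    cube = loadmat(cube_path)[cube_key].astype(np.float32)
    gt = loadmat(gt_path)[gt_key].astype(np.int64)
    # Per-band min-max normalization is a common preprocessing choice.
    band_min = cube.min(axis=(0, 1))
    band_max = cube.max(axis=(0, 1))
    cube = (cube - band_min) / (band_max - band_min + 1e-8)
    return cube, gt

# Assumed file names and keys for the Pavia University scene:
# cube, gt = load_hsi("PaviaU.mat", "PaviaU_gt.mat", "paviaU", "paviaU_gt")
```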
Table 7. The loss function values of different optimizers on the PaviaC dataset.
Optimizer | Final Convergence Value (Mean / Std)
Adam | 0.8023 / 0.2267
AdaGrad | 0.9133 / 0.1145
RMSprop | 0.8824 / 0.1323
SGD | 1.445 / 0.1436
SGD with momentum | 0.8019 / 0.2074
LSTM optimizer | 0.4367 / 0.1149
MOE-A | 0.3819 / 0.1135
MOE-U | 0.3786 / 0.1128
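Table 7 (and the analogous tables that follow) summarizes, for each optimizer, the mean and standard deviation of the final value reached by the training loss. A minimal sketch of this aggregation over repeated runs is given below; train_with_optimizer is a hypothetical placeholder for any of the compared optimization routines, and the default of ten runs is illustrative rather than the exact repetition count used in our experiments.

```python
import numpy as np

def final_loss_stats(train_with_optimizer, n_runs=10):
    """Repeat training n_runs times and summarize the final loss values.
    `train_with_optimizer` is assumed to return the loss curve of one run."""
    finals = [train_with_optimizer(seed=run)[-1] for run in range(n_runs)]
    return float(np.mean(finals)), float(np.std(finals))
```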
Table 8. Classification accuracy by different optimizers on the PaviaC dataset.
Class | Adam | AdaGrad | RMSprop | SGD | SGD with momentum | LSTM optimizer | MOE-A | MOE-U
(each cell: mean / std)
C1 | 0.9763 / 0.0368 | 0.8894 / 0.2945 | 0.9157 / 0.2223 | 0.6161 / 0.4633 | 0.9915 / 0.0074 | 0.9928 / 0.0025 | 0.9876 / 0.0102 | 0.9845 / 0.0119
C2 | 0.7692 / 0.2801 | 0.6898 / 0.3211 | 0.4484 / 0.3307 | 0.6452 / 0.4108 | 0.3957 / 0.4259 | 0.8055 / 0.0733 | 0.7905 / 0.0991 | 0.8146 / 0.0716
C3 | 0.6118 / 0.3800 | 0.6786 / 0.3367 | 0.8507 / 0.1639 | 0.3934 / 0.4704 | 0.6914 / 0.4464 | 0.8408 / 0.1377 | 0.8590 / 0.0879 | 0.8304 / 0.0940
C4 | 0.6409 / 0.3262 | 0.7002 / 0.2534 | 0.7315 / 0.1876 | 0.4538 / 0.4362 | 0.6988 / 0.2473 | 0.8142 / 0.1048 | 0.7548 / 0.2162 | 0.7986 / 0.0792
C5 | 0.2026 / 0.2708 | 0.2620 / 0.3278 | 0.3151 / 0.3434 | 0.3528 / 0.3785 | 0.4822 / 0.2206 | 0.4025 / 0.1930 | 0.5311 / 0.206 | 0.6089 / 0.1466
C6 | 0.8138 / 0.1262 | 0.3946 / 0.3704 | 0.7293 / 0.3589 | 0.4697 / 0.4290 | 0.6627 / 0.3159 | 0.8636 / 0.1185 | 0.8968 / 0.0783 | 0.8755 / 0.0885
C7 | 0.5769 / 0.3219 | 0.8887 / 0.0589 | 0.5799 / 0.3247 | 0.5457 / 0.4266 | 0.8368 / 0.0976 | 0.7998 / 0.0418 | 0.7365 / 0.1492 | 0.8057 / 0.0723
C8 | 0.7490 / 0.3631 | 0.6369 / 0.3284 | 0.6097 / 0.3932 | 0.3285 / 0.3831 | 0.4691 / 0.3907 | 0.7103 / 0.2944 | 0.7387 / 0.2296 | 0.7843 / 0.1785
C9 | 0.9942 / 0.0094 | 0.9896 / 0.0180 | 0.9943 / 0.0103 | 0.6075 / 0.4427 | 0.8024 / 0.3119 | 0.9979 / 0.0018 | 0.9925 / 0.0139 | 0.9892 / 0.015
OA | 0.8225 / 0.098 | 0.7415 / 0.1353 | 0.7452 / 0.1717 | 0.5024 / 0.2719 | 0.7440 / 0.1239 | 0.8514 / 0.0888 | 0.8604 / 0.0686 | 0.8791 / 0.0503
AA | 0.7039 / 0.0774 | 0.6811 / 0.0648 | 0.6860 / 0.0569 | 0.4903 / 0.0818 | 0.6701 / 0.0876 | 0.8030 / 0.0443 | 0.8097 / 0.0399 | 0.8324 / 0.017
Kappa | 0.7588 / 0.1219 | 0.6649 / 0.1504 | 0.6737 / 0.1798 | 0.4250 / 0.2518 | 0.6612 / 0.1538 | 0.7983 / 0.1119 | 0.8091 / 0.0873 | 0.8340 / 0.0649
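OA, AA, and Kappa in Table 8 and in the later accuracy tables denote overall accuracy, average per-class accuracy, and Cohen's kappa coefficient, all computed on the test pixels. The sketch below follows the standard definitions of these three metrics from a confusion matrix; it is an illustrative implementation rather than a copy of our evaluation code, and it assumes integer class labels in the range 0 to n_classes - 1.

```python
import numpy as np

def classification_metrics(y_true, y_pred, n_classes):
    """Overall accuracy (OA), average accuracy (AA), and Cohen's kappa
    computed from a confusion matrix C, where C[i, j] counts samples of
    true class i predicted as class j."""
    C = np.zeros((n_classes, n_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        C[t, p] += 1
    n = C.sum()
    oa = np.trace(C) / n
    aa = np.mean(np.diag(C) / np.maximum(C.sum(axis=1), 1))
    pe = (C.sum(axis=0) * C.sum(axis=1)).sum() / (n * n)  # chance agreement
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa
```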
Table 9. The convergence values of different optimizers on the PaviaU dataset.
Optimizer | Final Convergence Value (Mean / Std)
Adam | 1.048 / 0.1693
AdaGrad | 1.164 / 0.1103
RMSprop | 1.148 / 0.1414
SGD | 1.551 / 0.1254
SGD with momentum | 1.181 / 0.3798
LSTM optimizer | 0.7153 / 0.1213
MOE-A | 0.6686 / 0.1155
MOE-U | 0.6665 / 0.1627
Table 10. Classification accuracy by different optimizers on the PaviaU dataset.
Class | Adam | AdaGrad | RMSprop | SGD | SGD with momentum | LSTM optimizer | MOE-A | MOE-U
(each cell: mean / std)
C1 | 0.4130 / 0.2868 | 0.4614 / 0.3405 | 0.4577 / 0.3146 | 0.1844 / 0.3962 | 0.5825 / 0.3758 | 0.5907 / 0.1557 | 0.5571 / 0.1741 | 0.5626 / 0.1163
C2 | 0.4700 / 0.2504 | 0.5203 / 0.2812 | 0.5821 / 0.1823 | 0.2108 / 0.3238 | 0.5510 / 0.2571 | 0.5196 / 0.1929 | 0.5308 / 0.1537 | 0.5992 / 0.0603
C3 | 0.1538 / 0.3234 | 0.4175 / 0.4279 | 0.1015 / 0.232 | 0.3730 / 0.4199 | 0.2003 / 0.2535 | 0.4330 / 0.2587 | 0.4353 / 0.1865 | 0.2719 / 0.2183
C4 | 0.8614 / 0.2835 | 0.9641 / 0.0292 | 0.9664 / 0.0296 | 0.8609 / 0.2891 | 0.8836 / 0.1858 | 0.9354 / 0.0337 | 0.9279 / 0.0456 | 0.912 / 0.0553
C5 | 0.8933 / 0.2975 | 0.9920 / 0.0068 | 0.9886 / 0.0109 | 0.7851 / 0.3932 | 0.8929 / 0.2977 | 0.9903 / 0.0037 | 0.9910 / 0.0058 | 0.9928 / 0.0032
C6 | 0.5114 / 0.2727 | 0.3124 / 0.2140 | 0.2715 / 0.1919 | 0.1121 / 0.2019 | 0.3844 / 0.2295 | 0.4628 / 0.1810 | 0.4782 / 0.2236 | 0.4401 / 0.0901
C7 | 0.3456 / 0.3430 | 0.1363 / 0.3039 | 0.5056 / 0.4819 | 0.0994 / 0.2967 | 0.4331 / 0.4078 | 0.7783 / 0.2127 | 0.7705 / 0.1485 | 0.7816 / 0.2667
C8 | 0.4491 / 0.3440 | 0.3112 / 0.4121 | 0.3632 / 0.3790 | 0.2655 / 0.3466 | 0.3427 / 0.4246 | 0.4645 / 0.3680 | 0.5796 / 0.2303 | 0.6814 / 0.2203
C9 | 0.9966 / 0.0033 | 0.9982 / 0.0016 | 0.9987 / 0.0013 | 0.9992 / 0.0009 | 0.9959 / 0.0046 | 0.9997 / 0.0005 | 0.9976 / 0.0034 | 0.9967 / 0.0037
OA | 0.4978 / 0.1072 | 0.5089 / 0.1404 | 0.5311 / 0.0928 | 0.2864 / 0.1374 | 0.5419 / 0.0708 | 0.5782 / 0.0707 | 0.5889 / 0.0514 | 0.6151 / 0.0322
AA | 0.5660 / 0.0949 | 0.5682 / 0.0461 | 0.5817 / 0.0416 | 0.4323 / 0.082 | 0.5851 / 0.0775 | 0.6860 / 0.0233 | 0.6964 / 0.0291 | 0.6931 / 0.0439
Kappa | 0.3935 / 0.1066 | 0.4057 / 0.1279 | 0.4211 / 0.0928 | 0.1991 / 0.1007 | 0.4340 / 0.0694 | 0.4851 / 0.0670 | 0.4959 / 0.0500 | 0.5195 / 0.0379
Table 11. The convergence values of different optimizers on the Salinas dataset.
Optimizer | Final Convergence Value (Mean / Std)
Adam | 2.176 / 0.3160
AdaGrad | 2.155 / 0.1448
RMSprop | 2.198 / 0.1502
SGD | 2.463 / 0.1173
SGD with momentum | 2.053 / 0.2727
LSTM optimizer | 1.486 / 0.1639
MOE-A | 1.373 / 0.2306
MOE-U | 1.255 / 0.1442
Table 12. Classification accuracy by different optimizers on the Salinas dataset.
Class | Adam | AdaGrad | RMSprop | SGD | SGD with momentum | LSTM optimizer | MOE-A | MOE-U
(each cell: mean / std)
C1 | 0.3998 / 0.4896 | 0.7984 / 0.3992 | 0.7858 / 0.3942 | 0.4517 / 0.4647 | 0.799 / 0.3995 | 0.4478 / 0.4552 | 0.7725 / 0.3889 | 0.5675 / 0.4257
C2 | 0.2960 / 0.4515 | 0.3903 / 0.3859 | 0.2929 / 0.4237 | 0.3082 / 0.4516 | 0.0003 / 0.0008 | 0.3179 / 0.4285 | 0.3812 / 0.3146 | 0.5832 / 0.3635
C3 | 0.1000 / 0.300 | 0.0733 / 0.1718 | 0.1558 / 0.3188 | 0.2819 / 0.4320 | 0.2423 / 0.3990 | 0.2052 / 0.2581 | 0.4340 / 0.3508 | 0.1757 / 0.2333
C4 | 0.1024 / 0.2993 | 0.5722 / 0.4658 | 0.5042 / 0.4760 | 0.2453 / 0.3954 | 0.2101 / 0.3954 | 0.9667 / 0.0683 | 0.8356 / 0.2852 | 0.8961 / 0.2988
C5 | 0.2000 / 0.3999 | 0.0002 / 0.0006 | 0.1127 / 0.2975 | 0.0001 / 0.0001 | 0.1047 / 0.2943 | 0.5880 / 0.3833 | 0.3880 / 0.4655 | 0.5782 / 0.3781
C6 | 0.4679 / 0.4644 | 0.7653 / 0.3895 | 0.6639 / 0.4033 | 0.1272 / 0.2948 | 0.6957 / 0.4144 | 0.8979 / 0.1351 | 0.9567 / 0.0760 | 0.9552 / 0.0667
C7 | 0.2930 / 0.4334 | 0.0159 / 0.0322 | 0.2571 / 0.3946 | 0.1062 / 0.2972 | 0.3078 / 0.4498 | 0.6105 / 0.3853 | 0.7449 / 0.2658 | 0.4794 / 0.4007
C8 | 0.1994 / 0.2907 | 0.3709 / 0.3832 | 0.2009 / 0.3266 | 0.1182 / 0.2602 | 0.1373 / 0.2115 | 0.4620 / 0.3303 | 0.3487 / 0.3202 | 0.6070 / 0.2226
C9 | 0.1997 / 0.3995 | 0.3567 / 0.4430 | 0.1130 / 0.2923 | 0.3007 / 0.4568 | 0.209 / 0.3963 | 0.4713 / 0.4746 | 0.4205 / 0.4054 | 0.5791 / 0.4731
C10 | 0.0784 / 0.2118 | 0.0800 / 0.2124 | 0.0696 / 0.1875 | 0.0025 / 0.0076 | 0.0894 / 0.2032 | 0.1951 / 0.2343 | 0.3351 / 0.2793 | 0.2083 / 0.2136
C11 | 0.0000 / 0.0000 | 0.2007 / 0.3997 | 0.2981 / 0.4554 | 0.038 / 0.1122 | 0.0000 / 0.0000 | 0.4175 / 0.4287 | 0.3186 / 0.379 | 0.4751 / 0.4109
C12 | 0.1996 / 0.3992 | 0.1620 / 0.3347 | 0.2648 / 0.4075 | 0.1000 / 0.3000 | 0.1778 / 0.359 | 0.1707 / 0.3027 | 0.3013 / 0.2653 | 0.2410 / 0.3297
C13 | 0.2746 / 0.383 | 0.3350 / 0.3999 | 0.1166 / 0.2121 | 0.0491 / 0.1474 | 0.3028 / 0.3495 | 0.8168 / 0.2841 | 0.8245 / 0.2843 | 0.8762 / 0.1684
C14 | 0.1574 / 0.2491 | 0.5389 / 0.4228 | 0.1794 / 0.3122 | 0.098 / 0.1477 | 0.3713 / 0.3786 | 0.6726 / 0.2499 | 0.5209 / 0.3732 | 0.6700 / 0.2840
C15 | 0.2089 / 0.2808 | 0.0598 / 0.1694 | 0.2384 / 0.3318 | 0.2729 / 0.4001 | 0.1034 / 0.2181 | 0.4313 / 0.3008 | 0.5338 / 0.3236 | 0.2958 / 0.1900
C16 | 0.2174 / 0.3049 | 0.1571 / 0.3069 | 0.1349 / 0.1436 | 0.0174 / 0.0245 | 0.1951 / 0.2512 | 0.5661 / 0.2205 | 0.6242 / 0.1530 | 0.6379 / 0.1198
OA | 0.2243 / 0.1135 | 0.2932 / 0.0953 | 0.2555 / 0.0811 | 0.1744 / 0.0674 | 0.2175 / 0.0839 | 0.4863 / 0.0699 | 0.5056 / 0.0968 | 0.5333 / 0.0525
AA | 0.2122 / 0.0786 | 0.3048 / 0.0675 | 0.2743 / 0.0508 | 0.1573 / 0.0429 | 0.2466 / 0.0912 | 0.5148 / 0.0587 | 0.5463 / 0.0895 | 0.5516 / 0.0469
Kappa | 0.1678 / 0.1024 | 0.2403 / 0.0857 | 0.2010 / 0.0744 | 0.1118 / 0.0577 | 0.1679 / 0.0810 | 0.4365 / 0.0721 | 0.4612 / 0.0997 | 0.4845 / 0.0536
Table 13. The convergence values of different optimizers on the SalinasA dataset.
Optimizer | Final Convergence Value (Mean / Std)
Adam | 0.8340 / 0.2434
AdaGrad | 0.8905 / 0.1416
RMSprop | 0.9189 / 0.1337
SGD | 1.145 / 0.1298
SGD with momentum | 0.8315 / 0.1449
LSTM optimizer | 0.3872 / 0.1678
MOE-A | 0.4187 / 0.1039
MOE-U | 0.2822 / 0.1047
Table 14. Classification accuracy of the predictive model optimized by different optimizers on the SalinasA dataset.
Class | Adam | AdaGrad | RMSprop | SGD | SGD with momentum | LSTM optimizer | MOE-A | MOE-U
(each cell: mean / std)
C1 | 0.9951 / 0.0008 | 0.9951 / 0.0008 | 0.9959 / 0.0013 | 0.9954 / 0.0010 | 0.9959 / 0.0017 | 0.9951 / 0.0008 | 0.9949 / 0.0000 | 0.9949 / 0.0000
C2 | 0.085 / 0.1102 | 0.2235 / 0.1997 | 0.1377 / 0.1203 | 0.2137 / 0.2628 | 0.2051 / 0.1980 | 0.4778 / 0.2305 | 0.4640 / 0.1864 | 0.6827 / 0.1349
C3 | 0.1758 / 0.3139 | 0.6216 / 0.4067 | 0.6927 / 0.3677 | 0.5305 / 0.4759 | 0.7713 / 0.3098 | 0.8461 / 0.1602 | 0.8737 / 0.1004 | 0.6930 / 0.1686
C4 | 0.7527 / 0.3973 | 0.5201 / 0.4235 | 0.5530 / 0.3958 | 0.3605 / 0.4408 | 0.8736 / 0.1868 | 0.7308 / 0.3518 | 0.8089 / 0.2486 | 0.8912 / 0.1300
C5 | 0.7936 / 0.3969 | 0.8522 / 0.2945 | 0.6625 / 0.4158 | 0.4543 / 0.4529 | 0.8742 / 0.2588 | 0.9950 / 0.0035 | 0.9964 / 0.0022 | 0.9947 / 0.0033
C6 | 0.9099 / 0.1317 | 0.9627 / 0.0316 | 0.9584 / 0.0535 | 0.9369 / 0.0828 | 0.9446 / 0.0768 | 0.9637 / 0.0239 | 0.9594 / 0.0285 | 0.9625 / 0.0309
OA | 0.5649 / 0.1504 | 0.600 / 0.1066 | 0.5716 / 0.1252 | 0.4876 / 0.0843 | 0.7136 / 0.1014 | 0.7680 / 0.1355 | 0.7895 / 0.0989 | 0.8473 / 0.067
AA | 0.6187 / 0.1301 | 0.6959 / 0.0772 | 0.6667 / 0.0944 | 0.5819 / 0.0609 | 0.7774 / 0.0981 | 0.8348 / 0.0899 | 0.8496 / 0.0582 | 0.8698 / 0.0558
Kappa | 0.4662 / 0.1689 | 0.5170 / 0.1206 | 0.4882 / 0.1334 | 0.3869 / 0.0799 | 0.6493 / 0.1266 | 0.7197 / 0.1579 | 0.7434 / 0.1156 | 0.8104 / 0.0830
Table 15. The convergence values of different optimizers on the PaviaC dataset.
Optimizer | Final Convergence Value (Mean / Std)
Adam | 0.8023 / 0.2267
AdaGrad | 0.9133 / 0.1145
RMSprop | 0.8824 / 0.1323
SGD | 1.445 / 0.1436
SGD with momentum | 0.8019 / 0.2074
LSTM optimizer | 0.4803 / 0.1362
MOE-A | 0.4924 / 0.1215
MOE-U | 0.4246 / 0.1088
Table 16. Classification accuracy by different optimizers on the PaviaC dataset.
Class | Adam | AdaGrad | RMSprop | SGD | SGD with momentum | LSTM optimizer | MOE-A | MOE-U
(each cell: mean / std)
C1 | 0.9763 / 0.0368 | 0.8894 / 0.2945 | 0.9157 / 0.2223 | 0.6161 / 0.4633 | 0.9915 / 0.0074 | 0.9902 / 0.0055 | 0.9839 / 0.0097 | 0.9802 / 0.0173
C2 | 0.7692 / 0.2801 | 0.6898 / 0.3211 | 0.4484 / 0.3307 | 0.6452 / 0.4108 | 0.3957 / 0.4259 | 0.7749 / 0.0816 | 0.7791 / 0.1418 | 0.7862 / 0.0875
C3 | 0.6118 / 0.3800 | 0.6786 / 0.3367 | 0.8507 / 0.1639 | 0.3934 / 0.4704 | 0.6914 / 0.4464 | 0.8847 / 0.0920 | 0.7278 / 0.2782 | 0.8709 / 0.0699
C4 | 0.6409 / 0.3262 | 0.7002 / 0.2534 | 0.7315 / 0.1876 | 0.4538 / 0.4362 | 0.6988 / 0.2473 | 0.7586 / 0.2077 | 0.7091 / 0.2624 | 0.7670 / 0.1321
C5 | 0.2026 / 0.2708 | 0.2620 / 0.3278 | 0.3151 / 0.3434 | 0.3528 / 0.3785 | 0.4822 / 0.2206 | 0.5737 / 0.209 | 0.5641 / 0.2197 | 0.5539 / 0.1709
C6 | 0.8138 / 0.1262 | 0.3946 / 0.3704 | 0.7293 / 0.3589 | 0.4697 / 0.429 | 0.6627 / 0.3159 | 0.8275 / 0.2294 | 0.8179 / 0.1327 | 0.8815 / 0.1079
C7 | 0.5769 / 0.3219 | 0.8887 / 0.0589 | 0.5799 / 0.3247 | 0.5457 / 0.4266 | 0.8368 / 0.0976 | 0.7812 / 0.0649 | 0.7554 / 0.1324 | 0.7826 / 0.0747
C8 | 0.7490 / 0.3631 | 0.6369 / 0.3284 | 0.6097 / 0.3932 | 0.3285 / 0.3831 | 0.4691 / 0.3907 | 0.6506 / 0.2739 | 0.6435 / 0.306 | 0.6863 / 0.2401
C9 | 0.9942 / 0.0094 | 0.9896 / 0.0180 | 0.9943 / 0.0103 | 0.6075 / 0.4427 | 0.8024 / 0.3119 | 0.9959 / 0.0073 | 0.9886 / 0.017 | 0.9919 / 0.0131
OA | 0.8225 / 0.098 | 0.7415 / 0.1353 | 0.7452 / 0.1717 | 0.5024 / 0.2719 | 0.7440 / 0.1239 | 0.8358 / 0.0859 | 0.8245 / 0.0927 | 0.8445 / 0.0679
AA | 0.7039 / 0.0774 | 0.6811 / 0.0648 | 0.6860 / 0.0569 | 0.4903 / 0.0818 | 0.6701 / 0.0876 | 0.8042 / 0.0584 | 0.7744 / 0.0593 | 0.8112 / 0.0286
Kappa | 0.7588 / 0.1219 | 0.6649 / 0.1504 | 0.6737 / 0.1798 | 0.425 / 0.2518 | 0.6612 / 0.1538 | 0.7781 / 0.1101 | 0.7638 / 0.1163 | 0.7890 / 0.0861
Table 17. The convergence values of different optimizers on the KSC dataset.
Optimizer | Final Convergence Value (Mean / Std)
Adam | 2.1805 / 0.2329
AdaGrad | 2.2995 / 0.1549
RMSprop | 2.1549 / 0.2984
SGD | 2.7160 / 0.6359
SGD with momentum | 2.4010 / 0.2103
LSTM optimizer | 1.6793 / 0.3079
MOE-A | 1.6002 / 0.3768
MOE-U | 1.5106 / 0.2350
Table 18. Classification accuracy by different optimizers on the KSC dataset.
Class | Adam | AdaGrad | RMSprop | SGD | SGD with momentum | LSTM optimizer | MOE-A | MOE-U
(each cell: mean / std)
C1 | 0.2771 / 0.4155 | 0.1000 / 0.3000 | 0.1760 / 0.3559 | 0.0000 / 0.0000 | 0.0995 / 0.2984 | 0.5811 / 0.3668 | 0.4622 / 0.3566 | 0.5882 / 0.3479
C2 | 0.1975 / 0.3951 | 0.2033 / 0.3985 | 0.1449 / 0.3143 | 0.1922 / 0.3848 | 0.0000 / 0.0000 | 0.2428 / 0.3188 | 0.3082 / 0.2597 | 0.4576 / 0.2739
C3 | 0.0984 / 0.2953 | 0.2984 / 0.4559 | 0.0000 / 0.0000 | 0.1008 / 0.2997 | 0.0992 / 0.2977 | 0.0910 / 0.2488 | 0.0184 / 0.0417 | 0.2148 / 0.3460
C4 | 0.0135 / 0.0405 | 0.1722 / 0.3485 | 0.0984 / 0.2939 | 0.0000 / 0.0000 | 0.3889 / 0.4764 | 0.0032 / 0.0095 | 0.1302 / 0.2591 | 0.1087 / 0.2182
C5 | 0.1665 / 0.3395 | 0.0000 / 0.0000 | 0.0770 / 0.2311 | 0.2000 / 0.4000 | 0.0000 / 0.0000 | 0.0193 / 0.0557 | 0.6211 / 0.3739 | 0.1627 / 0.2360
C6 | 0.0825 / 0.2476 | 0.1000 / 0.3000 | 0.3803 / 0.4668 | 0.0983 / 0.2948 | 0.2000 / 0.4000 | 0.2450 / 0.3498 | 0.0694 / 0.1136 | 0.1843 / 0.2201
C7 | 0.0000 / 0.0000 | 0.0000 / 0.0000 | 0.1981 / 0.3962 | 0.1000 / 0.3000 | 0.0000 / 0.0000 | 0.3086 / 0.3883 | 0.1914 / 0.2573 | 0.4114 / 0.4295
C8 | 0.2044 / 0.3052 | 0.1090 / 0.2673 | 0.0007 / 0.0021 | 0.0007 / 0.0021 | 0.0501 / 0.1503 | 0.1387 / 0.1583 | 0.2323 / 0.1860 | 0.2056 / 0.2708
C9 | 0.1238 / 0.2993 | 0.1096 / 0.2982 | 0.4415 / 0.4497 | 0.1000 / 0.3000 | 0.0994 / 0.2983 | 0.5406 / 0.3132 | 0.5533 / 0.2615 | 0.4479 / 0.3298
C10 | 0.0106 / 0.0319 | 0.2599 / 0.3938 | 0.0983 / 0.2883 | 0.0000 / 0.0000 | 0.0027 / 0.0057 | 0.2938 / 0.3036 | 0.3374 / 0.2648 | 0.3443 / 0.2510
C11 | 0.2477 / 0.3895 | 0.0482 / 0.1438 | 0.0427 / 0.1282 | 0.1000 / 0.3000 | 0.1000 / 0.3000 | 0.4687 / 0.4338 | 0.6418 / 0.3389 | 0.7888 / 0.2629
C12 | 0.5750 / 0.3728 | 0.1861 / 0.2737 | 0.5153 / 0.3863 | 0.1026 / 0.2985 | 0.3972 / 0.4306 | 0.4676 / 0.2653 | 0.6463 / 0.2487 | 0.6531 / 0.2041
C13 | 0.5975 / 0.4879 | 0.7931 / 0.3966 | 0.5977 / 0.4881 | 0.5995 / 0.4895 | 0.5949 / 0.4845 | 0.9965 / 0.0039 | 0.9954 / 0.0062 | 0.9937 / 0.0075
OA | 0.2757 / 0.0990 | 0.2545 / 0.0850 | 0.2715 / 0.1203 | 0.1610 / 0.0842 | 0.2135 / 0.1032 | 0.4667 / 0.0716 | 0.5068 / 0.0705 | 0.5361 / 0.0883
AA | 0.1996 / 0.0364 | 0.1831 / 0.0413 | 0.2131 / 0.0781 | 0.1226 / 0.0384 | 0.1563 / 0.0449 | 0.3382 / 0.051 | 0.4006 / 0.0717 | 0.4278 / 0.0592
Kappa | 0.1974 / 0.0912 | 0.1803 / 0.0769 | 0.2003 / 0.1222 | 0.0851 / 0.0723 | 0.1402 / 0.1002 | 0.4049 / 0.0733 | 0.4555 / 0.0763 | 0.4860 / 0.094
Table 19. The convergence values of different optimizers on the PaviaC dataset.
Optimizer | Final Convergence Value (Mean / Std)
Adam | 0.7734 / 0.1765
AdaGrad | 0.9039 / 0.0941
RMSprop | 0.9133 / 0.1145
SGD | 1.4272 / 0.1492
SGD with momentum | 0.7634 / 0.2265
LSTM optimizer | 0.3463 / 0.1168
MOE-A | 0.3941 / 0.1040
MOE-U | 0.3710 / 0.1002
Table 20. Classification accuracy by different optimizers on the PaviaC dataset.
Class | Adam | AdaGrad | RMSprop | SGD | SGD with momentum | LSTM optimizer | MOE-A | MOE-U
(each cell: mean / std)
C1 | 0.9763 / 0.0368 | 0.8894 / 0.2945 | 0.9157 / 0.2223 | 0.6161 / 0.4633 | 0.9915 / 0.0074 | 0.9941 / 0.0029 | 0.9862 / 0.0121 | 0.9890 / 0.0075
C2 | 0.7692 / 0.2801 | 0.6898 / 0.3211 | 0.4484 / 0.3307 | 0.6452 / 0.4108 | 0.3957 / 0.4259 | 0.7859 / 0.0708 | 0.7921 / 0.1134 | 0.8235 / 0.0661
C3 | 0.6118 / 0.3800 | 0.6786 / 0.3367 | 0.8507 / 0.1639 | 0.3934 / 0.4704 | 0.6914 / 0.4464 | 0.8819 / 0.0924 | 0.8191 / 0.1732 | 0.8270 / 0.1230
C4 | 0.6409 / 0.3262 | 0.7002 / 0.2534 | 0.7315 / 0.1876 | 0.4538 / 0.4362 | 0.6988 / 0.2473 | 0.8228 / 0.1017 | 0.8124 / 0.0550 | 0.8160 / 0.0768
C5 | 0.2026 / 0.2708 | 0.2620 / 0.3278 | 0.3151 / 0.3434 | 0.3528 / 0.3785 | 0.4822 / 0.2206 | 0.5186 / 0.1952 | 0.5533 / 0.1382 | 0.5514 / 0.2171
C6 | 0.8138 / 0.1262 | 0.3946 / 0.3704 | 0.7293 / 0.3589 | 0.4697 / 0.429 | 0.6627 / 0.3159 | 0.8874 / 0.0449 | 0.8890 / 0.0981 | 0.8554 / 0.1217
C7 | 0.5769 / 0.3219 | 0.8887 / 0.0589 | 0.5799 / 0.3247 | 0.5457 / 0.4266 | 0.8368 / 0.0976 | 0.8168 / 0.0376 | 0.8208 / 0.0260 | 0.8238 / 0.0404
C8 | 0.7490 / 0.3631 | 0.6369 / 0.3284 | 0.6097 / 0.3932 | 0.3285 / 0.3831 | 0.4691 / 0.3907 | 0.7373 / 0.2658 | 0.7678 / 0.2266 | 0.8648 / 0.2156
C9 | 0.9942 / 0.0094 | 0.9896 / 0.0180 | 0.9943 / 0.0103 | 0.6075 / 0.4427 | 0.8024 / 0.3119 | 0.9990 / 0.0013 | 0.9916 / 0.0117 | 0.9982 / 0.0041
OA | 0.8225 / 0.098 | 0.7415 / 0.1353 | 0.7452 / 0.1717 | 0.5024 / 0.2719 | 0.7440 / 0.1239 | 0.8673 / 0.0754 | 0.8732 / 0.0680 | 0.9024 / 0.0590
AA | 0.7039 / 0.0774 | 0.6811 / 0.0648 | 0.6860 / 0.0569 | 0.4903 / 0.0818 | 0.6701 / 0.0876 | 0.8271 / 0.0276 | 0.8258 / 0.0261 | 0.8388 / 0.0267
Kappa | 0.7588 / 0.1219 | 0.6649 / 0.1504 | 0.6737 / 0.1798 | 0.425 / 0.2518 | 0.6612 / 0.1538 | 0.8193 / 0.0962 | 0.8268 / 0.0872 | 0.8652 / 0.0753