SEN: A Novel Feature Normalization Dissimilarity Measure for Prototypical Few-Shot Learning Networks

Van Nhan Nguyen1,2, Sigurd Løkse1, Kristoffer Wickstrøm1, Michael Kampffmeyer1, Davide Roverso2, and Robert Jenssen1

1 UiT Machine Learning Group, UiT The Arctic University of Norway
{sigurd.lokse,kristoffer.k.wickstrom,michael.c.kampffmeyer,robert.jenssen}@uit.no
2 Analytics Department, eSmart Systems, 1783 Halden, Norway
{nhan.v.nguyen,Davide.Roverso}@esmartsystems.com

Abstract. In this paper, we equip Prototypical Networks (PNs) with a novel dissimilarity measure to enable discriminative feature normalization for few-shot learning. The embedding onto the hypersphere requires no direct normalization and is easy to optimize. Our theoretical analysis shows that the proposed dissimilarity measure, denoted the Squared root of the Euclidean distance and the Norm distance (SEN), forces embedding points to be attracted to their correct prototype, while being repelled from all other prototypes, keeping the norm of all points the same. The resulting SEN PN outperforms the regular PN by a considerable margin, with no additional parameters and with negligible computational overhead.

1 Introduction

Few-shot classification [8, 23, 19, 17, 6, 20] aims at adapting a classifier to previously unseen classes from just a handful of labeled examples per class. In the past few years, many approaches to few-shot classification have been proposed. These approaches can be roughly categorized as (i) learning to fine-tune approaches [6, 17]; (ii) sequence-based approaches [1, 13]; (iii) generative modeling-based approaches [29, 26]; (iv) (deep) distance metric learning-based approaches [19, 22, 27, 20]; and (v) semi-supervised approaches [3, 16]. Among these categories, distance metric learning-based approaches are typically preferred because of their simplicity and effectiveness.
The basic idea of these approaches, of which the so-called Prototypical Networks (PNs) [19] are the best-known example, is to learn a non-linear mapping of the input into an embedding space, which is commonly high-dimensional. In this space, a distance metric is defined under which similar examples lie close to each other, while dissimilar examples are mapped to distant locations, so that a query example can be classified by, for example, nearest neighbor methods. Arguably one of the most commonly used distance metrics in this high-dimensional embedding space is the (squared) Euclidean distance combined with a softmax function [19, 27, 3, 16].

However, even though the softmax is known to work well for closed-set classification problems, it has been shown to not be discriminative enough in problems where there are few labels relative to the number of classes [4, 15]. This has given rise to alternative loss formulations with improved discriminative ability, where high-dimensional features are normalized explicitly to lie on a hypersphere via direct L2 normalization [4, 15, 24]. The advantage of normalization has been theoretically analyzed in [30]. However, direct L2 normalization leads to a non-convex loss formulation, which typically results in local minima generated by the loss function itself [30].

With the aim of performing soft feature normalization while preserving the convexity and the simplicity of the loss function, we equip PNs with a novel dissimilarity measure particularly suited to enable discriminative feature normalization for few-shot learning, without any direct normalization.
The proposed dissimilarity measure, denoted the Squared root of the Euclidean distance and the Norm distance (SEN), replaces the Euclidean distance in PN training, with major consequences. Our theoretical analysis shows that the proposed measure explicitly forces embedded points to be attracted to the correct prototype and repelled from incorrect prototypes. Further, we provide analysis showing that SEN explicitly forces all embeddings to have the same norm during training, which enables the resulting SEN PN to generate a more robust embedding space. With this minimal but important modification, the SEN PN outperforms the original PN by a considerable margin and demonstrates good performance on the Mini-Imagenet [17, 23], the Fewshot-CIFAR100 (FC100) [14], and the Omniglot [9] datasets with no additional parameters and negligible computational overhead (a comparison of inference time is provided in the supplementary material). We furthermore show experimentally that the proposed SEN dissimilarity measure consistently outperforms the Euclidean distance in PNs with different embedding sizes as well as with different embedding networks.

2 Related Work

The literature on few-shot learning is vast; in this section we present a short summary of well-known approaches and of the works most relevant to our proposed approach. We refer the reader to [25] and [21] for more detailed reviews of few-shot learning.

Few-shot learning approaches can be categorized into (i) learning to fine-tune approaches; (ii) sequence-based approaches; (iii) generative modeling-based approaches; (iv) (deep) distance metric learning-based approaches; and (v) semi-supervised approaches. Learning to fine-tune approaches aim at learning a model's initial parameters such that it can be quickly adapted to a new task through only one or a few gradient update steps [6, 17].
These approaches can typically handle many model representations; however, they suffer from the need to fine-tune on the target problem, which makes them less appealing for few-shot learning. Sequence-based approaches formalize few-shot learning as a sequence-to-sequence problem and leverage Recurrent Neural Networks (RNNs) with memories to address the problem [1, 13]. While appealing, these methods typically require complex RNN architectures and complicated mechanisms for storing and retrieving all the historical information of relevance, both long-term and short-term, without forgetting [20]. Generative modeling-based approaches employ adversarial training to produce additional signals or training examples that allow the classification algorithm to learn a better classifier [29, 26]. Deep distance metric learning-based approaches aim at eliminating the need for manually choosing the right distance metric (e.g., the Euclidean distance or the cosine distance) by learning not only a deep embedding network but also a deep non-linear metric (similarity function) for comparing images in the embedding space [20]. Although deep distance metric learning-based approaches avoid the need for manually choosing the right distance metric, they are prone to overfitting and are more difficult to train than distance metric learning-based approaches because of the added parameters. Semi-supervised approaches utilize unlabeled data to improve few-shot learning accuracy. This is typically achieved by casting the semi-supervised few-shot learning problem as a semi-supervised clustering problem and addressing it by applying, for example, k-means clustering algorithms [3, 16].

We build on the distance metric learning line of work due to its simplicity and effectiveness.
Metric learning-based approaches aim to learn a non-linear mapping of the input into an embedding space and to define a distance metric under which similar examples lie close together and dissimilar ones far apart, so that a query example can be easily classified by, for example, nearest neighbor methods. Some notable approaches in this line of work include Koch et al. [8], who propose to learn siamese neural networks for computing the pair-wise distance between samples. The learned distance is then used by a nearest neighbor classifier for solving the one-shot learning problem. Vinyals et al. [23] define an end-to-end differentiable nearest neighbor classifier, called matching networks, based on the cosine similarity between the support set and the query example. Snell et al. [19] propose a simple method called prototypical networks for few-shot learning, based on the assumption that there exists an embedding space in which samples from each class cluster around a single prototype representation, which is simply the mean of the individual samples. Garcia and Bruna [22] argue that few-shot learning, which aims at propagating label information from labeled support examples towards unlabeled query images, can be formalized as posterior inference over a graphical model determined by the images and labels in the support set and the query set. The authors cast posterior inference as message passing on graph neural networks and propose a graph-based model, which can be trained end-to-end, to solve the task. Wang et al. [27] propose to improve the generalization capacity of metric-based methods for few-shot learning by enforcing a large margin between the class centers. This is achieved by adding a large margin loss function, which is the unnormalized triplet loss [18], to the standard softmax loss function for classification.

3 Few-shot Learning

In this section, we begin by detailing the general few-shot learning task.
Next, we introduce PNs and the Euclidean distance function, with special attention paid to its existing challenges. Then, we describe our proposed SEN dissimilarity measure and our SEN PN model. Finally, we provide analyses of the gradient of the SEN PN's loss function and of the behavior of the proposed SEN dissimilarity measure during training.

3.1 Task Description

In the traditional machine learning setting, we are typically given a dataset $D$. This dataset is usually split into two parts: $D_{train}$ and $D_{test}$. The former is used for training the parameters $\theta$ of the model, while the latter is used for evaluating its generalization. In general few-shot learning, we are dealing with meta-datasets $D_{meta}$ containing multiple regular datasets $D$ [17]. Each dataset $D \in D_{meta}$ has a split into $D_{train}$ and $D_{test}$; however, these are usually much smaller than those of the regular datasets used in the traditional machine learning setting.

Let $C = \{1, \dots, K\}$ be the set of all classes available in $D_{meta}$. The set $C$ is usually split into two disjoint sets: $C_{train}$ containing training classes and $C_{test}$ containing unseen classes for testing, i.e., $C_{train} \cap C_{test} = \emptyset$. The meta-dataset $D_{meta}$ is often split into two parts. The first is a meta training set $D_{meta-train} = \{(x_i, y_i)\}_{i=1}^{N}$, where $x_i$ is the feature vector of the $i$-th example, $y_i \in C_{train}$ is its corresponding label, and $N$ is the number of training examples. The second part is a meta testing set $D_{meta-test}$.

In a standard M-way K-shot classification task, the meta testing set $D_{meta-test}$ consists of a support set and a query set. The support set $S = \{(x_j, y_j)\}_{j=1}^{N_S}$ contains K examples from each of the M classes from $C_{test}$, i.e., the number of support examples is $N_S = M \times K$ and $y_j \in C_{test}$. The query set $Q = \{x_j\}_{j=N_S+1}^{N_S+N_Q}$ contains $N_Q$ unlabeled examples. The support set is employed by the model for learning the new task, while the query set is used for evaluating its performance.
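The M-way K-shot split described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' data pipeline: the function name `sample_episode`, the per-class query count `n_query`, and the toy label list are all assumptions introduced here.

```python
import random
from collections import defaultdict


def sample_episode(labels, m_way, k_shot, n_query, rng):
    """Sample one M-way K-shot episode (hypothetical helper).

    labels  : list with the class label of every available example
    n_query : number of query examples drawn per class (an assumption;
              the text only fixes the total number N_Q)
    Returns two lists of (example_index, class_label) pairs.
    """
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    # Pick M classes, then K support and n_query query examples per class.
    classes = rng.sample(sorted(by_class), m_way)
    support, query = [], []
    for c in classes:
        picked = rng.sample(by_class[c], k_shot + n_query)
        support += [(i, c) for i in picked[:k_shot]]
        query += [(i, c) for i in picked[k_shot:]]
    return support, query


# Toy data: 10 classes with 20 examples each.
labels = [c for c in range(10) for _ in range(20)]
support, query = sample_episode(labels, m_way=5, k_shot=5, n_query=3,
                                rng=random.Random(0))
print(len(support), len(query))  # N_S = M * K = 25 support, 15 query examples
```

Support and query indices are drawn without replacement from the same shuffled pick, so the two sets are disjoint by construction.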
3.2 Prototypical Networks

Prototypical networks learn a non-linear embedding function $f_\phi : \mathbb{R}^D \to \mathbb{R}^E$, parameterized by $\phi$, that maps a D-dimensional feature vector of an example $x_i$ to an E-dimensional embedding $z_i = f_\phi(x_i)$ [19]. In meta-testing, the embedding function $f_\phi$ is employed for mapping examples in the support set $S = \{(x_j, y_j)\}_{j=1}^{N_S}$ into the embedding space. An E-dimensional representation $c_k$, or prototype, of each class is computed by taking the mean of the embedded support points belonging to the class:

\[ c_k = \frac{1}{|S_k|} \sum_{(x_i, y_i) \in S_k} f_\phi(x_i) = \frac{1}{|S_k|} \sum_{(x_i, y_i) \in S_k} z_i, \tag{1} \]

where $S_k$ is the support set of class $k$. An embedded query point $x_q$ is then classified by simply finding the nearest class prototype in the embedding space.

To train PNs, the episodic training strategy proposed in [23, 17] is adopted. In particular, to train a PN for the M-way, K-shot classification task, a training episode is formed from the meta training set $D_{meta-train}$ as follows: K examples from each of M randomly selected classes from $C_{train}$ are sampled to form a support set $S = \{S_i\}_{i=1}^{M}$. A query set $Q = \{Q_i\}_{i=1}^{M}$ is formed by sampling from the rest of the M classes' samples. Next, for each class $k$, its support set $S_k \in S$ is used for computing a prototype using Equation 1. Then, a distribution over classes is produced for each query point $x_q \in Q$, based on a softmax over distances to the prototypes in the embedding space:

\[ p_\phi(y = k \mid x_q) = \frac{\exp(-d(f_\phi(x_q), c_k))}{\sum_{k'} \exp(-d(f_\phi(x_q), c_{k'}))}, \tag{2} \]

where $d : \mathbb{R}^E \times \mathbb{R}^E \to [0, +\infty)$ is a distance function. Based on that, the PN is trained by minimizing the negative log-probability of the true class $k$ via Stochastic Gradient Descent (SGD):

\[ J(\phi) = -\frac{1}{M} \sum_{k=1}^{M} \frac{1}{|Q_k|} \sum_{x_q \in Q_k} \log p_\phi(y = k \mid x_q). \tag{3} \]

The training is repeated with new, randomly generated training episodes until a stopping criterion is met. PNs employ the squared Euclidean distance as the distance metric.
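Equations 1 to 3 can be illustrated with a small pure-Python sketch. It assumes the embeddings $z = f_\phi(x)$ have already been computed, so no embedding network appears; the function names and the toy two-class episode are hypothetical and chosen only for illustration.

```python
import math


def squared_euclidean(z, c):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(z, c))


def prototypes(support):
    """Eq. (1): class prototype = mean of the embedded support points.

    support: dict mapping class label -> list of embedding vectors.
    """
    return {k: [sum(col) / len(vs) for col in zip(*vs)]
            for k, vs in support.items()}


def class_probabilities(z_q, protos, dist=squared_euclidean):
    """Eq. (2): softmax over negative distances to the prototypes."""
    neg = {k: -dist(z_q, c) for k, c in protos.items()}
    shift = max(neg.values())  # subtract the max for numerical stability
    exps = {k: math.exp(v - shift) for k, v in neg.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}


def episode_loss(queries, protos, dist=squared_euclidean):
    """Eq. (3): mean over classes of the mean negative log-probability.

    queries: dict mapping true class label -> list of query embeddings.
    """
    per_class = []
    for k, zs in queries.items():
        nll = -sum(math.log(class_probabilities(z, protos, dist)[k]) for z in zs)
        per_class.append(nll / len(zs))
    return sum(per_class) / len(per_class)


# Toy 2-way 2-shot episode with 2-D embeddings.
support = {"a": [[0, 0], [2, 0]], "b": [[4, 4], [6, 4]]}
protos = prototypes(support)
probs = class_probabilities([1, 1], protos)
loss = episode_loss({"a": [[1, 1]], "b": [[5, 5]]}, protos)
```

The `dist` argument is left pluggable so that a different dissimilarity measure can be swapped in without touching the softmax or the loss.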
The squared Euclidean distance between two arbitrary points $z = (z_1, \dots, z_n)$ and $c = (c_1, \dots, c_n)$ is defined as follows:

\[ d_{se}(z, c) = \|z - c\|^2 = \sum_{i=1}^{n} (z_i - c_i)^2. \tag{4} \]

Although combining the softmax and the Euclidean distance has been shown to give good performance in closed-set classification settings, it performs suboptimally when few labels are available relative to the number of classes. In order to address this issue and improve the discriminative ability, new loss formulations based on feature normalization have been proposed. These tend to normalize features explicitly via L2 normalization [15, 24, 4]. This typically results in a more compact embedding space than the Euclidean embedding space. In such an embedding space, the cosine distance is commonly chosen as the distance metric, and many few-shot classification approaches [23, 17] have employed the cosine distance in the hyperspherical embedding space. The cosine distance between two arbitrary points $z = (z_1, \dots, z_n)$ and $c = (c_1, \dots, c_n)$ is defined as:

\[ d_{cs} = 1 - \frac{z \cdot c}{\|z\| \|c\|} = 1 - \frac{\sum_{i=1}^{n} z_i c_i}{\sqrt{\sum_{i=1}^{n} z_i^2} \sqrt{\sum_{i=1}^{n} c_i^2}}. \tag{5} \]

However, feature normalization through hard normalization operations such as L2 normalization leads to a non-convex loss formulation, which typically results in local minima introduced by the loss function itself [30]. Since the network optimization itself is non-convex, it is important to preserve convexity in loss functions for more effective minimization. One possible solution is to use the Ring loss [30]. The Ring loss introduces an additional term to the primary loss function, which penalizes the squared difference between the norm of samples and a learned target norm value $R$. The modified loss function is defined as follows:

\[ L = L_P + \gamma L_R, \tag{6} \]

where $\gamma$ is the loss weight with respect to the primary loss $L_P$, and $L_R$ is the Ring loss, which is defined as:

\[ L_R = \frac{1}{2n} \sum_{i=1}^{n} (\|f_\phi(x_i)\| - R)^2. \tag{7} \]

Since the Ring loss encourages the norm of samples to take the value $R$ during training, instead of enforcing it explicitly through a hard normalization operation, the convexity of the loss function is preserved. However, the Ring loss is more difficult to train than the primary loss (e.g., the softmax loss) due to the added term (the norm difference $L_R$), the added parameter (the target norm $R$), and the added hyperparameter (the loss weight $\gamma$).

To address the shortcomings outlined above, we propose a novel dissimilarity measure for few-shot learning, called SEN. The SEN dissimilarity measure encourages the norm of samples to have the same value, in other words, it forces the data to lie on a scaled unit hypersphere, while preserving the convexity and the simplicity of the loss function.

3.3 SEN Dissimilarity Measure for Prototypical Networks

The SEN dissimilarity $d_s(z, c)$ between two arbitrary points $z = (z_1, \dots, z_n)$ and $c = (c_1, \dots, c_n)$ in n-dimensional space is a combination of the standard squared Euclidean distance $d_e$ and the squared norm distance $d_n$:

\[ d_s(z, c) = \sqrt{d_e(z, c) + \epsilon d_n(z, c)}, \tag{8} \]

where $\epsilon$ is a tunable balancing hyperparameter that must be chosen such that $d_e(z, c) + \epsilon d_n(z, c)$ is always positive. The terms $d_e(z, c)$ and $d_n(z, c)$ are defined as:

\[ d_e(z, c) = \|z - c\|^2, \qquad d_n(z, c) = (\|z\| - \|c\|)^2. \]

We modify the PN by replacing the Euclidean distance by our proposed SEN dissimilarity measure. We call this model SEN PN. Specifically, we replace the distance function $d(z_i, c_k)$ in Equation 2 by our proposed SEN dissimilarity measure $d_s(z_i, c_k) = \sqrt{d_e(z_i, c_k) + \epsilon d_n(z_i, c_k)}$, where $z_i$ is the embedding of the example $x_i$ and $c_k$ is the prototype of class $k$.
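Equation 8 translates directly into code. The sketch below is a plain-Python illustration under stated assumptions: the helper names `norm` and `sen` are introduced here, and raising an error on a negative radicand is one possible way to enforce the constraint on $\epsilon$, not something the text prescribes.

```python
import math


def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(a * a for a in v))


def sen(z, c, eps):
    """SEN dissimilarity, Eq. (8): sqrt(d_e + eps * d_n)."""
    d_e = sum((a - b) ** 2 for a, b in zip(z, c))  # squared Euclidean distance
    d_n = (norm(z) - norm(c)) ** 2                 # squared norm distance
    radicand = d_e + eps * d_n
    if radicand < 0:
        raise ValueError("eps makes d_e + eps * d_n negative")
    return math.sqrt(radicand)


z, c = [3.0, 4.0], [0.0, 1.0]        # norms 5 and 1, so d_e = 18 and d_n = 16
d_plain = sen(z, c, eps=0.0)         # reduces to the Euclidean distance
d_pos = sen(z, c, eps=1.0)           # norm mismatch adds to the dissimilarity
d_neg = sen(z, c, eps=-1e-7)         # norm mismatch slightly reduces it
d_same_norm = sen([1.0, 0.0], [0.0, 1.0], eps=1.0)  # d_n = 0: eps has no effect
```

By the reverse triangle inequality, $(\|z\| - \|c\|)^2 \le \|z - c\|^2$, so $d_n \le d_e$ and any $\epsilon > -1$ keeps the radicand non-negative.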
For simplicity, we consider the setting in which only one query example per class is used; however, the loss function presented in this section and the analysis presented in the next section can be easily generalized to settings in which more than one query example per class is used. When only one query example per class is used, the updated negative log-probability loss is given as:

\[ J(\phi) = -\sum_{k} \log p_\phi(y_i = k \mid x_i) = -\sum_{k} \log \frac{\exp(-d_s(z_i, c_k))}{\sum_{k'} \exp(-d_s(z_i, c_{k'}))} = \sum_{k} \left( d_s(z_i, c_k) + \log \sum_{k'} \exp(-d_s(z_i, c_{k'})) \right). \tag{9} \]

The learning proceeds by minimizing $J(\phi)$ of the true class $k$ via SGD, which is equivalent to minimizing the SEN dissimilarity measure between the query example $x_i$ and its prototype $c_k$, $d_s(z_i, c_k)$, and maximizing the SEN dissimilarity measures between the query example $x_i$ and the other prototypes $c_{k'}$, $d_s(z_i, c_{k'})$. Minimizing $d_s(z_i, c_k)$ pulls $z_i$ to its own class and encourages embeddings of the same class to have the same norm. Maximizing $d_s(z_i, c_{k'})$ pushes $z_i$ away from other classes; however, it encourages embeddings of different classes to have different norms. Since our goal is to force the data to lie on a scaled unit hypersphere, we define the balancing hyperparameter $\epsilon$ relative to $z_i$ and $c_k$ as follows:

\[ \epsilon_{ik} = \begin{cases} \epsilon_p > 0 & \text{if } y_i = k, \\ \epsilon_n < 0 & \text{if } y_i \neq k, \end{cases} \tag{10} \]

where $i$ is the index of the embedding $z_i$, $y_i$ is the embedding's class label, and $k$ is the class label of the prototype $c_k$. During training, a positive epsilon ($\epsilon_{ik} = \epsilon_p > 0$) is used for computing the SEN dissimilarity measure between the query example $x_i$ and its prototype $c_k$, while a negative epsilon ($\epsilon_{ik} = \epsilon_n < 0$) is used for computing the SEN dissimilarity measures between the query example $x_i$ and the other prototypes $c_{k'}$. The negative epsilon $\epsilon_n$ inverts the effect of the norm distance when maximizing $d_s(z_i, c_{k'})$.
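The class-dependent choice of $\epsilon_{ik}$ in Equation 10 can be sketched for a single query embedding as follows. The function names and the toy prototypes are assumptions made for illustration; the default values $\epsilon_p = 1.0$ and $\epsilon_n = -10^{-7}$ are the ones reported in the experimental setup of this paper.

```python
import math


def norm(v):
    return math.sqrt(sum(a * a for a in v))


def sen(z, c, eps):
    """SEN dissimilarity, Eq. (8)."""
    d_e = sum((a - b) ** 2 for a, b in zip(z, c))
    d_n = (norm(z) - norm(c)) ** 2
    return math.sqrt(d_e + eps * d_n)


def sen_query_loss(z, y, protos, eps_p=1.0, eps_n=-1e-7):
    """Training loss -log p(y | z) for one query embedding z with label y.

    Eq. (10): eps_p for the query's own prototype, eps_n for all others.
    """
    d = {k: sen(z, c, eps_p if k == y else eps_n) for k, c in protos.items()}
    shift = min(d.values())  # stabilise the log-sum-exp
    log_sum = -shift + math.log(sum(math.exp(-(v - shift)) for v in d.values()))
    return d[y] + log_sum


def predict(z, protos, eps_p=1.0):
    """Test time: the positive epsilon is used for every prototype."""
    return min(protos, key=lambda k: sen(z, protos[k], eps_p))


protos = {0: [1.0, 0.0], 1: [0.0, 1.0]}
z = [0.9, 0.2]
loss_true = sen_query_loss(z, 0, protos)   # z labelled with the nearby class
loss_wrong = sen_query_loss(z, 1, protos)  # same z labelled with the far class
```

Because the label is needed to pick $\epsilon_{ik}$, the switching scheme only applies during training; `predict` mirrors the test-time rule of using $\epsilon_p$ throughout.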
With a negative epsilon $\epsilon_n$, maximizing $d_s(z_i, c_{k'})$ therefore pushes $z_i$ away from other classes and encourages embeddings of all classes to have the same norm. The flexibility induced by the balancing hyperparameter $\epsilon_{ik}$ makes the SEN particularly suited to enable discriminative feature normalization in PNs. Our proposed SEN dissimilarity measure explicitly encourages the norm of samples to have the same value during training, while preserving the convexity and the simplicity of the loss function. At test time, a positive epsilon ($\epsilon_{ik} = \epsilon_p > 0$) is used for computing all dissimilarity measures. In the next section, we provide a theoretical analysis showing that our proposed SEN dissimilarity measure, together with the special balancing hyperparameter $\epsilon_{ik}$, explicitly pulls the data to a scaled unit hypersphere during training.

3.4 Theoretical Analysis

The partial derivative of the negative log-probability loss $J(\phi)$ with respect to $d_s(z_i, c_k)$ is given by:

\[ \frac{\partial J(\phi)}{\partial d_s(z_i, c_k)} = \mathbb{1}[y_i = k] - p_\phi(y_i = k \mid x_i), \tag{11} \]

where the Iverson bracket indicator function $\mathbb{1}[y_i = k]$ evaluates to 1 when $y_i = k$ and 0 otherwise. The partial derivative of the SEN dissimilarity measure $d_s(z_i, c_k)$ with respect to $z_i$ is given by:

\[ \begin{aligned} \frac{\partial d_s(z_i, c_k)}{\partial z_i} &= \frac{\partial \sqrt{d_e(z_i, c_k) + \epsilon_{ik} d_n(z_i, c_k)}}{\partial z_i} \\ &= \frac{(z_i - c_k) + \epsilon_{ik} (\|z_i\| - \|c_k\|) \frac{z_i}{\|z_i\|}}{d_s(z_i, c_k)} \\ &= -\frac{(c_k - z_i) + \epsilon_{ik} (\|c_k\| - \|z_i\|) \frac{z_i}{\|z_i\|}}{d_s(z_i, c_k)} \\ &= -\frac{v(z_i, c_k)}{d_s(z_i, c_k)}, \end{aligned} \tag{12} \]

where

\[ v(z_i, c_k) = (c_k - z_i) + \epsilon_{ik} (\|c_k\| - \|z_i\|) \frac{z_i}{\|z_i\|}. \tag{13} \]

Using the chain rule and summing over all prototypes, we get:

\[ \begin{aligned} \frac{\partial J(\phi)}{\partial z_i} &= \sum_{k} \frac{\partial J(\phi)}{\partial d_s(z_i, c_k)} \frac{\partial d_s(z_i, c_k)}{\partial z_i} \\ &= -\sum_{k} \frac{\mathbb{1}[y_i = k] - p_\phi(y_i = k \mid x_i)}{d_s(z_i, c_k)} v(z_i, c_k) \\ &= \sum_{k} \frac{\partial J_k(\phi)}{\partial z_i}. \end{aligned} \tag{14} \]

Thus, there is a gradient contribution from all prototypes.
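The closed form in Equation 12 can be checked numerically against central finite differences. The sketch below is only a sanity check under stated assumptions: the test points, the step size `h`, and the helper names are arbitrary choices made here, and the prototype $c_k$ is treated as a constant, as in the derivation.

```python
import math


def norm(v):
    return math.sqrt(sum(a * a for a in v))


def sen(z, c, eps):
    """SEN dissimilarity, Eq. (8)."""
    d_e = sum((a - b) ** 2 for a, b in zip(z, c))
    d_n = (norm(z) - norm(c)) ** 2
    return math.sqrt(d_e + eps * d_n)


def sen_grad(z, c, eps):
    """Analytic gradient of d_s with respect to z, Eq. (12): -v / d_s."""
    nz, nc = norm(z), norm(c)
    # Eq. (13): attractor term plus the norm-equalizing term along z / ||z||.
    v = [(cj - zj) + eps * (nc - nz) * zj / nz for zj, cj in zip(z, c)]
    d = sen(z, c, eps)
    return [-vj / d for vj in v]


def numeric_grad(f, z, h=1e-6):
    """Central finite differences of a scalar function f at the point z."""
    grad = []
    for j in range(len(z)):
        z_plus, z_minus = list(z), list(z)
        z_plus[j] += h
        z_minus[j] -= h
        grad.append((f(z_plus) - f(z_minus)) / (2 * h))
    return grad


z = [0.5, -1.2, 2.0]
c = [1.0, 0.3, -0.7]
errors = {}
for eps in (1.0, -1e-7, -0.5):
    analytic = sen_grad(z, c, eps)
    numeric = numeric_grad(lambda p: sen(p, c, eps), z)
    errors[eps] = max(abs(a - b) for a, b in zip(analytic, numeric))
```

For these points the two gradients should agree up to finite-difference error for both positive and negative values of $\epsilon$.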
The gradient contribution with respect to the correct prototype, i.e., when $k = k^* = y_i$, is given by:

\[ \frac{\partial J_{k^*}(\phi)}{\partial z_i} = -\frac{1 - p_\phi(y_i = k^* \mid x_i)}{d_s(z_i, c_{k^*})} v(z_i, c_{k^*}) = -\frac{1 - p_\phi(y_i = k^* \mid x_i)}{d_s(z_i, c_{k^*})} v_p(z_i, c_{k^*}), \tag{15} \]

where

\[ v_p(z_i, c_{k^*}) = \underbrace{(c_{k^*} - z_i)}_{\text{attractor}} + \underbrace{\epsilon_{ik^*} (\|c_{k^*}\| - \|z_i\|) \frac{z_i}{\|z_i\|}}_{\text{norm equalizer}}. \tag{16} \]

The gradient contribution with respect to incorrect prototypes, when $k = k' \neq y_i$, is given by:

\[ \frac{\partial J_{k'}(\phi)}{\partial z_i} = -\frac{0 - p_\phi(y_i = k' \mid x_i)}{d_s(z_i, c_{k'})} v(z_i, c_{k'}) = -\frac{p_\phi(y_i = k' \mid x_i)}{d_s(z_i, c_{k'})} v_n(z_i, c_{k'}), \tag{17} \]

where

\[ v_n(z_i, c_{k'}) = \underbrace{(z_i - c_{k'})}_{\text{repeller}} + \underbrace{\epsilon_{ik'} (\|z_i\| - \|c_{k'}\|) \frac{z_i}{\|z_i\|}}_{\text{norm equalizer}}. \tag{18} \]

From the preceding analysis, we observe the following:

1. Each gradient component contains an attractor/repeller, which encourages $z_i$ to move towards the correct prototype and away from the incorrect ones.
2. From (16), it is clear that if $\|c_{k^*}\| > \|z_i\|$ and $\epsilon_{ik^*} > 0$, then $\epsilon_{ik^*} (\|c_{k^*}\| - \|z_i\|) \frac{1}{\|z_i\|} > 0$, such that $\|z_i\|$ is encouraged to increase (and vice versa for $\|z_i\| > \|c_{k^*}\|$).
3. Conversely, from (18), if $\|c_{k'}\| > \|z_i\|$ and $\epsilon_{ik'} > 0$, then $\epsilon_{ik'} (\|z_i\| - \|c_{k'}\|) \frac{1}{\|z_i\|} < 0$ (and vice versa for $\|z_i\| > \|c_{k'}\|$). Thus, we need $\epsilon_{ik'} < 0$ in order to ensure similar behavior as with the correct prototype.

Observations 2 and 3 show that the gradient contributions with respect to the correct prototype and the incorrect ones cooperate to equalize the norms during training when $\epsilon_{ik^*} > 0$ and $\epsilon_{ik'} < 0$.

4 Experiments

To evaluate the effectiveness of the proposed SEN dissimilarity measure, we compare our proposed SEN PN approach with the original PN [19] and state-of-the-art distance metric learning-based approaches on the Mini-Imagenet [17, 23] and the Omniglot [9] datasets. Additional ablation studies are performed on the Fewshot-CIFAR100 (FC100) [14] dataset.
4.1 Experimental Setup and Results

Embedding networks. We utilize the same embedding network as that used by the original PN. Specifically, our network, which we refer to as 4CONV, comprises four convolutional blocks. Each block is composed of 64 3 × 3 convolutional filters, a batch normalization layer, a ReLU nonlinearity, and a 2 × 2 max-pooling layer. To test the performance of the SEN dissimilarity measure in more general settings, we employ a more sophisticated network, the Wide Residual Network (WRN) [28], as the embedding network. We use the same network architecture proposed in [3], which is a network of depth 16 and a widening factor of 6. We train the network with both the traditional Euclidean distance (WRN PN) and the SEN dissimilarity measure (SEN WRN PN).

Table 1. Few-shot classification accuracy.

Model                         Network  Mini-Imagenet
Original PN [19]              4CONV    68.2%
Large Margin GNN [27]         4CONV    67.6%
Large Margin PN [27]          4CONV    66.8%
RN [20]                       4CONV    65.3%
Matching Nets [23]            4CONV    60.0%
MetaGAN + RN [29]             4CONV    68.6%
Semi-Supervised PN [3]        4CONV    65.5%
PN (ours, baseline)           4CONV    67.8%
SEN PN (ours)                 4CONV    69.8%
Supervised WRN PN [3]         WRN      69.6%
Semi-Supervised WRN PN [3]    WRN      70.9%
WRN PN (ours)                 WRN      71.0%
SEN WRN PN (ours)             WRN      72.3%

Omniglot accuracies, in the order they appear in the source (10 values for the 13 rows above): 98.9%, 99.2%, 98.7%, 99.1%, 98.7%, 99.2%, 98.6%, 98.8%, 99.2%, 99.4%.

Hyperparameter $\epsilon$. For SEN-based models, during training, $\epsilon_p = 1.0$ is used for computing the SEN between the query example and its prototype, while $\epsilon_n = -10^{-7}$ is used for computing the SEN between the query example and the other prototypes. During testing, $\epsilon_p = 1.0$ is used for computing all the SEN dissimilarity measures. A discussion on how the hyperparameters $\epsilon_p$ and $\epsilon_n$ were chosen can be found in the supplementary material.

Results. The test results are shown in Table 1.
As can be seen from Table 1, although our implementation of the PN (the baseline model) is 0.4 percentage points less accurate than the original implementation of the PN (67.8% vs 68.2%), the baseline model trained with the proposed SEN dissimilarity measure still outperforms the original PN, obtaining a relative increase of 2.4% and achieving an accuracy of 69.8%. In addition, the SEN WRN PN outperforms the Semi-Supervised WRN PN by a relative increase of 2% and achieves an accuracy of 72.3% with the WRN as the embedding network. Similar trends can be observed for the Omniglot dataset, where SEN PN outperforms our PN implementation and SEN WRN PN outperforms WRN PN.

4.2 Ablation Study

To investigate the effectiveness and behavior of the proposed SEN dissimilarity measure, we conduct several ablation studies. First, we compare the PN trained with the Euclidean distance (PN), the PN trained with the Ring loss (Ring PN), and the PN trained with the SEN dissimilarity measure (SEN PN). The test results are shown in Table 2. We train the Ring PN with different values of $\gamma$, the loss weight with respect to the primary loss, in the range $[10^{-10}, 1]$ and pick $\gamma = 10^{-7}$ since it results in the highest accuracy. $R$ was learned during training following [30].

Table 2. Few-shot classification accuracy on the Omniglot [9] (20-way 5-shot), the Mini-Imagenet [17, 23] (5-way 5-shot), and the FC100 [14] (5-way 5-shot) datasets.

Model     Omniglot  Mini-Imagenet  FC100
PN        98.6%     67.8%          52.4%
Ring PN   98.7%     68.6%          52.8%
SEN PN    98.8%     69.8%          54.6%

Fig. 1. 2D embeddings produced by the PN (left), the Ring PN (middle) and the SEN PN (right). The circles denote query examples, and the stars denote prototypes.
As can be seen from Table 2, the Ring loss improves the accuracy relative to the PN on the Mini-Imagenet dataset by 1.8%; however, it performs worse than our proposed SEN PN approach, which obtains a relative increase of 3%. Similar behavior is obtained for other few-shot learning datasets such as FC100 and Omniglot. A more thorough discussion of SEN PN vs Ring PN can be found in the supplementary material.

Principal Component Analysis (PCA). We project the 1600D embeddings produced by the PN, the Ring PN, and the SEN PN to 2D space using PCA and visualize the outputs (see Figure 1). As can be seen from Figure 1, the Ring loss forces the prototypes to lie on a scaled unit hypersphere; however, the prototypes produced by the Ring PN are not as well separated as the ones produced by the PN. Our proposed SEN dissimilarity measure, on the other hand, both forces the prototypes to lie on a scaled unit hypersphere and keeps them well separated.

Analysis of norm. We plot the norm of embeddings produced by the PN, the Ring PN, and the SEN PN. As can be seen from Figure 2, the norms of embeddings produced by the PN and the Ring PN vary a lot, while the norm of embeddings produced by the SEN PN has a very consistent value. This confirms that SEN encourages all embeddings to have the same norm during training. Both the SEN and the Ring loss are adopted to encourage embeddings to have the same norm during the training of the PN. However, as can be seen from Figure 2, the proposed SEN dissimilarity measure is a better choice for the task than the Ring loss. This is partly due to the use of a very small gamma ($\gamma = 10^{-7}$) when training the Ring PN. In our experiments, higher gamma values do encourage the norm of embeddings to have a more consistent value; however, they cause a considerable decrease in the accuracy of the PN. This suggests that the Ring loss is not an optimal choice for enforcing feature normalization in PNs. The proposed SEN dissimilarity measure, on the other hand, both encourages all embeddings to have the same norm and improves the accuracy of PNs. This indicates that the proposed SEN dissimilarity measure is a more suitable choice for feature normalization than the Ring loss in training PNs.

Fig. 2. The norm of embeddings produced by the PN (left), the Ring PN (middle), and the SEN PN (right). The stars denote query examples, and the diamonds denote prototypes.

Fig. 3. The PN vs the SEN PN with different embedding sizes.

Analysis of embedding dimensionality. We compare the PN and the SEN PN trained with different embedding sizes (see Figure 3). As can be seen from Figure 3, in low-dimensional spaces, the PN and the SEN PN perform very similarly; however, in high-dimensional spaces, the SEN PN consistently outperforms the PN by a considerable margin. This suggests that the SEN dissimilarity measure is a more suitable distance metric for distance metric learning-based few-shot learning than the standard Euclidean distance in high-dimensional spaces. This further explains the limited improvement on the Omniglot dataset, where the embedding size is 64 compared to 1600 for the remaining datasets.

Analysis of distance. We evaluate the possibility of combining the proposed SEN dissimilarity measure with other distance functions, such as the Euclidean distance and the cosine distance, in training PNs. Specifically, we train the PN with the SEN dissimilarity measure and test the trained model with both the Euclidean distance and the cosine distance. We compare the two tested models with the original PN, the SEN PN, and the Cosine PN (the PN trained and tested with the cosine distance). The test results are shown in Table 3.
Table 3. Test results of the PN with different distances on the Omniglot [9] (20-way 5-shot), the Mini-Imagenet [17, 23] (5-way 5-shot), and the Fewshot-CIFAR100 (FC100) [14] (5-way 5-shot) datasets.

Train distance  Test distance  Omniglot  Mini-Imagenet  FC100
Cosine          Cosine         61.5%     53.3%          44.9%
Cosine          SEN            55.2%     51.4%          43.8%
Euclidean       Euclidean      98.6%     67.8%          52.4%
Euclidean       SEN            98.7%     68.5%          53.1%
SEN             SEN            98.8%     69.8%          54.6%
SEN             Euclidean      98.8%     68.8%          53.9%
SEN             Cosine         98.8%     69.8%          54.6%

Fig. 4. 2D embeddings produced by the Siamese Baseline (left), the Siamese Ring (middle), and the Siamese SEN (right).

As can be seen from Table 3, the model trained with the SEN dissimilarity measure achieves the highest accuracy on the Mini-Imagenet, the FC100, and the Omniglot datasets when tested with either the SEN dissimilarity measure or the cosine distance. This is because the SEN dissimilarity measure explicitly forces all embeddings to have the same norm during training and, as a result, pulls the prototypes very close to the hypersphere. For data embedded on a hypersphere, the cosine distance is a natural measure of distance [5, 2]. Experiments and discussions on alternative design choices for SEN can be found in the supplementary material.

SEN beyond few-shot learning. We have demonstrated that the SEN dissimilarity measure outperforms the commonly used Euclidean distance in distance metric learning-based few-shot learning with prototypical networks. In this section, we study the behavior of the proposed SEN in combination with other metric learning-based tasks, which are based on the idea of obtaining inter-class separability and intra-class compactness. Note that, due to the lack of prototypes, the SEN distance is here computed between datapoints directly. To do this, we implement the well-known Siamese network and Contrastive loss [7]. We call this model the Siamese Baseline.
We augment it by replacing the Euclidean distance by our proposed SEN dissimilarity measure (Siamese SEN) and by employing the Ring loss (Siamese Ring). We train the three models on the MNIST dataset [10] for dimensionality reduction and clustering. When training the Siamese SEN, following the reasoning of Section 3.3, a positive epsilon ($\epsilon_{ik} = \epsilon_p > 0$) is used for computing the SEN dissimilarity measures between examples of the same class, and a negative epsilon ($\epsilon_{ik} = \epsilon_n < 0$) is used for computing the SEN dissimilarity measures between examples of different classes. At test time, a positive epsilon ($\epsilon_{ik} = \epsilon_p > 0$) is used for computing all dissimilarity measures. As can be seen from Figure 4, the Siamese Ring forces all embeddings to lie on a scaled unit hypersphere; however, embeddings produced by the Siamese Ring are not as well separated as embeddings produced by the Siamese Baseline. Our proposed SEN dissimilarity measure, on the other hand, both forces all embeddings to lie on a scaled unit hypersphere and keeps the embeddings well separated. This suggests that SEN can also be used beyond the field of few-shot learning, wherever distance metric learning is used and class memberships are available. In future work, other promising lines of research are to combine feature normalization with weight normalization techniques [11] and analyze their synergy, as well as to analyze the potential of SEN in other prototype-based methods [12].

5 Conclusion

In this paper, we propose a novel dissimilarity measure, called SEN, for distance metric learning-based few-shot learning by modifying the traditional Euclidean distance to attenuate the curse of dimensionality in high-dimensional spaces. The SEN is a combination of the Euclidean distance and the norm distance. We extend the prototypical network by replacing the Euclidean distance by our proposed SEN dissimilarity measure, which we refer to as SEN PN.
With minimal modifications, the SEN PN outperforms the original PN by a considerable margin and demonstrates good performance on the Mini-Imagenet, the FC100, and the Omniglot datasets with no additional parameters and negligible computational overhead. We provide analyses showing that the proposed SEN dissimilarity measure encourages the embeddings to have the same norm and enables the SEN PN to generate a hyperspherical embedding space, which is a more compact embedding space than the Euclidean space. We experimentally show that the proposed SEN dissimilarity measure consistently outperforms the Euclidean distance in PNs with different embedding sizes as well as with different embedding networks. We also show that SEN is an effective feature normalization technique not only for distance metric learning-based few-shot learning with PNs but also potentially for more general tasks, here exemplified by the Siamese network.

References

1. Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., Lillicrap, T.P.: One-shot learning with memory-augmented neural networks. CoRR abs/1605.06065 (2016), http://arxiv.org/abs/1605.06065
2. Banerjee, A., Dhillon, I.S., Ghosh, J., Sra, S.: Clustering on the unit hypersphere using von Mises-Fisher distributions. The Journal of Machine Learning Research 6, 1345–1382 (2005)
3. Boney, R., Ilin, A.: Semi-supervised few-shot learning with prototypical networks. CoRR abs/1711.10856 (2017), http://arxiv.org/abs/1711.10856
4. Deng, J., Guo, J., Xue, N., Zafeiriou, S.: Arcface: Additive angular margin loss for deep face recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4690–4699 (2019)
5. Dhillon, I.S., Fan, J., Guan, Y.: Efficient clustering of very large document collections. In: Data mining for scientific and engineering applications, pp. 357–381. Springer (2001)
6. Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep networks. In: Proceedings of the 34th International Conference on Machine Learning-Volume 70. pp. 1126–1135. JMLR.org (2017)
7. Hadsell, R., Chopra, S., LeCun, Y.: Dimensionality reduction by learning an invariant mapping. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06). vol. 2, pp. 1735–1742. IEEE (2006)
8. Koch, G., Zemel, R., Salakhutdinov, R.: Siamese neural networks for one-shot image recognition. In: ICML deep learning workshop. vol. 2 (2015)
9. Lake, B., Salakhutdinov, R., Gross, J., Tenenbaum, J.: One shot learning of simple visual concepts. In: Proceedings of the annual meeting of the cognitive science society. vol. 33 (2011)
10. LeCun, Y., Cortes, C.: MNIST handwritten digit database (2010), http://yann.lecun.com/exdb/mnist/
11. Liu, W., Lin, R., Liu, Z., Liu, L., Yu, Z., Dai, B., Song, L.: Learning towards minimum hyperspherical energy. In: Advances in neural information processing systems. pp. 6222–6233 (2018)
12. Mettes, P., van der Pol, E., Snoek, C.: Hyperspherical prototype networks. In: Advances in Neural Information Processing Systems. pp. 1485–1495 (2019)
13. Mishra, N., Rohaninejad, M., Chen, X., Abbeel, P.: A simple neural attentive meta-learner. CoRR abs/1707.03141 (2017), http://arxiv.org/abs/1707.03141
14. Oreshkin, B.N., Rodriguez, P., Lacoste, A.: Tadam: task dependent adaptive metric for improved few-shot learning. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems. pp. 719–729. Curran Associates Inc. (2018)
15. Ranjan, R., Castillo, C.D., Chellappa, R.: L2-constrained softmax loss for discriminative face verification. arXiv preprint arXiv:1703.09507 (2017)
16. Ren, M., Triantafillou, E., Ravi, S., Snell, J., Swersky, K., Tenenbaum, J.B., Larochelle, H., Zemel, R.S.: Meta-learning for semi-supervised few-shot classification. CoRR abs/1803.00676 (2018), http://arxiv.org/abs/1803.00676
17. Ravi, S., Larochelle, H.: Optimization as a model for few-shot learning. In: 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings (2017), https://openreview.net/forum?id=rJY0-Kcll
18. Schroff, F., Kalenichenko, D., Philbin, J.: Facenet: A unified embedding for face recognition and clustering. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015. pp. 815–823 (2015). https://doi.org/10.1109/CVPR.2015.7298682
19. Snell, J., Swersky, K., Zemel, R.: Prototypical networks for few-shot learning. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. pp. 4080–4090. Curran Associates Inc. (2017)
20. Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1199–1208. IEEE (2018)
21. Vanschoren, J.: Meta-learning: A survey. CoRR abs/1810.03548 (2018), http://arxiv.org/abs/1810.03548
22. Garcia, V., Bruna, J.: Few-shot learning with graph neural networks. In: International Conference on Learning Representations (2018), https://openreview.net/forum?id=BJj6qGbRW
23. Vinyals, O., Blundell, C., Lillicrap, T., Kavukcuoglu, K., Wierstra, D.: Matching networks for one shot learning. In: Proceedings of the 30th International Conference on Neural Information Processing Systems. pp. 3637–3645. Curran Associates Inc. (2016)
24. Wang, F., Xiang, X., Cheng, J., Yuille, A.L.: Normface: L2 hypersphere embedding for face verification. In: Proceedings of the 25th ACM international conference on Multimedia. pp. 1041–1049. ACM (2017)
25. Wang, Y., Yao, Q.: Few-shot learning: A survey. CoRR abs/1904.05046 (2019), http://arxiv.org/abs/1904.05046
26. Wang, Y.X., Girshick, R., Hebert, M., Hariharan, B.: Low-shot learning from imaginary data. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 7278–7286. IEEE (2018)
27. Wang, Y., Wu, X.M., Li, Q., Gu, J., Xiang, W., Zhang, L., Li, V.O.K.: Large margin few-shot learning. CoRR abs/1807.02872 (2018), http://arxiv.org/abs/1807.02872
28. Zagoruyko, S., Komodakis, N.: Wide residual networks. CoRR abs/1605.07146 (2016), http://arxiv.org/abs/1605.07146
29. Zhang, R., Che, T., Ghahramani, Z., Bengio, Y., Song, Y.: Metagan: an adversarial approach to few-shot learning. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems. pp. 2371–2380. Curran Associates Inc. (2018)
30. Zheng, Y., Pal, D.K., Savvides, M.: Ring loss: Convex feature normalization for face recognition. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5089–5097. IEEE (2018)