We show that, for each of five datasets of increasing complexity, certain training samples are more informative of class membership than others. These samples can be identified prior to training by analyzing their position in the reduced-dimensional space relative to their class centroids. Specifically, we demonstrate that, for the datasets studied, samples nearer to their class’s centroid are less informative than those farther from it. For all five datasets, we show that there is no statistically significant difference between training on the entire training set and training on a set that excludes up to 2% of the data nearest to each class’s centroid.
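
The selection step summarized above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: it assumes the samples have already been reduced to a low-dimensional array, that the centroid is the per-class mean, and that distance is Euclidean. The function name `drop_nearest_to_centroid` and its parameters are hypothetical.

```python
import numpy as np

def drop_nearest_to_centroid(embeddings, labels, fraction=0.02):
    """Return sorted indices of the samples kept after removing, per class,
    the given fraction of samples closest to that class's centroid.

    Assumptions (not taken from the paper): `embeddings` is an
    (n_samples, n_dims) array of already-reduced data, the centroid is the
    per-class mean, and distance is Euclidean.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    keep = np.ones(len(labels), dtype=bool)
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)            # positions of this class
        centroid = embeddings[idx].mean(axis=0)        # class centroid
        dist = np.linalg.norm(embeddings[idx] - centroid, axis=1)
        n_drop = int(np.floor(fraction * len(idx)))    # samples to exclude
        nearest = idx[np.argsort(dist, kind="stable")[:n_drop]]
        keep[nearest] = False
    return np.flatnonzero(keep)
```

The returned indices could then be used to subset the original training set before training, with `fraction` varied to compare against training on the full set.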

The authors would like to acknowledge the support provided by the Cognitively-inspired Agile Information and Knowledge Modelling (CALM) project, funded by the United States Air Force Office of Scientific Research.
Figs. 17–36 plot the values of each of the three dimensions of the reductions for each individual class of the training data for MNIST and Imagenette. The MNIST plots show, first, that the reduction for each class produces values significantly different from those of the other classes and, further, that the values in each dimension of each class are tightly grouped on the number line, with the noticeable exception of the third dimension of the class that represents the digit 1. It is worth noting that the stylization of the Hindu-Arabic numeral ’1’ contains most of its information in two dimensions (accounting for translation), which provides an explanation for the larger variance in the third dimension. The plots of the Imagenette classes, on the other hand, are much more similar to one another and display greater variance in each of the three dimensions.
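
The visual comparison described above could also be summarized numerically. The following is a small sketch, assuming the reduced training data are available as an array with one row per sample; the function name `per_class_dimension_stats` is hypothetical, and population variance is an arbitrary choice made for illustration.

```python
import numpy as np

def per_class_dimension_stats(embeddings, labels):
    """Map each class label to (mean, variance) arrays, one entry per
    reduced dimension. Population variance (ddof=0) is an assumption."""
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    stats = {}
    for cls in np.unique(labels):
        pts = embeddings[labels == cls]                # samples of this class
        stats[cls.item()] = (pts.mean(axis=0), pts.var(axis=0))
    return stats
```

A class whose values are tightly grouped in a given dimension would show a small variance entry for that dimension, while a spread such as the one noted for the third dimension of the digit 1 would appear as a comparatively large entry.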
