Animal Classification Using Facial Images With Score-Level Fusion
Research Article
Turkey
E-mail: onsen.toygar@emu.edu.tr
Abstract: A real-world animal biometric system that detects and describes animal life in image and video data is an emerging
subject in machine vision. These systems develop computer vision approaches for the classification of animals. A novel method
for animal face classification based on score-level fusion of recently popular convolutional neural network (CNN) features and
appearance-based descriptor features is presented. This method utilises a score-level fusion of two different approaches; one
uses CNN which can automatically extract features, learn and classify them; and the other one uses kernel Fisher analysis
(KFA) for its feature extraction phase. The proposed method may also be used in other areas of image classification and object
recognition. The experimental results show that automatic feature extraction in CNN is better than other simple feature
extraction techniques (both local- and appearance-based features), and additionally, appropriate score-level combination of
CNN and simple features can achieve even higher accuracy than applying CNN alone. The authors showed that the score-level
fusion of CNN-extracted features and the appearance-based KFA method has a positive effect on classification accuracy. The
proposed method achieves a 95.31% classification rate on animal faces, which is significantly better than the other state-of-the-art
methods.
Fig. 5 LHI-Animal-Faces dataset. Five images are shown for each category
and training animal images is assumed to be the score of that sample. In the next step, we normalised these obtained scores and fused them together. Finally, we make the decision by using a nearest neighbour (NN) classifier which uses the normalised fused scores.
5 Experiments and results
Several experiments have been carried out in order to show the performance of the proposed method and the other state-of-the-art methods. In the following subsections, the dataset description and the experimental setup and results are presented.
5.1 Dataset
The LHI-Animal-Faces dataset [13] consists of 19 classes of animal head images plus one class of human head images, with 2200 images overall. Fig. 5 shows five sample images from each of these categories. In contrast with other general classification datasets, LHI-Animal-Faces contains only animal or human faces, which have large inter-class similarities (due to evolutionary relationships, some animal face categories are similar to other classes) and large intra-class variations (rotation, posture variation, subtypes).
5.2 Experimental setup and results
In all of the following experiments, the LHI-Animal-Faces dataset is divided into a training and a test set in the same way as in the AND-OR template (AOT) [32] method. We use 30 animal images of each class as the training set, and the remaining images of each class as test examples.
In order to eliminate the negative effect of different factors such as size, illumination and picture quality, all the images have been resized to 60 × 60 pixels, and pixel intensity normalisation and histogram equalisation are performed in the pre-processing step for local and appearance-based features; for CNN features, we only resized the image to 224 × 224 for VGG-16 and 227 × 227 for AlexNet architectures. We investigated different types of feature descriptors for both local feature-based and appearance-based classifiers. We used HOG, completed local binary patterns (CLBP), local binary pattern histogram Fourier (LBP-HF), Haralick features and MRELBP as local feature descriptors and tested LDA and KFA as appearance-based feature extractors. In each method, we used a cross-validation approach to find the optimal parameter settings. After feature extraction, we used a linear support vector machine (SVM) for classification. The accuracy of each individual method is reported in Table 1.
Additionally, we used a CNN as an automatic feature extractor. We selected publicly available CNNs, the AlexNet and VGG-16 architectures, pre-trained on the ImageNet database [18] for image classification. Our choice is motivated by the impressive results achieved using these two models on the ILSVRC [16] and by the fact that the pre-trained versions of these two models are freely accessible. These models are trained over hundreds of thousands of different images and can separate an image into 1000 pre-defined classes of different objects such as car, bike, airplane and many other object categories. As a result, these models have learned powerful discriminative feature sets which can be used in other object recognition tasks.
The CNN architectures of AlexNet and VGG-16 contain millions of parameters. Directly learning so many parameters from only a few hundred training images from the LHI-Animal-Faces dataset is problematic. There are several solutions for this problem. Our first solution is to use a CNN pre-trained on a large image collection such as the ImageNet database, and then re-use it on the LHI-Animal-Faces dataset. In order to use the representational power of pre-trained deep networks, 4096-dimensional features are extracted from the activations of the FC7 layer. The reason that we selected
682 IET Comput. Vis., 2018, Vol. 12 Iss. 5, pp. 679-685
© The Institution of Engineering and Technology 2018
Table 1 Classification accuracy of different methods on LHI-Animal-Faces dataset
Type of methods                              | Method                                                  | Accuracy, %
local feature descriptor methods             | HOG                                                     | 66.54
                                             | LBP                                                     | 61.74
                                             | CLBP [33]                                               | 63.59
                                             | Fourier-LBP [34]                                        | 50.29
                                             | Haralick feature                                        | 49.27
                                             | BIF                                                     | 68.46
                                             | median robust CLBP (MRCLBP)                             | 68.46
appearance-based feature descriptor methods  | LDA                                                     | 60.33
                                             | KFA                                                     | 69.87
CNN features                                 | FC7 AlexNet features                                    | 89.91
                                             | FC7 VGG-16 features                                     | 92.84
                                             | Fine-tuned AlexNet                                      | 91.06
                                             | Fine-tuned VGG-16                                       | 94.39
score-level fusion methods                   | LDA + HOG                                               | 74.32
                                             | LDA + LBP                                               | 68.91
                                             | LDA + CLBP                                              | 70.23
                                             | LDA + Fourier-LBP                                       | 62.44
                                             | LDA + Haralick feature                                  | 61.59
                                             | LDA + BIF                                               | 77.26
                                             | LDA + MRCLBP                                            | 76.30
                                             | LDA + FC7 AlexNet features                              | 90.61
                                             | LDA + FC7 VGG-16 features                               | 93.77
                                             | KFA + HOG                                               | 76.48
                                             | KFA + LBP                                               | 74.19
                                             | KFA + CLBP                                              | 74.14
                                             | KFA + Fourier-LBP                                       | 72.65
                                             | KFA + Haralick feature                                  | 70.94
                                             | KFA + MRCLBP                                            | 78.98
                                             | KFA + FC7 AlexNet features                              | 91.37
                                             | KFA + FC7 VGG-16 features                               | 94.21
                                             | KFA + FC7 fine-tuned AlexNet features                   | 92.86
                                             | proposed method (KFA + FC7 fine-tuned VGG-16 features)  | 95.31
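The score-level fusion rows of Table 1 are obtained by min–max normalising the scores of two systems, Eq. (1), combining them with a weighted sum, Eq. (2), and choosing the weight by grid search. The following Python lines are a minimal sketch of that procedure, not the authors' implementation. Several details are assumptions made only for illustration: each system is taken to output one distance per class for every sample, normalisation is applied per sample, the grid uses steps of 0.1, the weight is selected on a labelled set, and all function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def min_max_normalise(scores: Sequence[float]) -> List[float]:
    """Eq. (1): map raw scores to [0, 1] via (x - min) / (max - min)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # Assumption: constant scores carry no information, so map them to 0.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fuse(x1: Sequence[float], x2: Sequence[float], w1: float) -> List[float]:
    """Eq. (2): sum-rule fusion f = w1 * x1 + w2 * x2 with w2 = 1 - w1."""
    w2 = 1.0 - w1
    return [w1 * a + w2 * b for a, b in zip(x1, x2)]


def predict(fused: Sequence[float]) -> int:
    """Scores are distances, so the nearest (smallest) fused score wins."""
    return min(range(len(fused)), key=fused.__getitem__)


def grid_search_w1(
    dists1: Sequence[Sequence[float]],
    dists2: Sequence[Sequence[float]],
    labels: Sequence[int],
    steps: int = 10,
) -> Tuple[float, float]:
    """Try w1 = 0, 1/steps, ..., 1 and return (best w1, best accuracy)."""
    best_w1, best_acc = 0.0, -1.0
    for k in range(steps + 1):
        w1 = k / steps
        correct = 0
        for d1, d2, y in zip(dists1, dists2, labels):
            fused = fuse(min_max_normalise(d1), min_max_normalise(d2), w1)
            correct += int(predict(fused) == y)
        acc = correct / len(labels)
        if acc > best_acc:  # keep the first weight reaching the best accuracy
            best_w1, best_acc = w1, acc
    return best_w1, best_acc
```

Here dists1 would hold the scores of the first-mentioned system (for example KFA) and dists2 those of the second (for example the CNN features), so the returned w1 plays the role of the first weight in the (method1 + method2) notation of Table 1.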
activations of the FC7 layer for feature extraction is that, based on Donahue's results [35], the activations of the earlier layers can extract mid-level image features and activations of the late layers can be used as very powerful features for many classification applications. In the next step, we train a linear multiclass SVM on the extracted features.
In order to extract features from an image, the image is resized and fed to the CNN as a multidimensional array of pixel intensities. The input of the VGG-16 network is a 224 × 224 × 3 matrix, while the input of the AlexNet architecture is a 227 × 227 × 3 matrix. As the experimental results in Table 1 show, the classifier trained using AlexNet features provides close to 90% accuracy and the classifier trained using VGG-16 features provides close to 93% accuracy, both of which are higher than the accuracy achieved using hand-crafted features [36] such as LBP and HOG. The reason that VGG-16 outperforms AlexNet is that the VGG-16 architecture is much deeper than AlexNet, with 16 layers in total: 13 convolutional and three fully connected layers. VGG-16 uses small convolutional filters of 3 × 3 pixels, so each filter captures simpler geometrical structures, but in comparison the network allows more complex reasoning through its increased depth.
Another solution for the problem is fine-tuning the pre-trained CNN by using our limited dataset. However, since our dataset is small, fine-tuning the pre-trained network might lead to overfitting, especially since the last few layers of the VGG-16 and AlexNet networks are fully connected layers. In order to avoid overfitting, we increase the number of training samples with common data augmentation strategies such as translation, rotation and flipping. Training on the augmented dataset will make the resulting model more robust and less prone to overfitting. We increase the size of the training set five times by performing random rotations and translations on each image sample. Also, we resize the training images to 256 × 256 and then we select four different patches with size 224 × 224 for VGG-16 and 227 × 227 for AlexNet by using the four image corners, and use them as the input of the corresponding CNN architecture.
As the first step of fine-tuning, we truncate the last layer (softmax layer) of the pre-trained VGG-16 and AlexNet networks and replace it with our new softmax layer with 20 outputs or categories instead of 1000 categories. We freeze the weights of the first early layers so that they remain intact throughout the fine-tuning process. Then we fine-tune the network by using the stochastic gradient descent algorithm to minimise the loss function with a small initial learning rate of 0.001.
We believed that an appropriate score-level combination of CNN and other simple features can achieve even higher accuracy than applying CNN alone. In order to show that score-level fusion can improve the accuracy of the classification system, we tested different types of feature descriptors, namely local feature-based, appearance-based and CNN-based features. We used HOG, CLBP, LBP-HF, Haralick features and MRELBP as local feature descriptors, tested LDA and KFA as appearance-based feature extractors and used FC7 activation features of AlexNet and VGG-16 as CNN-based features. We tested score-level fusion of different combinations of these features, and some of the best combinations are listed in Table 1. This procedure is detailed below.
The distance between each test sample and its nearest training samples is assumed to be the score of that test sample in the corresponding classifier. These scores are normalised by the min–max normalisation [37] method as follows:
$$x' = \frac{x - \min(x)}{\max(x) - \min(x)} \qquad (1)$$
where $x$ is the raw score, $\max(x)$ and $\min(x)$ are the maximum and minimum values of the raw scores, respectively, and $x'$ is the normalised score.
The multimodal score vector $(x_1, x_2)$ can be constructed after score normalisation, with $x_1$ and $x_2$ corresponding to the normalised scores of the two different systems. The next step is fusion at the matching score level. The score vector is combined by the sum rule-based fusion method [37] to generate a single scalar score, which is then used to make the final decision, as follows:
$$f_s = w_1 x_1 + w_2 x_2 \qquad (2)$$
The notation $w_i$ stands for the weight which is assigned to one of the two systems and reflects the relative importance of the two systems. In Table 1, $w_1$ is the weight of the first mentioned method in the (method1 + method2) syntax. We used a grid-search algorithm to find the optimal values for $w_1$ and $w_2$ by trying different values between 0 and 1, with $w_1 = 1 - w_2$, for all the experiments with the score-level fusion method. The optimal values for the proposed method are $w_1 = 0.3$ and $w_2 = 0.7$, which show the importance of the CNN system over the KFA method.
The experimental results show that in all the cases, the score-level fusion causes meaningful improvement in accuracy. Table 1 shows the classification accuracy of different feature descriptor methods individually and their score-level fusion results on the LHI-Animal-Faces dataset. As shown in the table, the proposed method, which uses score-level fusion of the FC7 activation features of the fine-tuned pre-trained VGG-16 and KFA, achieves a 95.31% classification rate, which is better than all the other methods presented in that table. Therefore, it can be stated that the proposed method outperforms the other local and appearance-based methods and all the possible combination pairs of these methods with score-level fusion.
In order to show the classification accuracy for each class, we computed confusion matrices for both cases of fine-tuned VGG-16 alone and score-level fusion of VGG-16 and KFA, as shown in Fig. 6. The classification accuracy of seven classes is 100% and for many other classes it is acceptable and close to 100%. In the fine-tuned VGG-16 confusion matrix, the maximum confusions are caused by deer head and rabbit head versus dog head (12%). These confusion values are reduced to 0 and 4%, respectively, in the score-level fusion of fine-tuned VGG-16 and KFA. In the proposed method, the maximum confusions are caused by bear head versus pigeon head and rabbit head versus mouse head (8%). In the fine-tuned VGG-16, 11 classes have confusion with the dog head class, which is reduced to 4 classes in the score-level fusion of fine-tuned VGG-16 and KFA.
On the other hand, the proposed method is compared with the other state-of-the-art methods that presented results on the LHI-Animal-Faces dataset. Table 2 exhibits the classification accuracy of our proposed method and state-of-the-art methods on LHI-Animal-Faces. It is clearly seen that our proposed method