1. Introduction
Facial recognition has emerged as an important area in various fields, including image processing, cognitive science, computer vision, machine learning, and pattern recognition, driven by its application in biometric verification technology. The human face can be recognized from a video stream or a static image using face databases containing many different individuals. While humans can recognize faces with ease, machines struggle to do so because of the complex features within each face that make every individual distinct from one another [1]. In recent decades, machine learning and deep learning models have shown promising results in automatic face recognition. This capability is critical for tasks such as general surveillance, criminal and terrorist identification, military applications, border security, and immigration control.
Due to its significant applications, facial recognition has been a focal point of research for many years. The automatic recognition of faces is a complex task due to various challenges such as illumination, pose, expressions, occlusion, and demographic differences such as gender and race [
2]. A wide range of deep learning-based methods [
3,
4] have been developed for face recognition. However, an effective facial recognition system must extract key features that maximize inter-class separation between subjects while minimizing intra-class variation within each subject. Despite considerable progress, managing these variations, particularly inter-class variations such as age, appearances, racial features, and gender, as well as intra-class variations like illumination, disguise, expressions, and occlusion, remains a challenge. While deep learning has shown promising results in various domains, including facial recognition, training these models requires extensive datasets, posing considerable challenges.
The key issue for deep learning-based facial recognition systems is the scarcity of training samples. Often, there is only a single training sample per person (SSPP). This scenario becomes more challenging when attempting to recognize a query image under varying conditions such as occlusion, illumination, pose, and expression, with only one image per subject in the gallery set. Consequently, recognition systems struggle to identify the facial variations due to the limited data availability. Two fundamental approaches are typically employed by researchers to address the SSPP issue: holistic methods and local methods. Holistic methods include subspace models such as Principal Component Analysis (PCA) [5] and Linear Discriminant Analysis (LDA) [6]. However, these methods rely on large amounts of training data, which is precisely what the SSPP scenario lacks.
Generative models play a crucial role when training data is scarce or difficult to obtain. A key generative model approach is the Generative Adversarial Network (GAN) [
7]; it comprises two competing neural networks, a generator and a discriminator [8], trained in an adversarial manner: the generator receives random noise as input and produces synthetic samples designed to mimic real data, while the discriminator is trained to distinguish these generated samples from actual training data samples.
During the adversarial training process, the generator aims to produce increasingly realistic fake samples to deceive the discriminator, while the discriminator aims to enhance its ability to differentiate these generated samples from actual data. The generator and discriminator are involved in a min–max game where the generator seeks to minimize the discriminator’s ability to detect fake samples, while the discriminator aims to maximize its accuracy in classifying the samples as either real or fake. This iterative process results in the generator gradually learning to create synthetic data that become indistinguishable from real data for the discriminator. The trained generator generates new realistic samples aimed at extending the limited training data.
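To make this min–max interplay concrete, the following sketch shows one adversarial training step in PyTorch. It is a minimal illustration of the generic procedure described above rather than the authors' implementation; the tiny fully connected networks, the 64 × 64 image size, and the optimizer settings are placeholder assumptions.

```python
import torch
import torch.nn as nn

# Placeholder networks; any generator/discriminator with these interfaces works.
G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 64 * 64), nn.Tanh())
D = nn.Sequential(nn.Linear(64 * 64, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
bce = nn.BCEWithLogitsLoss()

def train_step(real_images):
    """One adversarial step. real_images: (batch, 64*64) flattened face images."""
    batch = real_images.size(0)
    noise = torch.randn(batch, 100)

    # 1) Discriminator step: push D(real) toward "real" and D(G(noise)) toward "fake".
    fake_images = G(noise).detach()          # detach so G is not updated in this step
    d_loss = bce(D(real_images), torch.ones(batch, 1)) + \
             bce(D(fake_images), torch.zeros(batch, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 2) Generator step: try to make D label freshly generated samples as "real".
    g_loss = bce(D(G(noise)), torch.ones(batch, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```

Alternating these two updates is what drives the generator toward samples the discriminator can no longer separate from real data.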
In this work, we present a novel approach capable of generating six different facial expressions from a single input image of a neutral face. Our approach utilizes a Conditional Generative Adversarial Network (CGAN) [
9] specifically designed to synthesize highly realistic facial variations while conditioning on the provided neutral face image. To evaluate the efficacy of our technique for synthetic image generation, we fine-tuned and trained several CNN models (VGG-Face, ResNet-50, FaceNet, and DeepFace) along with a custom CNN model on the generated images and then evaluated their performance on the corresponding real images. The CNN models trained on our synthesized images achieved 99% accuracy, demonstrating the realism and diversity of our generated facial expressions.
Furthermore, we explored the impact of our model on the challenging single sample per person (SSPP) problem faced in face recognition tasks. Initially, by training and fine-tuning the CNN models solely on real-world single neutral face images per person, the accuracy was approximately 76%. However, by augmenting the dataset with six synthetic expression images generated by our CGAN for each person, the accuracy of the CNN models improved to 99%. To further validate the robustness of our approach, we also generated realistic facial expressions for random neutral face images obtained from the Internet, as depicted in Figures 13 and 14, confirming the generalization capability of our CGAN model beyond the training data. Our contributions are listed as follows:
We propose a novel Conditional Generative Adversarial Network (CGAN) that generates six diverse facial expressions from a single input image of a neutral face.
We address the data scarcity problem of the single sample per person (SSPP) by generating multiple facial expressions for each individual. This enhances the training data for facial recognition systems and improves their performance.
We improve the accuracy of the single-sample face recognition system from 76% to 99% when trained with our generated images.
We validate the generalization ability of the CGAN model by generating facial expressions from randomly selected neutral face images outside the dataset.
The rest of the paper is organized as follows: Section 2 reviews related work on the SSPP problem and image generation methods; Section 3 presents the proposed model architecture and how it works; Section 4 describes the dataset; Section 5 reports the experimental results; and Sections 6–8 discuss the results, outline future work, and conclude the paper.
2. Related Work
In the literature, the facial recognition process is broadly divided into two categories: the holistic feature extraction approach and the local feature-based approach, with the latter reported to outperform the former by an accuracy margin of 60% [10]. The recent literature has reported state-of-the-art work on facial recognition using CNNs. For instance, the paper [
11] used CNNs to address facial recognition under both multiple samples and a single sample per person. For multiple samples per person, they learned the activation vector from the fully connected layer of a Visual Geometry Group (VGG) face feature extractor, normalized it with L2 normalization, and finally applied LDA. In ref. [
12], the authors proposed a method that uses the local binary patterns as the CNN’s input and then uses softmax regression function for face recognition. In the paper [
13], the author presented a joint collaborative representation with an adaptive convolutional feature for a Single-Sample Problem (SSP). The method extracts the local regions of the query image using CNN. These regions have similar coefficients, which help preserve all the local discriminative features and are robust to different facial variations. In the paper [
14], the authors proposed a supervised autoencoder technique in which they mapped multiple versions of a face image to a canonical face and extracted features that preserved a similarity criterion. However, they used only a few training images of size 32 × 32, which do not effectively capture the facial variations of a large set of query samples.
In the paper [
15], the authors propose a sparse illumination transfer method that creates an illumination dictionary. This method focuses only on handling illumination changes on the frontal face image and does not capture the actual shape of the face through its global and local features. In [16,17], the authors proposed a patch-based method that extracts the patch-based distribution of the training image and uses a voting strategy to calculate the distances between the patches and the patch manifolds. In [
18], the authors applied the divide and conquer strategy. They divided the face image into a set of non-overlapping blocks. Each block was further divided into overlapped patches, assumed to lie in a linear subspace. In [
19,20], the authors proposed a single sample per person domain adaptation network technique. The method assumed that the images in the gallery set were captured under stable shooting conditions. To address the lack of training samples, they used a 3D face model to generate synthetic images covering different pose variations.
In the paper [
21], the authors constructed feature dictionaries using both training and test images for under-sampled face recognition in IoT-based applications. In [22], the authors proposed an iterative dynamic generic learning method that incorporates a semi-supervised low-rank representation framework for prototype recovery to learn a variation dictionary for the SSPP problem. In the paper [
23], the authors combined grayscale monogenic features and kernel sparse representation on multiple Riemannian manifolds. The approach extracts regional face discriminability and co-occurrence distributions, fusing multiple kernels using kernel alignment and normalization. In the paper [24], the authors propose a binary coding method based on the self-organizing map. The method combines the self-organizing map with the bag-of-features model to extract mid-level semantic features from facial images. They also utilize a SIFT descriptor to obtain local features and then map them into a semantic space using the self-organizing map.
In the paper [
25], the author proposed an approach for 2D pose-variant face recognition using the frontal view. The method calculates a pose view angle and matches the test images with rotated canonical face images. The test face is then warped to create a frontal view using landmark features. The distortion is also addressed by applying prior facial symmetry assumptions before matching with the frontal face. In the paper [
26], the authors proposed an integrated approach combining a feature pyramid and a triplet loss technique. Their approach reduces computation by sharing the same backbone network. In the paper [
27], the author proposed a probabilistic interpretable comparison approach that is a scoring approach proposed for biometric systems. The approach combines multiple samples using Bayes theorem and offers a probabilistic interpretation of decision correctness.
These approaches mainly generate virtual samples to address the SSPP problem; however, they do not effectively handle the facial variations, such as anger, smiling, occlusion, or other expressions, that may appear in the query image, and SSPP therefore remains an open research challenge.
Generative Adversarial Networks (GANs) are widely employed for image synthesis tasks due to their ability to generate highly realistic images [
28]. Several GAN architectures have been proposed, such as the Super Resolution Generative Adversarial Network (SRGAN) [
29,
30] for enhancing the image resolution. Age-conditional GAN [
31] for the age progression of face images by utilizing the age information. Disentangled Representation Learning GAN (DR-GAN) [
32] aimed to learn disentangled latent representations of images that are selectively modified to achieve desired transformations like pose changes or expression editing.
The use of Conditional GANs (CGANs) [
33], which enable the generation of images based on the input data such as class labels or images, is relevant to our work. Tasks that include image-to-image translation [
34] and future frame prediction [
35] have been investigated with CGANs. Our approach builds on the CGAN framework, conditioning the generator on a single input neutral face image to synthesize multiple realistic facial expressions.
3. Proposed Model
The overall approach is described in
Figure 1. We form training image pairs from 80% of the dataset images, defining a training direction from the neutral image A to the expression image B so that each individual expression is learned on the neutral face. During training, the generator receives the neutral image and a noise vector as input and produces a synthetic expression image. Meanwhile, the discriminator, which is provided with the original expression images, aims to distinguish these synthetic images from the real expression images. The generator's objective is to create highly realistic synthetic expressions that can deceive the discriminator. We train six separate Conditional GANs, each conditioned on the neutral input image, to generate one of the six target expressions.
For testing, we use the remaining 20% of the images, pairing each neutral face image (candidate) with an expression reference image from any individual. Each trained CGAN then generates the corresponding expression. To rigorously evaluate our approach, we tested it using several pre-trained facial recognition models (VGG-Face, ResNet-50, FaceNet, and DeepFace) as well as a custom CNN model that we developed. We formed a training set from the synthetic facial expression images generated by our CGANs and used this set to train the CNN models. Subsequently, we tested these CNN models on the original real images of the respective individuals and calculated the softmax scores to assess the realism of the generated expressions.
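The evaluation protocol can be summarized schematically as follows. This is an illustrative Python sketch, not the authors' code; `generators`, `neutral_faces`, and `cnn.predict_proba` are hypothetical stand-ins for the six trained CGAN generators, the held-out neutral images, and a trained classifier.

```python
# Schematic of the evaluation protocol: build a synthetic training set from the
# six trained generators, then test the classifier on the real held-out images.

def build_synthetic_training_set(generators, neutral_faces):
    """Generate six expression images per subject from one neutral image."""
    images, labels = [], []
    for subject_id, neutral in neutral_faces.items():
        for expression, G in generators.items():   # smile, anger, stare, ...
            images.append(G(neutral))              # synthetic expression image
            labels.append(subject_id)
    return images, labels

def evaluate(cnn, real_test_images, real_test_labels):
    """Test a CNN trained on synthetic images against the original real images."""
    correct = 0
    for image, label in zip(real_test_images, real_test_labels):
        probs = cnn.predict_proba(image)           # softmax scores over subjects
        correct += int(probs.argmax() == label)
    return correct / len(real_test_labels)
```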
The CGAN [
34] model comprises two networks, the generator and the discriminator, as described earlier. However, a CGAN differs slightly from a standard GAN in that it takes its input in pairs. In a CGAN, the generator processes an input image through various convolutional filters to produce an output image. Conversely, the discriminator evaluates pairs of images: one pairing the input image with the target original image and another pairing the same input image with the generator's output image. It compares both pairs and distinguishes between the real and generated images. Based on this comparison, the generator's weights are adjusted during training to enhance the quality of the image it produces in the next iteration. These iterations continue until the generator creates synthetic images so realistic that the discriminator can no longer differentiate them from the real images.
3.1. Proposed Method
GANs are generative models that learn a mapping from a random noise vector $c$ to an output image $b$, denoted as $G: c \rightarrow b$. In contrast, conditional GANs learn a mapping from a given observed image $a$ and a random noise vector $c$ to $b$, expressed as $G: \{a, c\} \rightarrow b$. The generator $G$ is trained to produce outputs indistinguishable from real images by an adversarially trained discriminator $D$, which is trained to excel at identifying the generator's "fakes".
3.2. Objective Function
In a conditional Generative Adversarial Network (CGAN), the objective is to train a generator ($G$) to produce realistic images conditioned on given inputs. These inputs include an observed image ($a$) and/or a random noise vector ($c$). Simultaneously, the discriminator ($D$) is trained to differentiate between real images and those produced by the generator.
The objective function of a conditional GAN is described as follows:
$$\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{a,b}\left[\log D(a, b)\right] + \mathbb{E}_{a,c}\left[\log\left(1 - D(a, G(a, c))\right)\right],$$
where $\mathbb{E}_{a,b}$ represents the expectation over pairs of real images ($a$) and their corresponding labels ($b$), and $\mathbb{E}_{a,c}$ denotes the expectation over pairs of real images ($a$) and generated images $G(a, c)$ conditioned on the random noise vector ($c$).
The generator ($G$) aims to minimize the loss function, whereas the discriminator ($D$) seeks to maximize it. Hence, the optimal generator ($G^{*}$) is obtained by solving the minimax problem:
$$G^{*} = \arg\min_{G}\max_{D} \mathcal{L}_{cGAN}(G, D).$$
In practice, the $L_{1}$ distance measure is often preferred over $L_{2}$ because it results in less blurring in the generated images. Therefore, the final objective of training a conditional GAN includes the $L_{1}$ distance term, given by
$$G^{*} = \arg\min_{G}\max_{D} \mathcal{L}_{cGAN}(G, D) + \lambda \, \mathcal{L}_{L1}(G), \qquad \mathcal{L}_{L1}(G) = \mathbb{E}_{a,b,c}\left[\lVert b - G(a, c) \rVert_{1}\right],$$
where $\lambda$ is a hyper-parameter controlling the importance of the $L_{1}$ loss term.
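A compact way to express this combined objective in code is shown below, loosely following the pix2pix formulation [34]. The networks `G` and `D`, the noise handling, and the weight λ = 100 are illustrative assumptions; the paper does not report its exact λ.

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()   # adversarial term on the discriminator logits
l1 = nn.L1Loss()               # reconstruction term that reduces blurring
lam = 100.0                    # weight of the L1 term (illustrative value)

def cgan_losses(G, D, a, b, c):
    """a: neutral input image, b: real expression image, c: noise vector."""
    fake_b = G(a, c)

    # Discriminator sees (condition, image) pairs and labels them real or fake.
    d_real = D(a, b)
    d_fake = D(a, fake_b.detach())
    d_loss = bce(d_real, torch.ones_like(d_real)) + \
             bce(d_fake, torch.zeros_like(d_fake))

    # Generator tries to fool D and to stay close to the target image in L1.
    d_fake_for_g = D(a, fake_b)
    g_loss = bce(d_fake_for_g, torch.ones_like(d_fake_for_g)) + lam * l1(fake_b, b)
    return d_loss, g_loss
```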
3.3. Proposed Generator and Discriminator Network Architecture
The proposed Conditional Generative Adversarial Network (CGAN) model is composed of two main components: a generator and a discriminator, as depicted in
Figure 2 and
Figure 3. The generator network comprises an encoder and a decoder. The encoder begins by accepting an image as input and processing it through a series of convolutional layers that progressively down-sample the image. This transformation is illustrated in
Figure 3.
The skip connection layer bridges the encoder and decoder and passes the convolved features to the decoder. This allows the decoder to upsample the image from the features extracted by the encoder. The resulting image is then fed into the discriminator network. The discriminator takes image pairs as input: the input image paired with either the real target image or the output image produced by the generator. As depicted in
Figure 2, the discriminator network itself functions as an encoder that down-samples the paired images through a series of convolutional layers, each followed by max pooling. The stride is set to 1 to reduce the image size only slightly. It then assesses the extracted features from both images to determine the authenticity of the generated image. The discriminator updates its weights based on the classification error between the generated and original images. Similarly, the weights of the generator are adjusted according to the disparities between the generated and target images.
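For illustration, the sketch below outlines an encoder-decoder generator with a single skip connection and a down-sampling discriminator that consumes channel-wise concatenated image pairs. Layer counts, channel widths, and kernel sizes are assumptions chosen for brevity, and strided convolutions stand in for the pooling layers described above; the sketch does not reproduce Figures 2 and 3 exactly.

```python
import torch
import torch.nn as nn

class SketchGenerator(nn.Module):
    """Encoder-decoder with one skip connection (U-Net style) for 3-channel images."""
    def __init__(self):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2))
        self.enc2 = nn.Sequential(nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2))
        self.dec1 = nn.Sequential(nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU())
        # The skip connection concatenates encoder features with decoder features.
        self.dec2 = nn.ConvTranspose2d(64 + 64, 3, 4, stride=2, padding=1)

    def forward(self, x):
        e1 = self.enc1(x)                          # down-sample
        e2 = self.enc2(e1)                         # down-sample further
        d1 = self.dec1(e2)                         # up-sample
        return torch.tanh(self.dec2(torch.cat([d1, e1], dim=1)))  # skip + up-sample

class SketchDiscriminator(nn.Module):
    """Down-samples a (condition, image) pair and outputs a real/fake logit map."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(128, 1, 4, stride=1, padding=1))

    def forward(self, condition, image):
        return self.net(torch.cat([condition, image], dim=1))  # channel-wise pairing
```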
3.4. Conditional Generative Adversarial Network (CGAN)
The objective is to develop a model that generates realistic synthetic images using a CGAN; therefore, choosing appropriate hyper-parameters is crucial. In general, CGANs are hard to train because they suffer from several issues, such as mode collapse. A GAN consists of two competing CNN-based networks involved in a min–max game, each striving to outperform the other. Initially, the generator often struggles to produce convincing images, allowing the discriminator to easily detect them as fakes. In some instances, the generator may fail to produce any realistic images at all; in others, it learns to deceive the discriminator, blurring the distinction between real and synthetic images. Therefore, setting the right hyper-parameters from the beginning is crucial for effective training.
We selected appropriate hyper-parameters after extensive experimentation involving a trial-and-error approach. Given that a CGAN learns effectively from paired images, we formed various image pairs to facilitate learning of the desired mapping. To cover a comprehensive range of facial variations, we trained six distinct CGANs, each learning one of the following facial expressions: smile, anger, disguise, illumination, stare, and wearing glasses.
3.5. Image Pairs Formation for Training and Testing
We allocated 80% of the dataset, comprising 107 subjects, for training, randomly separating these images from the rest. The remaining 20% of the images were reserved for testing the performance of the CGANs and CNNs, respectively. The neutral face of each subject was paired with images showcasing six different facial variations: smile, anger, disguise, glasses, illumination, and occlusion. These pairs were used to help each CGAN learn its specific facial variation. The training pairs are depicted in
Figure 4.
The 20% testing set comprises 27 subjects, each represented by a single frontal neutral face image. For testing, image pairs were formed to evaluate each of the six CGANs independently. These pairs were formed by matching each neutral single-sample-per-person image with facial expression images (i.e., smile, anger, stare, occlusion, illumination, glasses) randomly selected from other subjects. This setup was used to generate six facial variations for each subject's neutral face image, as depicted in
Figure 5.
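A minimal sketch of how such training and testing pairs could be assembled is given below. The subject dictionaries and expression keys are hypothetical, and the scarf/disguise/occlusion variation is represented here by a single "scarf" key following Table 1.

```python
import random

EXPRESSIONS = ["smile", "anger", "stare", "scarf", "illumination", "glasses"]

def training_pairs(train_subjects):
    """Pair each training subject's neutral face with their own expression images."""
    pairs = {expr: [] for expr in EXPRESSIONS}
    for s in train_subjects:                # s is a dict of image paths (hypothetical)
        for expr in EXPRESSIONS:
            pairs[expr].append((s["neutral"], s[expr]))   # (input A, target B)
    return pairs

def testing_pairs(test_subjects, reference_subjects):
    """Pair each test subject's neutral face with expression images randomly drawn
    from other identities (identity-independent references)."""
    pairs = {expr: [] for expr in EXPRESSIONS}
    for s in test_subjects:
        for expr in EXPRESSIONS:
            ref = random.choice(reference_subjects)
            pairs[expr].append((s["neutral"], ref[expr]))
    return pairs
```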
3.6. Hyper-Parameter Setup
Optimizing the correct set of hyper-parameters was crucial from the outset of our study. We conducted over 250 extensive experiments using a grid search methodology to methodically explore a wide range of values for each hyper-parameter. This systematic approach, detailed in
Table 1, allowed us to vary one hyper-parameter at a time while maintaining baseline values for the others, thereby isolating the effect of each parameter on model performance. Notably, our testing established an initial learning rate of 0.0002 (Table 1), chosen for its effectiveness in balancing training stability and speed. Additionally, the number of epochs, as documented in
Table 1, was iteratively adjusted based on ongoing experimental results to optimize model performance without over-fitting. This process ensures that our model achieves the best possible accuracy while maintaining generalizability across various datasets.
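The one-at-a-time sweep can be expressed as a simple loop over candidate values. The value ranges and the `train_and_score` routine below are placeholders and do not reproduce the exact grid behind Table 1.

```python
# Baseline configuration and candidate values (illustrative, not the exact grid).
baseline = {"lr": 2e-4, "batch_size": 1, "epochs": 650}
grid = {
    "lr": [1e-3, 2e-4, 1e-4],
    "batch_size": [1, 4, 8],
    "epochs": [300, 650, 1000],
}

def sweep_one_at_a_time(train_and_score):
    """Vary one hyper-parameter at a time, holding the others at their baselines."""
    results = {}
    for name, values in grid.items():
        for value in values:
            config = dict(baseline, **{name: value})
            results[(name, value)] = train_and_score(config)  # e.g., validation score
    return results
```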
4. Dataset
We used the AR dataset [
36] for our experimentation, which comprises raw frontal face images featuring various expressions. The dataset includes a total of 124 subjects (70 male and 54 female) with 3300 images. The images were captured across different sessions under varying illumination conditions and with partial occlusions. We extracted six facial variations per person from this dataset: smile, anger, stare, illumination, wearing glasses, and wearing a scarf.
We randomly divided the dataset into two parts: 80% for training and 20% for testing. The training portion, comprising 80% of images, was used to train our Conditional Generative Adversarial Network (CGAN). This enabled the CGAN to learn the generation of six distinct facial variations from the neutral face of each person. The remaining 20%, which included neutral faces of several individuals, were completely withheld from the CGAN during the training phase. This test set was used to evaluate the generalizability and robustness of our model against new, unseen data.
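Because the split is made at the subject level, the test identities remain entirely unseen by the CGANs during training. A minimal sketch of such a split is shown below; the data structures are hypothetical.

```python
import random

def split_by_subject(subject_ids, train_fraction=0.8, seed=0):
    """Randomly assign whole subjects (not individual images) to train or test."""
    ids = list(subject_ids)
    random.Random(seed).shuffle(ids)
    cut = int(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]          # (training subjects, testing subjects)
```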
During testing, the CGAN utilized the withheld neutral face images to generate the six facial variations. Additionally, we retained the original facial variation images from the 20% set to later assess the performance of the CNN models. We focused our experimentation on the AR dataset to demonstrate the potential of our method for transfer learning and its ability to produce synthetic images for any given neutral image, as depicted in Figures 13 and 14.
5. Results and Discussion
As mentioned previously, the allocated 20% of the dataset was used to test our Conditional Generative Adversarial Network (CGAN), as detailed in
Figure 5. This section presents the outcomes of these tests, in which six distinct facial expressions were successfully generated from a single neutral face image, as showcased in
Figure 6.
During the testing phase, each facial variation image is paired with a neutral face image and then fed into the CGAN as input. It is important to note that the facial variation images used for pairing are not specific to any identity or gender, underscoring the broad applicability of our method. This allows any image, regardless of the individual's identity or gender, to be effectively paired with a neutral face. By leveraging such pairing, the CGAN generates the corresponding variations on the neutral face images, demonstrating its robustness and versatility.
In
Figure 7, we demonstrate the effectiveness of our model by directly comparing the generated images with the original images of the same individuals. These comparisons reveal that the facial variations in the generated images closely resemble those in the original images, highlighting the high fidelity of our synthetic image generation. To further assess the robustness of our model, particularly in terms of its adaptability to variations in skin tone and gender, we refer to
Figure 8 and
Figure 9. In these figures, we selected neutral face images of individuals with darker skin tones and paired them with reference images. It is noteworthy that the reference image is provided only to complement the overall working of the conditional GAN, which requires images as input pairs. The model generates six facial variations on the neutral faces of both male and female individuals, demonstrating its capability to handle diverse facial features effectively.
5.1. Training CNN Models on Generated Images
After training the six Conditional Generative Adversarial Networks (CGANs) and generating the synthetic images, we employ several CNN models to evaluate their quality. Specifically, we employ state-of-the-art models such as VGG-Face, ResNet-50, FaceNet, and DeepFace, along with our custom CNN model. For the custom CNN model, all input images are standardized to a fixed resolution with three color channels (RGB). The CNN processes the input image starting with the first layer, which consists of 32 convolutional filters.
A series of convolutional filters with varying numbers (32, 64, 128, and 256) are applied in sequential pairs to extract features at different scales. After each pair of convolutional filters, a max pooling layer is used to reduce the spatial dimensions of the feature maps. Following the convolutional and max pooling layers, a fully connected layer is added to integrate the learned features for further processing.
The network architecture includes two dense layers, each comprising 4096 nodes, to process the features more deeply. Dropout is employed as a regularization strategy to mitigate over-fitting, with a dropout rate designed to prevent the network from becoming overly reliant on any single feature. Additionally, to ensure the stability of the learning process and prevent overfitting, a momentum of 0.5 is maintained during training.
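A possible PyTorch rendering of this custom CNN is sketched below. The 3 × 3 kernels, 2 × 2 pooling windows, 128 × 128 input resolution, 0.5 dropout rate, and learning rate are our assumptions where the text does not state exact values; the momentum of 0.5 is interpreted here as SGD momentum.

```python
import torch
import torch.nn as nn

def conv_pair(in_ch, out_ch):
    """Two convolutions (3x3 assumed) followed by 2x2 max pooling (assumed)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(kernel_size=2))

class CustomCNN(nn.Module):
    def __init__(self, num_subjects, input_size=128):   # input size is an assumption
        super().__init__()
        self.features = nn.Sequential(
            conv_pair(3, 32), conv_pair(32, 64),
            conv_pair(64, 128), conv_pair(128, 256))
        feat = input_size // 16                          # four 2x2 poolings halve the size 4 times
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256 * feat * feat, 4096), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(4096, num_subjects))               # softmax is applied in the loss

    def forward(self, x):
        return self.classifier(self.features(x))

# 27 classes correspond to the test subjects; SGD momentum 0.5 as stated in the text,
# learning rate is an assumption.
model = CustomCNN(num_subjects=27)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.5)
```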
5.2. Testing the CNN Models on Original Images Containing Variations
After fine-tuning the pre-trained models such as VGG-Face, ResNet-50, FaceNet, and DeepFace and training our custom model on the generated samples, we observed the outcomes. The training and validation accuracies, along with the training and validation losses of our custom model, are detailed in
Figure 10. We then tested the performance of these models on the original images retained from the dataset (as discussed in
Section 4), which are depicted in
Figure 11. These segregated images serve as a benchmark for evaluating the effectiveness of our CGAN-generated samples.
5.3. Comparative Analysis of CGAN Generated Samples vs. Single Image per Person Using CNN Models
We evaluated the effectiveness of our CGAN-generated samples against single images per person using several pre-trained facial recognition models (VGG-Face, ResNet-50, FaceNet, and DeepFace) as well as a custom CNN model. When trained solely on single-sample neutral face images, these models achieved a testing accuracy of only about 76%, highlighting the challenge posed by the SSPP problem. However, after fine-tuning these models on the synthetic facial variations generated by our CGAN, their accuracy increased to 99%, as detailed in
Table 2. This significant improvement highlights the efficacy of our proposed method in overcoming the data scarcity issue associated with SSPP.
5.4. Comparison between Our Method and State-of-the-Art Methods
In this study, we conducted a comprehensive benchmark comparison with the current state-of-the-art (SOTA) facial recognition methods. We specifically compared our approach to the techniques presented in [
19,
20], which are known for their high accuracy under controlled shooting conditions. However, these methods often underperform in less stable environments, a significant aspect that is critical for practical applications. For instance, the method described in [
19] is highly dependent on specific lighting and facial positioning, reducing its robustness to variable conditions typically encountered in real-world settings.
Furthermore, the methods described in [17,26] show a significant decline in performance when confronted with complex facial variations and variable conditions such as anger, disguise, and changes in illumination. These challenges highlight the need for models capable of maintaining high accuracy across diverse conditions. Our approach attempts to bridge this gap, as evidenced by the comparative analysis presented in
Figure 12.
In addition to this analysis, further comparative results are detailed in
Table 2. This table evaluates the performance of several state-of-the-art models, including VGG-Face [
37], ResNet-50 [
38], FaceNet [
39], and DeepFace [
40], with a specific focus on addressing the challenges posed by the SSPP problem. We also maintained consistent dataset configurations across all tests, as described in
Section 4 of the manuscript. Overall, our proposed model demonstrates higher accuracy in both scenarios when trained with and without the generated images.
Initial tests on our SSPP dataset revealed a significant decrease in recognition accuracy, even for pretrained state-of-the-art models, as shown in
Table 2. However, subsequent fine-tuning of these models with our CGAN-generated images led to notable improvements in accuracy. This not only underscores the effectiveness of our CGAN model in generating high-quality synthetic samples but also demonstrates its utility in enhancing the performance of existing algorithms under SSPP conditions.
Our approach not only surpasses these previous techniques in recognition accuracies but also showcases a significant advantage in terms of dataset versatility. Unlike prior methods that often rely on large, diverse, and specific datasets for training, our model exhibits dataset independence. After initial training on the AR dataset, our system can generate synthetic images encompassing a wide range of facial expressions and variations from just a single neutral expression. This capability not only enhances the robustness of our model but also provides a solution for mitigating the common data scarcity in facial recognition research.
5.5. Testing the Generalization Ability of the Model
To test the generalization ability of the model, we sourced several neutral face images from the internet and paired them with reference variation images from our testing data. The internet-sourced images differed significantly from those in the AR dataset used to train the model in aspects such as orientation, distance from the camera, and general capture settings. The robustness of our method in handling these variations is depicted in
Figure 13 and
Figure 14.
Figure 13. Smile expression generation on randomly sampled images outside the dataset.
Figure 14. Stare expression generation on randomly sampled images outside the dataset.
6. Discussion on Results and Advantages of the Approach
The application of Conditional Generative Adversarial Networks (CGANs) to generate synthetic facial images has proven highly effective for enriching the datasets with diverse facial variations, overcoming the key challenge of data scarcity often encountered in SSPP scenarios. The model exhibits robust performance against variability by accurately generating facial expressions across a range of lighting conditions and disguises. Furthermore, the model also demonstrates dataset independence; once trained on the baseline dataset, it can produce accurate variations from a single neutral image, thereby potentially eliminating the need for extensive and diverse training datasets. The proposed model not only matches but also surpasses various state-of-the-art methods, particularly in its ability to handle complex facial expressions and adapt to uncontrolled shooting conditions.
7. Future Work
In future work, we aim to further enhance the quality of the generated images at higher resolutions and improve their realism. To achieve this, we will explore and implement super-resolution techniques within the generative models. Additionally, we plan to expand the model's ability to produce facial expressions from different viewing angles, such as side and angled views, to address real-world application challenges. To further minimize artifacts in the generated images, we intend to integrate an artifact detection mechanism directly into the CGAN training process or to develop a post-processing module that refines the output images.
8. Conclusions
Generative Adversarial Networks have shown outstanding results in addressing data scarcity issues, particularly in real-world problems such as the single sample per person (SSPP) problem. Our research is dedicated to addressing the fundamental challenges associated with the SSPP problem. In real-world scenarios, many situations require reliable face recognition from a single image; this is a critical issue in fields such as information security, biometrics, access control, and various other identification applications. To address the challenge posed by SSPP, we fine-tuned and implemented six Conditional Generative Adversarial Networks (CGANs), each designed to generate a specific facial variation (smile, anger, stare, illumination, glasses, and occlusion) from a single input neutral face image. We evaluated our method's effectiveness with several pre-trained facial recognition models (VGG-Face, ResNet-50, FaceNet, and DeepFace) as well as a custom convolutional neural network (CNN) model. Initially, these models achieved only around 76% accuracy when tested on single-sample neutral face images, emphasizing the difficulty of the SSPP problem. However, after fine-tuning with the synthetic facial variations generated by our CGANs from those single neutral images, their accuracy increased significantly to 99%. Our proposed model demonstrates the capability of transfer learning, enabling it to generate facial expressions on any set of neutral face images beyond the training data.
Author Contributions
Conceptualization, M.A.I. and W.J.; methodology, M.A.I. and W.J.; software, M.A.I. and W.J.; validation, M.A.I., W.J. and S.K.K.; formal analysis, M.A.I., W.J. and S.K.K.; investigation, S.K.K.; resources, M.A.I., W.J. and S.K.K.; data curation, M.A.I.; writing—original draft preparation, M.A.I.; writing—review and editing, M.A.I.; visualization, M.A.I.; supervision, W.J. and S.K.K.; project administration, W.J.; funding acquisition, S.K.K. All authors have read and agreed to the published version of the manuscript.
Funding
This research was supported by “Regional Innovation Strategy (RIS)” through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (MOE) (2023RIS-009).
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
No new data were created or analyzed in this study. Data sharing is not applicable to this article.
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Woubie, A.; Solomon, E.; Attieh, J. Maintaining Privacy in Face Recognition using Federated Learning Method. IEEE Access 2024, 12, 39603–39613. [Google Scholar] [CrossRef]
- Saadabadi, M.S.E.; Malakshan, S.R.; Zafari, A.; Mostofa, M.; Nasrabadi, N.M. A quality aware sample-to-sample comparison for face recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 6129–6138. [Google Scholar]
- Liu, F.; Chen, D.; Wang, F.; Li, Z.; Xu, F. Deep learning based single sample face recognition: A survey. Artif. Intell. Rev. 2023, 56, 2723–2748. [Google Scholar] [CrossRef]
- Ríos-Sánchez, B.; Costa-da Silva, D.; Martín-Yuste, N.; Sánchez-Ávila, C. Deep learning for facial recognition on single sample per person scenarios with varied capturing conditions. Appl. Sci. 2019, 9, 5474. [Google Scholar] [CrossRef]
- Kortli, Y.; Jridi, M.; Al Falou, A.; Atri, M. Face recognition systems: A survey. Sensors 2020, 20, 342. [Google Scholar] [CrossRef]
- Benouareth, A. An efficient face recognition approach combining likelihood-based sufficient dimension reduction and LDA. Multimed. Tools Appl. 2021, 80, 1457–1486. [Google Scholar] [CrossRef]
- Trevisan de Souza, V.L.; Marques, B.A.D.; Batagelo, H.C.; Gois, J.P. A review on Generative Adversarial Networks for image generation. Comput. Graph. 2023, 114, 13–25. [Google Scholar] [CrossRef]
- Creswell, A.; White, T.; Dumoulin, V.; Arulkumaran, K.; Sengupta, B.; Bharath, A.A. Generative adversarial networks: An overview. IEEE Signal Process. Mag. 2018, 35, 53–65. [Google Scholar] [CrossRef]
- Hanano, T.; Seo, M.; Chen, Y.W. An improved cgan with self-supervised guidance encoder for generation of high-resolution facial expression images. In Proceedings of the 2023 IEEE International Conference on Consumer Electronics (ICCE), Las Vegas, NV, USA, 6–8 January 2023; pp. 1–4. [Google Scholar]
- Adjabi, I.; Ouahabi, A.; Benzaoui, A.; Taleb-Ahmed, A. Past, present, and future of face recognition: A review. Electronics 2020, 9, 1188. [Google Scholar] [CrossRef]
- Sahan, J.M.; Abbas, E.I.; Abood, Z.M. A facial recognition using a combination of a novel one dimension deep CNN and LDA. Mater. Today Proc. 2023, 80, 3594–3599. [Google Scholar] [CrossRef]
- Vu, H.N.; Nguyen, M.H.; Pham, C. Masked face recognition with convolutional neural networks and local binary patterns. Appl. Intell. 2022, 52, 5497–5512. [Google Scholar] [CrossRef]
- Yang, M.; Wang, X.; Zeng, G.; Shen, L. Joint and collaborative representation with local adaptive convolution feature for face recognition with single sample per person. Pattern Recognit. 2017, 66, 117–128. [Google Scholar] [CrossRef]
- Gao, S.; Zhang, Y.; Jia, K.; Lu, J.; Zhang, Y. Single sample face recognition via learning deep supervised autoencoders. IEEE Trans. Inf. Forensics Secur. 2015, 10, 2108–2118. [Google Scholar] [CrossRef]
- Abdelmaksoud, M.; Nabil, E.; Farag, I.; Hameed, H.A. A novel neural network method for face recognition with a single sample per person. IEEE Access 2020, 8, 102212–102221. [Google Scholar] [CrossRef]
- Gao, S.; Jia, K.; Zhuang, L.; Ma, Y. Neither global nor local: Regularized patch-based representation for single sample per person face recognition. Int. J. Comput. Vis. 2015, 111, 365–383. [Google Scholar] [CrossRef]
- Dang, T.V. Smart attendance system based on improved facial recognition. J. Robot. Control (JRC) 2023, 4, 46–53. [Google Scholar] [CrossRef]
- Liu, F.; Tang, J.; Song, Y.; Xiang, X.; Tang, Z. Local structure based sparse representation for face recognition with single sample per person. In Proceedings of the 2014 IEEE International Conference on Image Processing (ICIP), Paris, France, 27–30 October 2014; pp. 713–717. [Google Scholar]
- Hong, S.; Im, W.; Ryu, J.; Yang, H.S. Sspp-dan: Deep domain adaptation network for face recognition with single sample per person. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 825–829. [Google Scholar]
- Kumar, K.V.; Teja, K.A.; Bhargav, R.T.; Satpute, V.; Naveen, C.; Kamble, V. One-shot face recognition. In Proceedings of the 2023 2nd International Conference on Paradigm Shifts in Communications Embedded Systems, Machine Learning and Signal Processing (PCEMS), Nagpur, India, 5–6 April 2023; pp. 1–6. [Google Scholar]
- Yang, S.; Wen, Y.; He, L.; Zhou, M. Sparse Common Feature Representation for Undersampled Face Recognition. IEEE Internet Things J. 2021, 8, 5607–5618. [Google Scholar] [CrossRef]
- Pang, M.; Cheung, Y.M.; Shi, Q.; Li, M. Iterative Dynamic Generic Learning for Face Recognition From a Contaminated Single-Sample Per Person. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 1560–1574. [Google Scholar] [CrossRef]
- Zou, J.; Zhang, Y.; Liu, H.; Ma, L. Monogenic features based single sample face recognition by kernel sparse representation on multiple riemannian manifolds. Neurocomputing 2022, 504, 82–98. [Google Scholar] [CrossRef]
- Liu, F.; Wang, F.; Ding, Y.; Yang, S. SOM-based binary coding for single sample face recognition. J. Ambient Intell. Humaniz. Comput. 2022, 13, 5861–5871. [Google Scholar] [CrossRef]
- Petpairote, C.; Madarasmi, S.; Chamnongthai, K. 2d pose-invariant face recognition using single frontal-view face database. Wirel. Pers. Commun. 2021, 118, 2015–2031. [Google Scholar] [CrossRef]
- Tsai, T.H.; Chi, P.T. A single-stage face detection and face recognition deep neural network based on feature pyramid and triplet loss. IET Image Process. 2022, 16, 2148–2156. [Google Scholar] [CrossRef]
- Neto, P.C.; Sequeira, A.F.; Cardoso, J.S.; Terhörst, P. PIC-Score: Probabilistic Interpretable Comparison Score for Optimal Matching Confidence in Single-and Multi-Biometric Face Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 1021–1029. [Google Scholar]
- Gui, J.; Sun, Z.; Wen, Y.; Tao, D.; Ye, J. A review on generative adversarial networks: Algorithms, theory, and applications. IEEE Trans. Knowl. Data Eng. 2021, 35, 3313–3332. [Google Scholar] [CrossRef]
- Singla, K.; Pandey, R.; Ghanekar, U. A review on Single Image Super Resolution techniques using generative adversarial network. Optik 2022, 266, 169607. [Google Scholar] [CrossRef]
- Ledig, C.; Theis, L.; Huszár, F.; Caballero, J.; Cunningham, A.; Acosta, A.; Aitken, A.; Tejani, A.; Totz, J.; Wang, Z.; et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4681–4690. [Google Scholar]
- Ning, X.; Gou, D.; Dong, X.; Tian, W.; Yu, L.; Wang, C. Conditional generative adversarial networks based on the principle of homology continuity for face aging. Concurr. Comput. Pract. Exp. 2022, 34, e5792. [Google Scholar] [CrossRef]
- Tran, L.; Yin, X.; Liu, X. Disentangled representation learning gan for pose-invariant face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1415–1424. [Google Scholar]
- Lu, Y.; Wu, S.; Tai, Y.W.; Tang, C.K. Image generation from sketch constraint using contextual gan. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 205–220. [Google Scholar]
- Isola, P.; Zhu, J.Y.; Zhou, T.; Efros, A.A. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1125–1134. [Google Scholar]
- Kwon, Y.H.; Park, M.G. Predicting future frames using retrospective cycle gan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1811–1820. [Google Scholar]
- Martinez, A.; Benavente, R. The AR Face Database: CVC Technical Report, 24. 1998. Available online: http://www.cat.uab.es/Public/Publications/1998/MaB1998/CVCReport24.pdf (accessed on 3 June 2024).
- Parkhi, O.; Vedaldi, A.; Zisserman, A. Deep face recognition. In Proceedings of the BMVC 2015-Proceedings of the British Machine Vision Conference 2015, British Machine Vision Association, Swansea, UK, 7–10 September 2015. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Schroff, F.; Kalenichenko, D.; Philbin, J. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 815–823. [Google Scholar]
- Taigman, Y.; Yang, M.; Ranzato, M.; Wolf, L. Deepface: Closing the gap to human-level performance in face verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1701–1708. [Google Scholar]
Figure 1. Proposed CGAN architecture for synthetic image generation.
Figure 2. Proposed discriminator architecture.
Figure 3. Proposed generator architecture.
Figure 4. Image pairs for training CGAN. Here, we paired each neutral face image with six different expression images, respectively. In (A), we paired the neutral face image with illumination change image. In (B), we paired the neutral face image with anger image. In (C), we paired the neutral face image with the glasses image. In (D), we paired the neutral face image with the disguise image. In (E), we paired the neutral face with the stare image. In (F), we paired the neutral face with the smile image.
Figure 5. Containing set of paired images for CGAN testing. A neutral image of a person is paired with the expressions of different individuals for testing each CGAN performance. In (A), we paired the neutral face image with the stare image of a different individual. In (B), we paired the neutral face image with the smile image of a female individual. In (C), we paired neutral face image with the anger image of a different individual. In (D), we paired the neutral face image with the glasses image of a different individual. In (E), we paired the neutral face with the illumination change image of a different individual. In (F), we paired the neutral face with a disguised image of a female individual. Here, the reference paired images are identity-independent.
Figure 6. Six generated variations provided from a single neutral face image. The variations are generated glasses, generated illumination, generated anger, generated stare, generated smile, and generated glasses. All these expressions were generated from a single neutral face image.
Figure 7. One-on-one comparison of the generated and real image/original of the same person.
Figure 8. Six generated facial variations for a female individual with darker skin tone.
Figure 9. Six generated facial variations for a male individual with darker skin tone.
Figure 10. Illustration of training and validation accuracies, along with training loss and validation loss of the Convolutional Neural Network (CNN).
Figure 11. Original images of the person used for testing the performance of the CNN model.
Figure 12. Accuracy comparison with state-of-the-art SSPP methods [17,19,20,26].
Table 1. Overview of optimized hyper-parameters for facial pose generation.
| Facial Poses | Batch Size | Generator Direction | Learning Rate | Epochs |
|---|---|---|---|---|
| Smile | 1 | A to B | 0.0002 | 650 |
| Anger | 1 | A to B | 0.0002 | 650 |
| Stare | 1 | A to B | 0.0002 | 650 |
| Scarf | 1 | A to B | 0.0002 | 650 |
| Lighting | 1 | A to B | 0.0002 | 650 |
| Glasses | 1 | A to B | 0.0002 | 650 |
Table 2. Comparison of the accuracy of each model with and without the generated images.
| Methods | Accuracy w/o Generated Images | Accuracy w/ Generated Images | Change in Accuracy |
|---|---|---|---|
| SSAE [19] | 65.31% | 85.56% | +20.25% |
| SAS [20] | 68.34% | 89.87% | +21.53% |
| OSL [17] | 65.23% | 87.22% | +21.99% |
| SSFD [26] | 62.12% | 90.21% | +28.09% |
| VGG-Face [37] | 75.45% | 99.21% | +23.76% |
| ResNet-50 [38] | 76.55% | 99.55% | +23.00% |
| FaceNet [39] | 75.88% | 99.01% | +23.13% |
| DeepFace [40] | 76.89% | 99.21% | +22.32% |
| CustomCNN | 68.00% | 91.12% | +23.12% |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).