Review
A Contemporary Survey on Deepfake Detection: Datasets,
Algorithms, and Challenges
Liang Yu Gong *,† and Xue Jun Li *,†
Abstract: Deepfakes are notorious for their unethical and malicious use in pursuit of economic, political, and social-reputation goals. Recent years have seen widespread facial forgery that requires no technical skill. With the development of generative adversarial networks (GANs) and diffusion models (DMs), deepfake generation keeps improving in quality. It is therefore necessary to find effective methods to detect fake media. This contemporary survey provides a comprehensive overview of several typical facial forgery detection methods proposed from 2019 to 2023. We analyze and group them into four categories according to their feature extraction methods and network architectures: traditional convolutional neural network (CNN)-based detection, CNN backbones with semi-supervised detection, transformer-based detection, and biological signal detection. Furthermore, we summarize several representative deepfake detection datasets along with their advantages and disadvantages. Finally, we evaluate the performance of these detection models on different datasets by comparing their evaluation metrics. Across all experimental results on these state-of-the-art detection models, we find that accuracy degrades considerably under cross-dataset evaluation. These results provide a reference for further research toward more reliable detection algorithms.
beyond research articles and their citations; we then noticed a surprising shift in the publication trend, with deepfake detection surpassing deepfake generation in the past two years. It is therefore necessary to conduct in-depth research on deepfake detection methods so that future investigators can build on them to prevent fraud through facial forgery and the dissemination of illegal information via deepfakes.
Figure 1. Relative publication data obtained from Dimensions database [9] at the end of 2023 by
searching “deepfake generation” and “deepfake detection” as keywords: (a) number of deepfake-
generation-related scholarly papers from 2014 to 2023; (b) number of deepfake-detection-related
papers from 2014 to 2023.
Thus, a comprehensive literature review of deepfake detection will help researchers study this field further from different aspects as deepfake generation develops. This motivates us to present a deepfake detection survey that reviews (1) deepfake detection databases, (2) several typical frame-based and video-based detection methods, grouped into categories, (3) the latest trend in detection methods (biological-signal-based), and (4) a summary and analysis of future trends in deepfake detection. Specifically, the main contributions of this survey are twofold.
2.1. FaceForensics++
FaceForensics++ [13] is a pioneering large-scale dataset in the field of face manipulation detection. Its facial manipulations are representative, covering the DeepFakes, Face2Face, FaceSwap, FaceShifter, and Neural Textures methods, with data at random compression levels and sizes [14]. The database originates from YouTube videos and contains 1000 real videos and 4000 fake videos, 60% featuring women and 40% featuring men. In addition, the videos come in three resolutions: 480p (VGA), 720p (HD), and 1080p (FHD). As a pioneering dataset, it offers data at different quality levels and an equalized gender distribution. The deepfake algorithms involve face alignment and Gauss–Newton optimization. However, this dataset suffers from low visual quality under high compression and from visible boundaries of the fake mask. Its main limitation is the lack of advanced color-blending processing, so some source facial colors are easily distinguishable from target facial colors. In addition, some target samples cannot fit effectively onto the source faces because of facial landmark mismatch, as shown in Figure 2.
Figure 2. Several FaceForensics++ samples. The manipulation methods are DeepFakes (Row 1),
Face2Face (Row 2), FaceSwap (Row 3), and Neural Textures (Row 4). DeepFakes and FaceSwap
methods usually create low-quality manipulated facial sequences with color, landmark, and boundary
mismatch. Face2Face and Neural Textures methods can output slightly better-quality manipulated
sequences but with different resolutions.
2.2. DFDC
From 2020 to 2023, Facebook, Microsoft, Amazon, and several research institutions jointly ran the Deepfake Detection Challenge (DFDC) [8] on Kaggle to address the problem of deepfakes: realistic AI-generated videos of people performing illegal activities, which strongly affect how people judge the legitimacy of online information. The DFDC dataset is currently the largest public facial forgery dataset, containing 119,197 video clips of 10 s duration filmed with real actors. The manipulated data (see Figure 3) are generated by deepfake, GAN-based, and non-learned techniques, with resolutions ranging from 320 × 240 to 3840 × 2160 and frame rates from 15 fps to 30 fps. Compared with FaceForensics++, this database offers a sufficiently large number of samples, varied poses, and rich ethnic diversity. In addition, the original videos come from 66 paid actors rather than YouTube videos, and the fake videos are generated with attributes similar to the original real videos. However, the main drawback is that data quality varies because several generative methods of differing capability were used. Consequently, some samples exhibit boundary mismatch, and the source and target faces have different resolutions.
Figure 3. DFDC samples. We used the InsightFace face detection model to extract human faces from the DFDC. Although some samples lack color blending and show obvious facial boundaries, the average quality is slightly higher than that of first-generation deepfake datasets.
2.3. Celeb-DF V2
Celeb-DF V2 is derived from 590 original YouTube celebrity videos and contains 5639 manipulated videos generated mainly with FaceSwap [15] and DFaker. It covers multiple age, ethnicity, and gender distributions and incorporates many visual improvements, making the fake videos almost indistinguishable to the human eye [16]. The dataset exhibits large variations in face size, orientation, and background. In addition, post-processing is applied by increasing the resolution of facial regions, applying color transfer algorithms, and refining inaccurate face masks. However, the main limitation of this dataset is its small size and limited sample diversity: all original samples are downloaded from YouTube celebrity videos, and ethnic diversity is low, especially for Asian faces. Here, we present a few samples of Celeb-DF V2 (see Figure 4).
Figure 4. Cropped manipulated facial frames from Celeb-DF V2. Except for cross-gender and cross-ethnicity fake samples (Row 3), it is hard to distinguish real from fake images with the human eye.
Table 1. The typical and commonly used datasets of facial forgery detection.
3. Deepfake Detection
Different survey articles have categorized deepfake detection methods from different
perspectives. For example, Rana et al. [7] group detection methods into four categories:
deep-learning-based techniques, classical machine-learning-based techniques, statistical
Capsule Network
Capsule Network [22] itself is not a new concept in deep learning; it was first proposed to address the limited applicability of traditional convolutional neural networks (CNNs) to "inverse graphics". For classification, traditional CNNs stack convolutional layers to extract multi-scale features corresponding to different receptive fields, with little consideration of the relationships between different features. In contrast, Capsule Networks (see Figure 5) have one typical strength: they can learn 3D spatial information about objects and their relationships and model them explicitly. Moreover, a Capsule Network [23] can use fewer parameters and less data to perform similarly to a CNN.
Specifically, Nguyen et al. [23] use part of VGG-19 [24] (from the first to the third layer), pre-trained on the ILSVRC dataset [25], as the feature extraction backbone, which reduces overfitting and enables transfer learning. The input images are resized to 300 × 300. The features output by the backbone are sent to 10 primary capsules, which output 4 × 1 vector capsules via a dynamic routing algorithm. Each primary capsule [26] consists of two 2D CBL modules (convolution layer + batch normalization + ReLU), a statistical pooling layer, and two 1D CB modules (1D convolution layer + batch normalization); the overall CapsuleNet is the parallel connection of these ten primary capsules (see Figure 6). The parameters of the convolutional blocks are also shown in Figure 6. Another difference between CapsuleNet and a CNN is the output: dynamic routing determines how the outputs of lower-level capsules are allocated to higher-level capsules.
Figure 5. The architecture of CapsuleNet face forensics detection. This method uses a CNN backbone to extract features and a Capsule Network to output vectors for prediction.
Figure 6. Details of the primary Capsule Network structure with relative parameters. Each Capsule
Network includes the parallel connection of 10 primary capsules.
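To make the primary capsule structure concrete, the following is a minimal PyTorch sketch of one primary capsule as described above (two 2D CBL blocks, a statistical pooling layer producing per-channel mean and variance, and two 1D CB blocks). The channel sizes and kernel widths are illustrative assumptions, not the exact parameters of [23,26].

```python
import torch
import torch.nn as nn

class StatisticalPooling(nn.Module):
    """Collapse the spatial dimensions into per-channel mean and variance."""
    def forward(self, x):                       # x: [B, C, H, W]
        mean = x.mean(dim=(2, 3))               # [B, C]
        var = x.var(dim=(2, 3))                 # [B, C]
        return torch.stack([mean, var], dim=2)  # [B, C, 2]

class PrimaryCapsule(nn.Module):
    """One primary capsule: 2x (Conv2d+BN+ReLU) -> statistical pooling -> 2x (Conv1d+BN)."""
    def __init__(self, in_ch=256, mid_ch=64, out_dim=4):
        super().__init__()
        self.cbl = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 3, padding=1), nn.BatchNorm2d(mid_ch), nn.ReLU(),
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.BatchNorm2d(mid_ch), nn.ReLU(),
        )
        self.pool = StatisticalPooling()
        self.cb = nn.Sequential(
            nn.Conv1d(mid_ch, 16, kernel_size=2), nn.BatchNorm1d(16),
            nn.Conv1d(16, out_dim, kernel_size=1), nn.BatchNorm1d(out_dim),
        )

    def forward(self, x):             # x: feature map from the VGG-19 backbone
        x = self.cbl(x)               # [B, mid_ch, H, W]
        x = self.pool(x)              # [B, mid_ch, 2]
        x = self.cb(x)                # [B, out_dim, 1]
        return x.squeeze(-1)          # [B, out_dim] capsule vector, routed to output capsules
```

In the full detector, ten such primary capsules run in parallel on the backbone features, and dynamic routing aggregates their vectors into the real/fake output capsules.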
Figure 7. The architecture of CORE. The method extracts pairs of representations via data augmentation and calculates a consistency loss to guide the final loss function.
Thus, CORE observes the similarity between the representations of two views obtained by data augmentation and feature extraction, based on the idea that each sample should yield consistent representations even under different data augmentations. The main strength of this method is a new loss function in which classification is guided by both the cross-entropy loss and the representation similarity across different data augmentations.
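As a concrete illustration, the following is a minimal PyTorch sketch of a CORE-style objective under the description above: cross-entropy on two augmented views plus a cosine consistency term between their representations. The balance weight `lam` and the exact reduction are assumptions, not the paper's precise formulation.

```python
import torch.nn.functional as F

def core_loss(feat_v1, feat_v2, logits_v1, logits_v2, labels, lam=1.0):
    """CORE-style objective (sketch): cross-entropy on both augmented views plus a
    cosine consistency term that pulls the two views' representations together."""
    ce = F.cross_entropy(logits_v1, labels) + F.cross_entropy(logits_v2, labels)
    # Consistency term: 1 - cosine similarity, averaged over the batch.
    consistency = (1.0 - F.cosine_similarity(feat_v1, feat_v2, dim=1)).mean()
    return ce + lam * consistency
```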
Figure 8. The architecture of the consistency branch and classification branch [32]. The method calculates patch similarity and classification loss to guide the final loss.
$L_{PCL} = \frac{1}{N} \sum \mathrm{BCE}(V_{pred}, V_{GT})$ (6)
In short, this method also uses two terms to determine the final loss function. The difference is that it measures the similarity between patches rather than the consistency between different representations, based on the notion that the forgery clues introduced by fake video generation project onto different feature patches. Ablation experiments were performed on four datasets, including DFR, CD2, DFDC, and DFDC-P, and the average area under the curve (AUC) reached above 82%, which is a significant development in forgery detection. However, there are two limitations: this method cannot detect entire facial synthesis created by GANs or diffusion models, and its performance on low-quality data can be further improved.
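To illustrate Equation (6), the following is a hedged PyTorch sketch. The way the ground-truth consistency volume V_GT is derived here, from a down-sampled manipulation mask in which two patches are "consistent" if and only if they share the same source, is an assumption made for illustration, not necessarily the exact construction in [32].

```python
import torch.nn.functional as F

def pcl_loss(v_pred, mask_patches):
    """Eq. (6) sketch: L_PCL = (1/N) * sum BCE(V_pred, V_GT).
    v_pred: [B, P, P] predicted pairwise patch-consistency map with values in [0, 1].
    mask_patches: [B, H, W] down-sampled manipulation mask (1 = manipulated patch),
    from which an assumed ground-truth volume is built: 1 if two patches come from
    the same source, else 0."""
    flat = mask_patches.flatten(1).float()                   # [B, P] with P = H * W
    v_gt = (flat.unsqueeze(2) == flat.unsqueeze(1)).float()  # [B, P, P]
    return F.binary_cross_entropy(v_pred, v_gt)              # mean BCE over all patch pairs
```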
The architecture of DCL (see Figure 9) includes two feature extraction encoders (f_q and f_k); batches of images are fed to the encoders, and a 1 × 1 convolution squeezes the channels of the extracted features to obtain the query q and the key k separately. The encoders follow an exponential moving average strategy: the parameters of the key encoder are updated from the query encoder's parameters. Generalized feature learning is promoted jointly by a cross-entropy loss, the InfoNCE loss [38], and an intra-instance contrastive loss. For the performance of T-face, Sun et al. [36] conducted a cross-dataset experiment trained on FF++ and tested on DFD, DFDC, WildDeepfake, and Celeb-DF. The AUC and equal error rate (EER) show relatively good results on Celeb-DF and DFD, as listed in Table 3.
Figure 9. An overview of the DCL architecture. Reprinted with permission from Ref. [36]. 2024, K. Sun.
Four random data augmentation factors were utilized in this contrastive learning method to guide
inter-class and intra-class distances separately.
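The EMA update and the InfoNCE term mentioned above can be sketched in PyTorch as follows. This is a generic MoCo-style illustration consistent with the description in the text; the momentum value, the negative queue, and the temperature are assumptions rather than DCL's exact settings.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(key_encoder, query_encoder, momentum=0.999):
    """Exponential moving average: the key encoder slowly tracks the query encoder."""
    for k_p, q_p in zip(key_encoder.parameters(), query_encoder.parameters()):
        k_p.data.mul_(momentum).add_(q_p.data, alpha=1.0 - momentum)

def info_nce(q, k_pos, k_neg_queue, temperature=0.07):
    """InfoNCE loss [38]: pull each query toward its positive key and away from
    negatives stored in a memory queue."""
    q = F.normalize(q, dim=1)                         # [B, D]
    k_pos = F.normalize(k_pos, dim=1)                 # [B, D]
    k_neg = F.normalize(k_neg_queue, dim=1)           # [K, D]
    l_pos = (q * k_pos).sum(dim=1, keepdim=True)      # [B, 1]
    l_neg = q @ k_neg.t()                             # [B, K]
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)  # positive at index 0
    return F.cross_entropy(logits, labels)
```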
Figure 10. The structure of STIL [40]. Each STIL block contains SIM, TIM, and ISM modules.
Firstly, the input sequence [T, C, H, W] is split along the channel dimension into two portions, each a feature of size [T, C/2, H, W]. The two portions are then fed separately into SIM and TIM to acquire two inconsistency representations from the spatial and temporal perspectives. Specifically, SIM and TIM are designed as three-branch modules, as shown in Tables 4 and 5, which aim to find pixel-wise boundary mismatch and temporal inconsistency, respectively. Based on the ablation experiments, the authors of STIL found that the best performance is obtained when the SIM modules are fused into the TIM modules.
TIM modules use a temporal difference calculation, i.e., the subtraction of adjacent frame features:

$s_t^h = \mathrm{Conv1}(x_{t+1}^h - x_t^h)$ (7)

where Conv1 is a 3 × 1 convolutional layer, and $x_{t+1}^h$ and $x_t^h$ are the features of frames $t+1$ and $t$, respectively.
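As a rough illustration of the channel split and the temporal difference in Equation (7), a PyTorch sketch follows; the kernel size and padding are assumptions consistent with the 3 × 1 convolution described above, not STIL's exact implementation.

```python
import torch.nn as nn

def split_channels(x):
    """Split the input feature [T, C, H, W] along channels into two [T, C/2, H, W]
    halves that feed the SIM and TIM branches, respectively."""
    c = x.size(1)
    return x[:, : c // 2], x[:, c // 2 :]

class TemporalDifference(nn.Module):
    """Eq. (7) sketch: s_t = Conv1(x_{t+1} - x_t) over adjacent per-frame features."""
    def __init__(self, channels):
        super().__init__()
        # A 3 x 1 convolution, as described for Conv1 in the text (padding keeps the height).
        self.conv = nn.Conv2d(channels, channels, kernel_size=(3, 1), padding=(1, 0))

    def forward(self, x):            # x: [T, C, H, W] per-frame features
        diff = x[1:] - x[:-1]        # adjacent-frame differences, [T-1, C, H, W]
        return self.conv(diff)       # temporal inconsistency features
```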
Table 4. The three-branch structure of the SIM module.
Upper branch: ResNet block for shortcut connection
Middle branch: Downsampling, 1 × 3 and 3 × 1 convolutional layers to obtain vertical and horizontal features, then upsampling
Confidence calculation: Fuse the upper and middle branch features to obtain a confidence map via sigmoid
Bottom branch: 3 × 3 convolutional layer, multiplied by the confidence
Table 5. The three-branch structure of the TIM module.
Upper branch: Convolution and reshape, temporal difference calculation, and vertical temporal inconsistency enhancement
Middle branch: Convolution and reshape, temporal difference calculation, and horizontal temporal inconsistency enhancement
Bottom branch: ResNet block for shortcut connection
flatten the positional embedding and project it into a latent space. A patch selection mechanism based on attention weights is applied in the second module, which pays more attention to sensitive information and dismisses less useful information during training. Once the two-stream transformer blocks output the features with their patch attention weights, both streams make initial predictions, and the average of all predicted results is the final decision.
Figure 11. An overview of DFDT framework for deepfake detection [44]. It includes overlapping
patch embedding, patch selection mechanism, multi-scale transformer block, and classifier.
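The patch selection idea can be illustrated with a short, hypothetical PyTorch sketch that keeps only the top-scoring patch tokens by attention weight; the keep ratio and tensor shapes are assumptions for illustration, not DFDT's exact design.

```python
import torch

def select_patches(tokens, attn_weights, keep_ratio=0.5):
    """Keep only the patch tokens that receive the highest attention weights.
    tokens: [B, N, D] patch embeddings; attn_weights: [B, N] attention scores."""
    k = max(1, int(tokens.size(1) * keep_ratio))
    idx = attn_weights.topk(k, dim=1).indices               # [B, k] indices of kept patches
    idx = idx.unsqueeze(-1).expand(-1, -1, tokens.size(2))  # [B, k, D]
    return torch.gather(tokens, dim=1, index=idx)           # [B, k, D] selected tokens
```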
embedding [47] applied on each temporal attention block is used to distinguish the order
of frames. Finally, the MLP block is selected for the final classification task.
Figure 12. An architecture of ISTVT [46]. It consists of four basic components: backbone feature
extraction, token embedding, self-attention blocks, and MLP.
Cross-dataset experiments were conducted, and the accuracy on Celeb-DF, DFDC, and FaceShifter reached 84.1%, 74.2%, and 99.3%, respectively. These improved accuracy results show that a transformer exploiting spatial–temporal inconsistency generalizes to unseen data better than previous video-based detection methods.
stability and representative power of PPG cells, because a small window will miss PPG frequencies, while an overly large window will include more noise.
Physiological signals can be used to classify deepfake videos, and they can also be combined with multi-modal information. For example, Stefanov et al. [50] proposed a method (see Figure 13) that extracts physiological signals and uses a graph convolutional network (GCN) [51] to fuse video and physiological signals and to detect dissimilarity between the audio and video modalities. In particular, this method proposes an intriguing algorithm to obtain visual physiological signals through the following steps: first, facial areas are detected and aligned, and background areas are removed; then, the aligned faces, with and without a square occlusion patch, are passed through MTTS-CAN [52] to estimate heart rate and respiration rate. Finally, the difference between the signals estimated with and without occlusion is taken as the relative contribution of the masked-out region to the two physiological signals. In addition, a graph-based model is designed to fuse facial information with this visual physiological map: features previously extracted by ResNet18 serve as graph nodes, each connected to the physiological map, and cosine similarity is calculated between them. By training the GCN model and ResNet18 together, the model combines the two modalities and generates visual representations that incorporate physiological signals.
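The occlusion-based attribution step described above can be sketched as follows; `estimate_vitals` is a hypothetical stand-in for an rPPG/respiration estimator such as MTTS-CAN [52], and the patch size is an assumption.

```python
import numpy as np

def physiological_contribution_maps(clip, estimate_vitals, patch=32):
    """Occlusion-based attribution sketch. clip: face-aligned video, array [T, H, W, 3];
    estimate_vitals: hypothetical wrapper returning (heart_rate, respiration_rate) for a
    clip. For each square patch, the change in the estimates caused by masking that patch
    is taken as its relative contribution to the two physiological signals."""
    hr_ref, rr_ref = estimate_vitals(clip)
    _, H, W, _ = clip.shape
    hr_map = np.zeros((H // patch, W // patch))
    rr_map = np.zeros((H // patch, W // patch))
    for i in range(0, H - patch + 1, patch):
        for j in range(0, W - patch + 1, patch):
            occluded = clip.copy()
            occluded[:, i:i + patch, j:j + patch, :] = 0   # square occlusion patch
            hr_occ, rr_occ = estimate_vitals(occluded)
            hr_map[i // patch, j // patch] = abs(hr_ref - hr_occ)
            rr_map[i // patch, j // patch] = abs(rr_ref - rr_occ)
    return hr_map, rr_map                                  # visual physiological maps
```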
Figure 13. Two approaches combining visual representations with physiological signals. Reprinted with permission from Ref. [50], 2024, Stefanov, K.
positive and negative samples are correctly or incorrectly predicted. The confidence score of each sample is used in turn as the threshold by traversing all samples. From these thresholds, pairs of true-positive rate and false-positive rate are calculated to draw a curve, namely, the ROC curve; the area under this curve is the AUC. Once the ROC is drawn, the equal error rate (EER) is defined as the point where the false-negative rate equals the false-positive rate. In conclusion, the larger the accuracy and AUC values, the better the model's classification performance; the smaller the EER value, the better the model performs.
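For reference, AUC and EER can be computed from sample labels and confidence scores as follows; this is a standard recipe using scikit-learn, not code taken from any of the surveyed papers.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

def auc_and_eer(labels, scores):
    """Compute AUC and EER from binary labels and classifier confidence scores."""
    fpr, tpr, _ = roc_curve(labels, scores)
    roc_auc = auc(fpr, tpr)
    fnr = 1.0 - tpr
    # EER: the operating point where the false-negative rate equals the false-positive
    # rate (approximated at the nearest available threshold).
    idx = np.nanargmin(np.abs(fnr - fpr))
    eer = (fpr[idx] + fnr[idx]) / 2.0
    return roc_auc, eer

# Example usage: roc_auc, eer = auc_and_eer([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```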
These experimental results show that, with the Capsule Network, the testing accuracy for deepfakes depends strongly on the training data themselves, and the performance indicators for the forgery class drop sharply in cross-dataset experiments, even with a dropout layer at a rate of 0.5. This is because the quality of the databases differs from one another. We therefore conducted an experiment mixing some Celeb-DF samples into the training set, and the testing accuracy increased considerably, also reaching above 90%. Thus, we can make two bold assumptions: (1) a network formed by stacking CBL modules, without any advanced data augmentation, is not very effective at extracting the features of all kinds of manipulated regions, and (2) a loss guided only by cross-entropy cannot provide reliable classification results.
3.5.4. CORE
The experiment used FF++ as the training set, which includes DeepFakes, Face2Face, FaceSwap, Neural Textures, and one extra deepfake detection class. Thirty frames were extracted from each video, and the corresponding FF++ facial masks were used to obtain the facial regions; the training and testing sets strictly followed the FF++ data split, with 86,263 fake images for training and 16,641 fake images for testing. The experiments explored and set a balance weight and a cosine consistency loss to establish the benchmark area under the curve (AUC), which reached 99.96% on the FaceForensics++ in-dataset test within 30 epochs. The cross-dataset test results reached 72.41% and 75.72% on DFDC and Celeb-DF, respectively. These results were also confirmed by our experiments.
Following the earlier review of the consistency representation method, this method assumes that the generalization ability of a detection model depends on the consistency of its predictions; that is, real samples should be predicted as real regardless of how the data are augmented. Thus, an ablation experiment is needed to determine which data augmentation method preserves the consistency of the model's predictions to the largest extent. As noted above, different augmentation methods and balance weights yield different evaluation results. We therefore ran an ablation test on data augmentation, including RaAug, DFDC-Selim, and Random Erasing, and found that CORE performs best with the DFDC-Selim [53] augmentation. The reproduced test results are shown in Table 7.
The cross-dataset experiment on Celeb-DF shows that this detection method indeed greatly enhances generalization, but its performance is largely determined by the data augmentation method and does not yet reach a reliably high evaluation metric.
Figure 14. Some fake samples with obvious forgery clues are wrongly predicted as “Real”.
Author Contributions: Conceptualization, L.Y.G. and X.J.L.; methodology, L.Y.G. and X.J.L.; software,
L.Y.G.; validation, L.Y.G.; formal analysis, L.Y.G.; investigation, L.Y.G.; resources, L.Y.G. and X.J.L.;
data curation, L.Y.G. and X.J.L.; writing—original draft preparation, L.Y.G. and X.J.L.; writing—
review and editing, L.Y.G. and X.J.L.; visualization, L.Y.G. and X.J.L.; supervision, X.J.L.; project
administration, X.J.L. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.
Data Availability Statement: The relevant datasets were downloaded from [8,13]. The reproduced code is from [3,11,15,26,27,53].
Conflicts of Interest: The authors declare no conflicts of interest.
References
1. Abdulreda, A.S.; Obaid, A.J. A landscape view of deepfake techniques and detection methods. Int. J. Nonlinear Anal. Appl. 2022,
13, 745–755.
2. Zhang, L.; Lu, T. Overview of Facial Deepfake Video Detection Methods. J. Front. Comput. Sci. Technol. 2022, 17, 1–26.
3. FaceSwap-GAN. Available online: https://github.com/shaoanlu/faceswap-GAN (accessed on 15 December 2018).
4. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Networks. Commun. ACM 2020, 63, 139–144. [CrossRef]
5. Radford, A.; Metz, L.; Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks.
In Proceedings of the Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015.
6. Ho, J.; Jain, A.; Abbeel, P. Denoising Diffusion Probabilistic Models. In Proceedings of the 34th International Conference on
Neural Information Processing System, Red Hook, NY, USA, 6–12 December 2020; pp. 6840–6851.
7. Rana, M.S.; Nobi, M.N.; Murali, B.; Sung, A.H. Deepfake Detection: A Systematic Literature Review. IEEE Access 2022, 10,
25494–25513. [CrossRef]
8. Kaggle. Available online: https://www.kaggle.com/c/deepfake-detection-challenge/overview (accessed on 12 December 2023).
9. Dimensions Scholarly Database. Available online: https://app.dimensions.ai/ (accessed on 10 December 2023).
10. DeepFaceLive. Available online: https://github.com/iperov/DeepFaceLive (accessed on 9 November 2023).
11. Roop. Available online: https://github.com/s0md3v/roop (accessed on 11 October 2023).
12. Li, Y.Z.; Yang, X.; Sun, P.; Qi, H.G.; Lyu, S. Celeb-DF: A Large-scale Challenging Dataset for DeepFake Forensics. In Proceedings
of the Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020.
13. Zhou, T.F.; Wang, W.G.; Liang, Z.Y.; Shen, J.B. Face Forensics in the Wild. In Proceedings of the Computer Vision and Pattern
Recognition (CVPR), Virtual, 19–25 June 2021.
14. Guo, J.; Deng, J.; Lattas, A.; Zafeiriou, S. Sample and Computation Redistribution for Efficient Face Detection. In Proceedings of
the Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020.
15. FaceSwap. Available online: https://github.com/deepfakes/faceswap (accessed on 10 November 2020).
16. Tolosana, R.; Romero-Tapiador, S. DeepFakes Evolution: Analysis of Facial Regions and Fake Detection Performance. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020.
17. Dang, H.; Liu, F.; Stehouwer, J.; Liu, X.; Jain, A. On the Detection of Digital Face Manipulation. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020.
18. Karras, T.; Laine, S.; Aila, T. A Style-Based Generator Architecture for Generative Adversarial Networks. In Proceedings of the
Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019.
19. Li, Y.; Chang, M.C.; Lyu, S. In Ictu Oculi: Exposing AI Created Fake Videos by Detecting Eye Blinking. In Proceedings of the IEEE
International Workshop on Information Forensics and Security, Hong Kong, China, 11–13 December 2018; pp. 1–7.
20. Nguyen, T.T.; Nguyen, Q.; Nguyen, D.; Nguyen, D.T.; Huynh-The, T.; Nahavandi, S.; Nguyen, T.; Pham, Q.; Nguyen, C. Deep
Learning for Deepfakes Creation and Detection: A Survey. arXiv 2022, arXiv:1909.11573.
21. Alahamari, F.; Naim, A.; Alqahtani, H. IoT-enabled Convolutional Neural Networks: Techniques and Applications. Chap-
ter: E-Learning Modelling Technique and Convolution Neural Networks in Online Education. 2023. Available online:
https://www.taylorfrancis.com/chapters/edit/10.1201/9781003393030-10/learning-modeling-technique-convolution-neural-
networks-online-education-fahad-alahmari-arshi-naim-hamed-alqa (accessed on 5 January 2024).
22. Sabour, S.; Frosst, N.; Hinton, G.E. Dynamic Routing between Capsules. In Proceedings of the 31st International Conference on
Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 3859–3869.
23. Nguyen, H.H.; Yamagishi, J.; Echizen, I. Use of a Capsule Network to Detect Fake Images and Videos. In Proceedings of the
Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019.
24. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proceedings of the
Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014.
25. ImageNet Large Scale Visual Recognition Challenge (ILSVRC). Available online: https://image-net.org/challenges/LSVRC/
(accessed on 13 December 2023).
26. Capsule-Forensics-v2: Implementation of the Capsule-Forensics-v2. Available online: https://github.com/nii-yamagishilab/
Capsule-Forensics-v2 (accessed on 29 October 2019).
27. Ni, Y.; Meng, D.; Yu, C.; Quan, C.B.; Ren, D.; Zhao, Y. CORE: Consistent Representation Learning for Face Forgery Detection. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022.
28. Zhong, Z.; Zheng, L.; Kang, G.; Li, S.; Yang, Y. Random Erasing Data Augmentation. In Proceedings of the AAAI, New York, NY,
USA, 7–12 February 2020.
29. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the Computer Vision and Pattern
Recognition, Honolulu, HI, USA, 21–26 July 2017.
30. Li, L.; Bao, J.; Zhang, T.; Yang, H.; Chen, D.; Wen, F.; Guo, B. Face X-ray for More General Face Forgery Detection. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June
2020; pp. 5000–5009.
31. Li, Y.; Lyu, S. Exposing DeepFake Videos By Detecting Face Warping Artifacts. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–17 June 2019; pp. 16–20.
32. Zhao, T.; Xu, X.; Xu, M.; Ding, H.; Xiong, Y.; Xia, W. Learning Self-Consistency for Deepfake Detection. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021.
33. Chen, M.; Fridrich, J.; Goljan, M.; Lukas, J. Determining Image Origin and Integrity Using Sensor Noise. IEEE Trans. Inf. Forensics
Secur. 2008, 3, 74–90. [CrossRef]
34. Barni, M.; Bondi, L.; Bonettini, N.; Bestagini, P.; Costanzo, A.; Maggini, M.; Tondi, B.; Tubaro, S. Aligned and Non-Aligned Double
JPEG Detection Using Convolutional Neural Networks. J. Vis. Commun. Image Represent. 2017, 49, 153–163. [CrossRef]
35. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on
Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016.
36. Sun, K.; Yao, T.; Chen, S.; Ding, S.; Li, J.; Ji, R. Dual Contrastive Learning for General Face Forgery Detection. In Proceedings of
the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2022; Volume 36, pp. 2316–2324.
37. Fridrich, J.; Kodovsky, J. Rich Models for Steganalysis of Digital Images. IEEE Trans. Inf. Forensics Secur. 2012, 7, 868–882.
[CrossRef]
38. Gutmann, M.; Hyvarinen, A. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In
Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Sardinia, Italy, 13–15 May 2010;
Volume 9, pp. 297–304.
39. Shi, X.; Chen, Z.; Wang, H.; Yeung, D. Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting.
In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015.
40. Gu, Z.; Chen, Y.; Yao, T.; Ding, S.; Li, J.; Huang, F.; Ma, L. Spatiotemporal Inconsistency Learning for Deepfake Video Detection.
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021.
41. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.;
Gelly, S.; et al. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International
Conference on Learning Representations, Virtual, 26 April–1 May 2020.
42. Khan, S.A.; Dai, H. Video Transformer for Deepfake Detection with Incremental Learning. In Proceedings of the 29th ACM
International Conference on Multimedia, Virtual Event, 20–24 October 2021; pp. 1821–1828.
43. Wodajo, D.; Atnafu, S. Deepfake Video Detection Using Convolutional Vision Transformer. In Proceedings of the Computer
Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021.
44. Khormali, A.; Yuan, J. DFDT: An End-to-End DeepFake Detection Framework Using Vision Transformer. Appl. Sci. 2022, 12, 2953.
[CrossRef]
45. Trockman, A.; Zico Kolter, J. Patches Are All You Need? In Proceedings of the Computer Vision and Pattern Recognition, New
Orleans, LA, USA, 18–24 June 2022.
46. Zhao, C.; Wang, C.; Hu, G.; Chen, H.; Liu, C.; Tang, J. ISTVT: Interpretable Spatial-Temporal Video Transformer for Deepfake
Detection. IEEE Trans. Inf. Forensics Secur. 2023, 18, 1335–1348. [CrossRef]
47. Shaw, P.; Uszkoreit, J.; Vaswani, A. Self-attention with Relative Position Representations. In Proceedings of the NAACL 2018,
New Orleans, LA, USA, 1–6 June 2018.
48. Ciftci, U.A.; Demir, İ.; Yin, L. How Do the Hearts of Deep Fakes Beat? Deep Fake Source Detection via Interpreting Residuals
with Biological Signals. In Proceedings of the IEEE International Joint Conference on Biometrics (IJCB), Houston, TX, USA, 8
September–1 October 2020; pp. 1–10.
49. Wu, J.; Zhu, Y.; Jiang, X.; Liu, Y.; Lin, J. Local attention and long-distance interaction of rPPG for deepfake detection. In
Proceedings of the Visual Computer, Lake Tahoe, NV, USA, 16–18 October 2023.
50. Stefanov, K.; Paliwal, B.; Dhall, A. Visual Representation of Physiological Signals for Fake Video Detection. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022.
51. Kipf, T.N.; Welling, M. Semi-Supervised Classification with Graph Convolutional Networks. In Proceedings of the 5th Interna-
tional Conference on Learning Representations (ICLR 2017), Toulon, France, 24–26 April 2017.
52. Liu, X.; Fromm, J.; Patel, S.N.; McDuff, D. Multi-task temporal shift attention networks for on-device contactless vital measure-
ments. Adv. Neural Inf. Process. Syst. 2020, 33, 19400–19411.
53. DFDC-Selim. Available online: https://github.com/selimsef/dfdc_deepfake_challenge (accessed on 11 December 2022).
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.