Method for identifying deeply forged face image and video
Technical Field
The invention particularly relates to a method for identifying deeply forged (deepfake) face images and videos.
Background
Detection methods for forged videos can be divided into two categories. The first, based on temporal features across frames, judges using time-related characteristics in a video such as blink frequency and mouth movement, and generally uses a recurrent classification method. The second, based on visual artifacts within a frame, judges using flaws such as defects at image edges and unnatural details in the placement of facial features, facial shadows, and the like; it usually extracts these specific features and then completes detection with a deep or shallow classifier.
In addition, researchers have proposed tracing deep-forged videos using traceable, tamper-proof blockchain techniques. In 2019, researchers in the Department of Electrical and Computer Engineering at Khalifa University in Abu Dhabi, United Arab Emirates, published a paper entitled "Combating Deepfake Videos Using Blockchain and Smart Contracts", proposing a solution and a general framework that uses blockchains to track the source and history of digital content; the content can be tracked even after being copied multiple times. The framework proposed in the paper is generic and can be applied to any other form of digital content.
The specific achievement aspect is as follows:
In August 2017, the cybersecurity group of Singapore's Institute for Infocomm Research published a paper entitled "Automated Face Swapping and Its Detection", proposing for the first time a detection framework for AI face swapping, with detection accuracy reaching 92%. Since then, industry research on artificial-intelligence face swapping and its detection has surged, with enterprises, universities, and individual developers all investing in the development of face-swap detection tools.
In 2019, researchers at the University of California, Berkeley and the University of Southern California collected personal characteristics from existing genuine videos and built a highly personalized "soft biometric" identification system. Once the system has learned a person's micro-expressions and behavioral habits, its forgery-identification accuracy can reach 95%. In June 2019, Adobe of the United States also introduced a reverse-PS tool (PS refers to Photoshop, the most widely used image-editing software worldwide; here it means "edited pictures"). Using an AI algorithm, the tool automatically identifies the parts of a portrait photograph modified by the Liquify tool and restores the image to its original appearance, with accuracy as high as 99%.
To help researchers develop automatic detection tools for deep forgery, Google released in September 2019 a recognition dataset of deep-forged videos containing 3,000 video clips shot by multiple real actors in 28 different scenes. Researchers worldwide can use this fully open-source dataset to train deep-forgery detection tools.
However, the above techniques identify only single images; they do not consider contextual information in the video, and their neural networks cannot automatically exploit inter-frame information, so no inference can be drawn from inter-frame consistency. Because the methods and variations of real-world deep-forged videos cannot be exhausted, forgery algorithms are continuously improved, and new algorithms keep appearing, the characteristics and forgery artifacts of real-world deep-forged videos differ markedly from the forged datasets currently produced in the industry. A model trained on such forged datasets with an ordinary classification convolutional neural network therefore generalizes poorly and discriminates novel forged videos poorly.
Disclosure of Invention
The invention aims to provide a method for identifying deeply forged face images and videos, so as to solve the above-mentioned problems in the prior art.
The invention provides a method for identifying a deeply forged face image and a deeply forged video, which comprises the following steps:
s1, collecting a mixed training sample;
S2, constructing an identification model, the identification model comprising two 2D deep convolutional neural networks and one 3D deep convolutional neural network;
s3, training the identification model by using the mixed training sample;
and S4, identifying the face video to be identified by using the trained identification model.
Further, the method for collecting the mixed training samples in step S1 comprises:
s11, collecting a large number of deep forged videos and original videos corresponding to the deep forged videos to form a training data set;
S12, detecting the first face position in each frame of each deep-forged video using a face detection method, randomly intercepting a segment of length L from consecutive frames containing the forged face, and cropping out the face regions using the first face position information to form a deep-forged face segment;
S13, detecting the second face position in each frame of the original video corresponding to each deep-forged video using a face detection method, randomly intercepting a segment of length L from consecutive frames containing the face, and cropping out the face regions using the second face position information to form an original-video face segment;
S14, taking a frame F from the deep-forged face segment and the corresponding frame R from the original-video face segment, and adding them with weights to form a mixed face image;
and S15, applying the method of step S14 to all deep-forged face segments and their corresponding original-video face segments to form mixed face images, obtaining the mixed training samples.
Further, in step S14, the weight used to combine the frame F and the corresponding frame R is a random sample drawn from a certain distribution over [0, 1].
Further, in step S2, the convolution kernels of each 2D deep convolutional neural network are 2D, the backbone is a common deep convolutional neural network, and the fully connected layer is a 2-class (binary) structure.
Further, the method for training the 2D deep convolutional neural network in step S3 includes:
(1) extracting one frame of mixed face image from the mixed training samples, center-cropping it so that the face lies far from the edge of the image, and then repeatedly training the first 2D deep convolutional neural network with forward and backward propagation on the center-cropped mixed face image;
(2) extracting one frame of mixed face image from the mixed training samples, center-cropping it so that the face lies close to the edge of the image, and then repeatedly training the second 2D deep convolutional neural network with forward and backward propagation on the center-cropped mixed face image;
further, in step S2, the 3D deep convolutional neural network is based on a 2D deep convolutional neural network, and the convolution kernel thereof is replaced by a 3D convolution kernel, so that the 3D deep convolutional neural network has the capability of performing convolution between video frames.
Further, the method for training the 3D deep convolutional neural network in step S3 is to extract several consecutive frames of mixed face images from the mixed training samples, and then repeatedly train the 3D deep convolutional neural network on these frames with forward and backward propagation.
Further, step S4 includes the following sub-steps:
s41, randomly extracting a video frame segment from the face video to be identified;
S42, identifying the face in each frame of the video frame segment using the two trained 2D deep convolutional neural networks;
s43, identifying each frame in the video frame segment by using the trained 3D deep convolution neural network;
and S44, fusing the identification predictions of the two 2D deep convolutional neural networks and the 3D deep convolutional neural network by a weighted ensemble method to obtain the identification result.
Further, the weighted ensemble method in step S44 uses, as the weight of each prediction, the confidence of that prediction; the confidence is the distance between the prediction value and 0.5.
In summary, due to the adoption of the technical scheme, the invention has the beneficial effects that:
the invention proposes three improvements: (1) the generalization performance is improved by adopting a mixed training sample; (2) the face center cutting images of the large edge and the small edge are adopted to train two 2D depths, the depth convolution neural network of the prediction robustness is improved, and the prediction robustness is improved; (3) the 3D deep convolution neural network can utilize the interframe consistency information, so that the information utilization rate is improved; therefore, the method and the device can solve the problem that the discrimination capability of the prior art on the novel forged video is poor.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention, and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.
FIG. 1 is a flow chart of a method for identifying a deeply forged face image and video according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of collecting hybrid training samples according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of identifying a video of a face to be identified by using a trained identification model according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Examples
Referring to fig. 1, the present embodiment provides a method for identifying a deeply forged face image and a deeply forged video, including the following steps:
s1, collecting a mixed training sample;
referring to fig. 2, the method for collecting the hybrid training samples in step S1 includes:
s11, collecting a large number of deep forged videos and original videos corresponding to the deep forged videos to form a training data set;
S12, detecting the first face position in each frame of each deep-forged video using a face detection method, randomly intercepting a segment of length L from consecutive frames containing the forged face, and cropping out the face regions using the first face position information to form a deep-forged face segment;
S13, detecting the second face position in each frame of the original video corresponding to each deep-forged video using a face detection method, randomly intercepting a segment of length L from consecutive frames containing the face, and cropping out the face regions using the second face position information to form an original-video face segment;
S14, taking a frame F from the deep-forged face segment and the corresponding frame R from the original-video face segment, and adding them with weights to form a mixed face image; in some embodiments, the weight used to sum the frame F and the corresponding frame R is a random sample drawn from a certain distribution over [0, 1], such as a normal distribution truncated to [0, 1];
and S15, forming a mixed face image by the method of the step S14 on all the deeply forged face fragments and the corresponding original video face fragments, and obtaining a mixed training sample.
Step S1 uses a data augmentation method to generate novel mixed training samples from the original videos and deep-forged videos, which improves generalization performance.
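As a concrete illustration of steps S14 and S15, the frame-wise blending can be sketched as follows (a minimal sketch; the Beta(2, 2) distribution is one plausible choice of the [0, 1] weight distribution, not mandated by the invention):

```python
import numpy as np

def blend_frames(fake_frame, real_frame, rng):
    """Blend a deep-forged frame F with its original counterpart R (step S14).

    The mixing weight is a random sample in [0, 1]; a Beta(2, 2)
    distribution is used here as one plausible choice.
    """
    alpha = rng.beta(2.0, 2.0)  # random weight in [0, 1]
    mixed = alpha * fake_frame.astype(np.float64) \
            + (1.0 - alpha) * real_frame.astype(np.float64)
    return np.clip(mixed, 0, 255).astype(np.uint8)

def make_mixed_clip(fake_clip, real_clip, rng):
    """Blend every frame of a length-L pair of face segments (step S15)."""
    return [blend_frames(f, r, rng) for f, r in zip(fake_clip, real_clip)]
```

Every blended pixel lies between the corresponding forged and original pixel values, so the sample interpolates smoothly between the two classes.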
S2, constructing an identification model; the identification model comprises two 2D deep convolution neural networks and a 3D deep convolution neural network;
for a 2D deep convolutional neural network, a convolution kernel of each 2D deep convolutional neural network in this embodiment is 2D, a backbone network is a common deep convolutional neural network, and a full connection layer is a 2-class structure. The 2D depth convolution neural network is used for identifying whether a single image is subjected to depth forgery or not.
For the 3D deep convolutional neural network, the network of this embodiment is based on a 2D deep convolutional neural network with its convolution kernels replaced by 3D convolution kernels, giving it the capability to convolve across video frames. The 3D deep convolutional neural network is used to identify whether consecutive frame images have been deep-forged.
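The inter-frame capability gained by replacing 2D kernels with 3D kernels can be illustrated with a minimal valid-mode 3D convolution over a grayscale frame stack (a sketch for illustration only; a real network would use learned multi-channel kernels):

```python
import numpy as np

def conv3d_valid(clip, kernel):
    """Valid-mode 3D convolution over a (T, H, W) stack of grayscale frames.

    Unlike a 2D kernel applied frame by frame, the kernel's first axis
    spans time, so each output value mixes information from several
    adjacent frames.
    """
    kt, kh, kw = kernel.shape
    T, H, W = clip.shape
    out = np.zeros((T - kt + 1, H - kh + 1, W - kw + 1))
    for t in range(out.shape[0]):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[t, i, j] = np.sum(clip[t:t + kt, i:i + kh, j:j + kw] * kernel)
    return out
```

For example, a (3, 1, 1) kernel with weights (-1, 0, 1) along the time axis computes a temporal difference between frames, a response no per-frame 2D kernel can express.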
S3, training the identification model by using the mixed training sample;
for a 2D deep convolutional neural network, the method for training the 2D deep convolutional neural network in this embodiment is as follows:
(1) extracting one frame of mixed face image from the mixed training samples, center-cropping it so that the face lies far from the edge of the image, and then repeatedly training the first 2D deep convolutional neural network with forward and backward propagation on the center-cropped mixed face image;
(2) extracting one frame of mixed face image from the mixed training samples, center-cropping it so that the face lies close to the edge of the image, and then repeatedly training the second 2D deep convolutional neural network with forward and backward propagation on the center-cropped mixed face image;
in the process of training the 2D deep convolutional neural network, the face center cutting images of the large edge and the small edge are adopted for training respectively, so that the prediction robustness can be improved.
For the 3D deep convolutional neural network, the training method in this embodiment is to extract several consecutive frames of mixed face images from the mixed training samples, and then repeatedly train the 3D deep convolutional neural network on these frames with forward and backward propagation. When the 3D deep convolutional neural network identifies a frame of a video, it takes the preceding and following frames as references, so inter-frame consistency can be exploited and information utilization improved.
S4, identifying the face video to be identified by using the trained identification model;
as shown in fig. 3, step S4 includes the following sub-steps:
s41, randomly extracting a video frame segment with the length L from the face video to be identified;
S42, identifying the face in each frame of the video frame segment using the two trained 2D deep convolutional neural networks;
s43, identifying each frame in the video frame segment by using the trained 3D deep convolution neural network;
and S44, fusing the identification predictions of the two 2D deep convolutional neural networks and the 3D deep convolutional neural network by a weighted ensemble method to obtain the identification result. The weight of each prediction is its confidence. Each neural network outputs a prediction value in (0, 1) for a video segment: the closer to 1, the more likely the network considers the video to be forged; the closer to 0, the more likely it considers the video to be genuine. Predictions near 0 or 1 carry high confidence and those near 0.5 carry low confidence, so in this embodiment the confidence is the distance between the prediction value and 0.5.
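The confidence-weighted ensemble of step S44 can be sketched as follows (the fallback to a plain average when every branch outputs exactly 0.5 is an added assumption for well-definedness):

```python
def confidence_weighted_ensemble(predictions):
    """Fuse the three networks' forgery predictions (step S44).

    Each prediction lies in (0, 1); its confidence is the distance from
    0.5, so near-certain branches dominate the weighted average while
    undecided ones (prediction near 0.5) contribute little.
    """
    weights = [abs(p - 0.5) for p in predictions]
    total = sum(weights)
    if total == 0:  # all branches undecided -> plain average (assumption)
        return sum(predictions) / len(predictions)
    return sum(w * p for w, p in zip(weights, predictions)) / total
```

For example, with predictions 0.9, 0.8, and 0.5, the undecided branch contributes nothing and the result is the 0.4/0.3-weighted mean of 0.9 and 0.8.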
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.