Detailed Description
The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
Existing methods rely mainly on features within a single frame and do not mine the dynamic relations among video frames, yet some tampering marks only reveal themselves dynamically when the whole video is viewed. The reason is that, when a video is tampered with, the forged face of each frame is fused back into the original frame independently, so the tampering traces of the individual frames of the same video are not identical and therefore vary dynamically across the video. For this reason, the inter-frame information of the video should be used during detection; mining dynamic features from this inter-frame information improves the detection of tampered videos. Based on this, an embodiment of the present invention provides a face-forged video detection method based on a multi-correlation frame attention mechanism, as shown in fig. 1, which mainly includes:
1. For the video to be detected, decode it into a frame sequence and extract the face image of each frame.
In the embodiment of the invention, Dlib (or another face detection module) can be used to detect the face in each frame; illustratively, the detected face area may be enlarged by a factor of 1.3 and the crop resized to 3 × 224 × 224, as sketched below.
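A minimal sketch of this decoding and face-cropping step is given below, assuming OpenCV and dlib are available; the function and variable names are illustrative and not part of the claimed method.

```python
# Illustrative sketch only: decode a video, detect the face in each frame with
# dlib, enlarge the box by 1.3x and resize the crop to 224 x 224.
import cv2
import dlib

detector = dlib.get_frontal_face_detector()

def extract_faces(video_path, scale=1.3, out_size=224):
    faces = []
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        dets = detector(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY), 1)
        if not dets:
            continue
        d = dets[0]
        # enlarge the detected box around its center by `scale`
        cx, cy = (d.left() + d.right()) / 2.0, (d.top() + d.bottom()) / 2.0
        hw, hh = (d.right() - d.left()) * scale / 2.0, (d.bottom() - d.top()) * scale / 2.0
        x0, y0 = max(int(cx - hw), 0), max(int(cy - hh), 0)
        x1, y1 = min(int(cx + hw), frame.shape[1]), min(int(cy + hh), frame.shape[0])
        crop = cv2.resize(frame[y0:y1, x0:x1], (out_size, out_size))
        faces.append(crop.transpose(2, 0, 1))  # to 3 x 224 x 224 (CHW)
    cap.release()
    return faces
```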
2. Select a frame from the frame sequence as the target frame and select N reference frames before and N reference frames after it. Extract features from the face images of the 2N+1 frames, compute the inter-frame attention information between the image feature of the target frame and the image feature of each reference frame, and average the inter-frame attention information of the frames before and of the frames after the target frame, respectively, to obtain the pre-frame and post-frame attention information of the target frame. The image feature of the target frame is then fused with the pre-frame and post-frame attention information.
In the embodiment of the invention, each image can be input into a feature extraction network with ResNet50 as the backbone to obtain the corresponding image feature; illustratively, the output size of the feature extraction network may be 256 × 29 × 29 (corresponding to C × H × W below).
The feature extraction network comprises five bottleneck layers, layer1 to layer5. Each bottleneck layer contains three convolution layers together with Batch Normalization and ReLU. The feature map produced by the last layer, layer5, is used as the output of the feature extraction module; a sketch is given below.
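A minimal sketch of such a feature extractor is given below, assuming PyTorch and torchvision (>= 0.13) are available. The exact bottleneck configuration that yields the 256 × 29 × 29 output described above is not reproduced here; this truncated ResNet50 backbone is only illustrative.

```python
# Illustrative sketch only: a ResNet50-based feature extraction network that
# returns an intermediate C x H x W feature map instead of class logits.
import torch
import torch.nn as nn
from torchvision.models import resnet50

class FeatureExtractor(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = resnet50(weights=None)
        # keep the stem and the first bottleneck stage (256 output channels);
        # the backbone described in the text uses five bottleneck stages.
        self.features = nn.Sequential(
            backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool,
            backbone.layer1,
        )

    def forward(self, x):           # x: B x 3 x 224 x 224
        return self.features(x)     # here B x 256 x 56 x 56 (size depends on the truncation)

features = FeatureExtractor()(torch.randn(1, 3, 224, 224))
print(features.shape)
```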
Preferred embodiments of this step are given below:
Training stage:
In the model training phase, N is set to 1; model training may of course be performed with other values, and N = 1 is used here only as an example.
For ease of understanding, fig. 1 gives a specific example in which three frames are selected: the middle frame is called the target frame and denoted F1, and the other two frames are reference frames, denoted F2 and F3. The image features extracted by the feature extraction network are denoted V1, V2, V3; these are 3-dimensional matrices of size C × H × W, where C, H, W denote the number of channels, height, and width, respectively. To facilitate subsequent calculations, they are reshaped into 2-dimensional matrices of size C × HW. Then, based on the inter-frame attention mechanism shown in fig. 3, the similarity matrices A12 between V1 and V2 and A13 between V1 and V3 are calculated:

A12 = V1^T W V2,  A13 = V1^T W V3
where W is a C × C weight parameter matrix, and the resulting similarity matrices A12 and A13 have size HW × HW. The attention maps Z12 and Z13 are then calculated; taking Z12 as an example:
Z12 = V2 A12, and similarly Z13 = V3 A13.
The target frame feature V1 is passed through a convolution layer to obtain G1, and the two attention maps Z12 and Z13 are each passed through a convolution layer and normalized with softmax to obtain the attention information I12 and I13:

G1 = W1 * V1 + b1

I12 = softmax(W2 * Z12 + b2)

I13 = softmax(W2 * Z13 + b2)

As will be understood by those skilled in the art, a convolution layer can be expressed mathematically by a weight W and an offset b; here W1, b1 and W2, b2 denote the weights and offsets of the two convolution layers that process the target frame feature and the attention maps, respectively, * denotes the convolution operation, and K is the number of convolution kernels, i.e. the outputs G1, I12, and I13 each have K channels.
Since N = 1 in this training-phase example, no averaging is required; fusion is performed directly by multiplying G1 with I12 and G1 with I13 on corresponding channels and concatenating the products:

[G1 ⊙ I12, G1 ⊙ I13]

where ⊙ denotes channel-wise multiplication and K represents the number of convolution kernels in the convolution layers, so the fusion result has 2K channels. A sketch of this fusion step follows.
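A minimal sketch of the inter-frame attention and the N = 1 training-stage fusion is given below, assuming PyTorch. The 1 × 1 convolutions, the softmax over spatial positions, and the kernel count K = 128 are implementation assumptions not fixed by the text.

```python
# Illustrative sketch only: bilinear similarity A = V1^T W V2, attention map
# Z = V2 A, convolution + softmax to obtain attention information, then
# channel-wise multiplication and concatenation for the N = 1 case.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttention(nn.Module):
    def __init__(self, channels=256, num_kernels=128):
        super().__init__()
        self.W = nn.Parameter(torch.randn(channels, channels) * 0.001)  # C x C weight matrix
        self.conv_g = nn.Conv2d(channels, num_kernels, kernel_size=1)   # target-frame branch
        self.conv_i = nn.Conv2d(channels, num_kernels, kernel_size=1)   # attention-map branch

    def forward(self, v_t, v_r):
        # v_t, v_r: B x C x H x W features of the target frame and one reference frame
        b, c, h, w = v_t.shape
        vt, vr = v_t.flatten(2), v_r.flatten(2)        # B x C x HW
        a = vt.transpose(1, 2) @ self.W @ vr           # similarity matrix, B x HW x HW
        z = (vr @ a).view(b, c, h, w)                  # attention map Z = V_r A
        g = self.conv_g(v_t)                           # G, B x K x H x W
        i = F.softmax(self.conv_i(z).flatten(2), dim=-1).view(b, -1, h, w)  # I, B x K x H x W
        return g, i

# Training-stage fusion for N = 1: concatenate G1*I12 and G1*I13 channel-wise.
attn = CrossAttention()
v1, v2, v3 = (torch.randn(1, 256, 29, 29) for _ in range(3))
g1, i12 = attn(v1, v2)
_, i13 = attn(v1, v3)
fused = torch.cat([g1 * i12, g1 * i13], dim=1)         # B x 2K x H x W
```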
Test stage (information fusion of a variable number of related frames):
As shown in fig. 2, which is a workflow diagram of the test-stage information fusion (where Cross-attention denotes the inter-frame attention mechanism shown in fig. 3), the N reference frames selected before and after the target frame Ft are denoted {Fb1, Fb2, ..., FbN} and {Fa1, Fa2, ..., FaN}, respectively. Features are extracted from the face images of the 2N+1 frames and denoted Vt, {Vb1, Vb2, ..., VbN} and {Va1, Va2, ..., VaN}.
The similarity matrix between the image feature extracted from each reference frame and the image feature Vt extracted from the target frame Ft is calculated as:

Amn = Vt^T W Vmn

where Vmn denotes the image feature extracted from a reference frame, m ∈ {a, b}, n = 1, 2, ..., N, and W is the C × C weight parameter matrix described above.
Each reference frame uses its own image feature Vmn and the corresponding similarity matrix Amn to calculate an attention map Zmn:
Zmn = Vmn Amn
The target frame feature Vt is passed through a convolution layer to obtain Gt, and each attention map Zmn is passed through a convolution layer (with K convolution kernels) and normalized using softmax to obtain the inter-frame attention information (fig. 2 does not draw these operations because the convolution and softmax belong to the Cross-attention module of fig. 3; fig. 2 only shows the convolution operation on the target frame feature Vt):

Imn = softmax(W2 * Zmn + b2)

After being processed by their respective convolution layers, the target frame feature and the attention maps Zmn each have K channels; i denotes the channel index below.
In this way, the inter-frame attention information between the image feature of the target frame and the image features of the N reference frames selected before and after it is obtained: {Ib1, Ib2, ..., IbN} and {Ia1, Ia2, ..., IaN}.
The inter-frame attention information is averaged to obtain the pre-frame attention information Ib and the post-frame attention information Ia of the target frame:

Ib = (1/N) Σ_{n=1..N} Ibn,  Ia = (1/N) Σ_{n=1..N} Ian
Gt is multiplied on corresponding channels with the pre-frame attention information Ib and with the post-frame attention information Ia, and the two products are concatenated to obtain the fusion result It:

It = [Gt ⊙ Ib, Gt ⊙ Ia]

where ⊙ denotes channel-wise multiplication over the K channels; a code sketch of this variable-N fusion follows.
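A minimal sketch of this variable-N test-stage fusion, reusing the CrossAttention module sketched above, is given below; the function name and argument layout are illustrative.

```python
# Illustrative sketch only: compute attention information against N reference
# frames before and after the target frame, average each group, and fuse.
import torch

def fuse_variable_frames(attn, v_t, refs_before, refs_after):
    # v_t: B x C x H x W target-frame feature; refs_*: lists of reference features
    i_before, i_after = [], []
    g_t = None
    for v_r in refs_before:
        g_t, i = attn(v_t, v_r)
        i_before.append(i)
    for v_r in refs_after:
        g_t, i = attn(v_t, v_r)
        i_after.append(i)
    i_b = torch.stack(i_before).mean(dim=0)    # pre-frame attention information Ib
    i_a = torch.stack(i_after).mean(dim=0)     # post-frame attention information Ia
    return torch.cat([g_t * i_b, g_t * i_a], dim=1)    # fusion result It, B x 2K x H x W
```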
3. The calculated fusion result is input into a prediction module, and whether the video to be detected is a face-forged video is judged based on a prediction made from the perspective of the whole video.
In the embodiment of the present invention, a prediction module as shown in fig. 4 is used for prediction. The prediction module comprises a convolution layer (Convolutional), an average pooling layer (Average Pooling), three fully-connected layers (Fully-connected), and a softmax layer. The convolution layer output is followed by Batch Normalization and a ReLU activation function. To prevent overfitting, dropout is added after the first two fully-connected layers, with the drop probability illustratively set to 0.75, together with a ReLU activation function. A sketch of this module is given below.
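A minimal sketch of such a prediction module is given below, assuming PyTorch; the input channel count (2K = 256 under the earlier K = 128 assumption) and the hidden layer width are illustrative.

```python
# Illustrative sketch only: convolution + BN + ReLU, average pooling, three
# fully-connected layers with dropout 0.75 after the first two, then softmax.
import torch
import torch.nn as nn

class PredictionModule(nn.Module):
    def __init__(self, in_channels=256, hidden=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(in_channels),
            nn.ReLU(inplace=True),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(in_channels, hidden), nn.ReLU(inplace=True), nn.Dropout(0.75),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True), nn.Dropout(0.75),
            nn.Linear(hidden, 2),
        )

    def forward(self, x):
        x = self.pool(self.conv(x)).flatten(1)
        return torch.softmax(self.fc(x), dim=1)   # two-dimensional output [p0, p1]
```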
The prediction module outputs a two-dimensional vector [p0, p1]. The cross-entropy loss function used in the training phase is expressed as:

L = -(y0 log p0 + y1 log p1)
where [y0, y1] is the label of a video in the training set: for a real (unforged) video, [y0, y1] = [1.0, 0.0]; for a fake video, [y0, y1] = [0.0, 1.0]; and [p0, p1] is the two-dimensional vector output by the prediction module.
In the test phase, p1/(p0 + p1) is calculated as the prediction score. For example, a threshold of 0.5 may be set: when the prediction score is greater than 0.5 the video is determined to be a fake video, and when it is less than 0.5 the video is determined to be a real video.
In the embodiment of the present invention, the network model shown in fig. 1 may be implemented based on a PyTorch framework.
Illustratively, the training phase is optimized with a stochastic gradient descent (SGD) optimizer with an initial learning rate of 0.001, weight decay of 0.0001, and momentum of 0.95; the batch size is set to 12, and all parameters in the network are initialized from a Gaussian distribution with mean 0 and variance 0.001. The model was trained on a server equipped with 2 NVIDIA RTX 2080 Ti GPUs, Intel Xeon E5-2695 CPUs, and the Ubuntu 16.04 operating system. As described above, 3 frames per video are used in the training phase, and the parameters are updated by back-propagating the cross-entropy loss. A sketch of this training setup follows.
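A minimal sketch of this training setup is given below, assuming PyTorch; `model` and `loader` are placeholders for the full network of fig. 1 and a data loader yielding (frames, label) pairs, with label index 1 denoting a fake video.

```python
# Illustrative sketch only: Gaussian initialization (mean 0, variance 0.001),
# SGD with lr 0.001, momentum 0.95, weight decay 0.0001, and cross-entropy loss
# computed on the softmax outputs [p0, p1].
import torch
import torch.nn as nn
import torch.nn.functional as F

def init_weights(m):
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.normal_(m.weight, mean=0.0, std=0.001 ** 0.5)  # variance 0.001
        if m.bias is not None:
            nn.init.zeros_(m.bias)

def train(model, loader, epochs=1):
    model.apply(init_weights)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.001,
                                momentum=0.95, weight_decay=0.0001)
    for _ in range(epochs):
        for frames, labels in loader:          # batch size 12, 3 frames per video
            probs = model(frames)              # [p0, p1] from the softmax layer
            # cross-entropy L = -(y0*log p0 + y1*log p1) on softmax outputs
            loss = F.nll_loss(torch.log(probs + 1e-8), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```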
After model training is finished, testing is performed; in the test stage no loss function is computed and no back-propagation is performed, and the parameters remain fixed. Unlike the training phase, the number N of selected related frames need not be 1; that is, the amount of fused inter-frame related information is variable.
The effects of the above method of the embodiment of the present invention are explained as follows:
Experiments were performed starting from N = 1 related frames, which is equivalent to taking one frame before and one frame after the target frame (i.e., the example given above with reference to fig. 1), and N was then increased. The experiments show that for a 10-second video the detection effect is best when N = 4, which is equivalent to taking 4 frames before and 4 frames after the target frame, 9 frames in total including the target frame, i.e., roughly one related frame per second. Continuing to increase the number of reference frames does not obviously improve the detection effect. This shows that the inter-frame attention mechanism and the multi-frame structure designed by the method are effective: the dynamic changes between frames can be extracted and the dynamic patterns of the forgery traces in tampered videos can be learned, thereby improving the detection effect. In implementation, the program and the trained model can be installed on a social media website or the background server of a short-video application to detect videos uploaded by users, so that forged videos made by various mainstream face-tampering methods can be effectively detected, the authenticity of uploaded videos is ensured, and the spread of false information through forged videos, together with its adverse effects, is prevented.
For more realistic tampered videos, the face replacement effect is quite vivid and it is difficult to judge the authenticity of the video from a single still frame, yet some flaws appear during dynamic playback. Compared with the prior art, the present scheme discovers the dynamic change information between frames by calculating attention maps among multiple frames. In addition, a multi-stream structure is used: a frame sequence of the video is taken as input and an attention map is calculated between each pair of frames, so that the inter-frame relations of the video are modeled and the analysis is carried out from the perspective of the whole video. With this design, not only the static marks of tampering but also the dynamic change patterns produced by tampering are learned, which enhances the detection performance and addresses the problem that inter-frame information is not utilized in existing methods.
The method achieves state-of-the-art results in experiments on the two face-tampering video datasets FaceForensics++ and Celeb-DF (V2). For the mainstream tampering methods Deepfakes, Face2Face, and FaceSwap, the accuracy exceeds 98%, and it remains above 95% even at extremely low image quality (FaceForensics++ c40). A model trained on FaceForensics++ c40 and tested on Celeb-DF (V2) reaches an AUC of 70.4. The experiments also show that increasing the number N of related frames in the test stage can improve detection accuracy; the appropriate number is related to the length of the video to be detected, with the best effect obtained at roughly one related frame per second, while using more frames improves the effect only slightly and consumes more computing resources. As can be seen from the visualization in fig. 5 of the convolution layer output after the attention information is fused with the target frame feature, the network focuses on tampered parts with large dynamic changes, such as the eyes and mouth, which contributes to correct judgment.
Through the above description of the embodiments, it is clear to those skilled in the art that the above embodiments can be implemented by software, and can also be implemented by software plus a necessary general hardware platform. With this understanding, the technical solutions of the embodiments can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.), and includes several instructions for enabling a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the methods according to the embodiments of the present invention.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.