Detailed Description
The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
Existing methods rely mainly on features within a single frame and do not mine the dynamic relations among video frames, yet some tampering marks only reveal themselves dynamically when the whole video is viewed. The reason is that, when a video is tampered with, the forged face of each frame is fused back into the original frame independently, so the tampering traces of the individual frames of the same video are not identical and therefore vary dynamically across the video. For this reason, the inter-frame information of the video should be used during detection; mining dynamic features from this inter-frame information improves the detection of tampered videos. Based on this, an embodiment of the present invention provides a face-forged video detection method based on a multi-correlation frame attention mechanism, as shown in fig. 1, which mainly includes:
1. For the video to be detected, decode it into a frame sequence and extract the face image of each frame.
In the embodiment of the invention, Dlib (or another face detection module) can be used to detect the face in each frame; illustratively, the detected face area may be enlarged by a factor of 1.3 and the crop resized to 3 × 224 × 224, as sketched below.
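A minimal sketch of this decoding and face-cropping step is given below, assuming OpenCV and dlib are available; the function and variable names are illustrative and not part of the claimed method.

```python
# Illustrative sketch only: decode a video, detect the face in each frame with
# dlib, enlarge the box by 1.3x and resize the crop to 224 x 224.
import cv2
import dlib

detector = dlib.get_frontal_face_detector()

def extract_faces(video_path, scale=1.3, out_size=224):
    faces = []
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        dets = detector(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY), 1)
        if not dets:
            continue
        d = dets[0]
        # enlarge the detected box around its center by `scale`
        cx, cy = (d.left() + d.right()) / 2.0, (d.top() + d.bottom()) / 2.0
        hw, hh = (d.right() - d.left()) * scale / 2.0, (d.bottom() - d.top()) * scale / 2.0
        x0, y0 = max(int(cx - hw), 0), max(int(cy - hh), 0)
        x1, y1 = min(int(cx + hw), frame.shape[1]), min(int(cy + hh), frame.shape[0])
        crop = cv2.resize(frame[y0:y1, x0:x1], (out_size, out_size))
        faces.append(crop.transpose(2, 0, 1))  # to 3 x 224 x 224 (CHW)
    cap.release()
    return faces
```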
2. Select a frame from the frame sequence as the target frame and select N reference frames before and N reference frames after it. Extract features from the face images of the 2N+1 frames, compute the inter-frame attention information between the image feature of the target frame and the image feature of each reference frame, and average the inter-frame attention information of the frames before and of the frames after the target frame, respectively, to obtain the pre-frame and post-frame attention information of the target frame. The image feature of the target frame is then fused with the pre-frame and post-frame attention information.
In the embodiment of the invention, each image can be input into a feature extraction network with ResNet50 as the backbone to obtain the corresponding image feature; illustratively, the output size of the feature extraction network may be 256 × 29 × 29 (corresponding to C × H × W below).
The feature extraction network comprises five bottleneck layers, layer1 to layer5. Each bottleneck layer contains three convolution layers together with Batch Normalization and ReLU. The feature map produced by the last layer, layer5, is used as the output of the feature extraction module; a sketch is given below.
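A minimal sketch of such a feature extractor is given below, assuming PyTorch and torchvision (>= 0.13) are available. The exact bottleneck configuration that yields the 256 × 29 × 29 output described above is not reproduced here; this truncated ResNet50 backbone is only illustrative.

```python
# Illustrative sketch only: a ResNet50-based feature extraction network that
# returns an intermediate C x H x W feature map instead of class logits.
import torch
import torch.nn as nn
from torchvision.models import resnet50

class FeatureExtractor(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = resnet50(weights=None)
        # keep the stem and the first bottleneck stage (256 output channels);
        # the backbone described in the text uses five bottleneck stages.
        self.features = nn.Sequential(
            backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool,
            backbone.layer1,
        )

    def forward(self, x):           # x: B x 3 x 224 x 224
        return self.features(x)     # here B x 256 x 56 x 56 (size depends on the truncation)

features = FeatureExtractor()(torch.randn(1, 3, 224, 224))
print(features.shape)
```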
Preferred embodiments of this step are given below:
Training stage:
In the model training phase, N is set to 1; model training may of course be performed with other values, and N = 1 is used here only as an example.
For ease of understanding, fig. 1 gives a specific example in which three frames are selected: the middle frame is called the target frame and denoted F1, and the other two frames are reference frames, denoted F2 and F3. The image features extracted by the feature extraction network are denoted V1, V2, V3; these are 3-dimensional matrices of size C × H × W, where C, H, W denote the number of channels, height, and width, respectively. To facilitate subsequent calculations, they are reshaped into 2-dimensional matrices of size C × HW. Then, based on the inter-frame attention mechanism shown in fig. 3, the similarity matrices A12 between V1 and V2 and A13 between V1 and V3 are calculated:

A12 = V1^T W V2,  A13 = V1^T W V3
where W is a C × C weight parameter matrix, and the resulting similarity matrices A12 and A13 have size HW × HW. The attention maps Z12 and Z13 are then calculated; taking Z12 as an example:
Z12 = V2 A12, and similarly Z13 = V3 A13.
The target frame feature V1 is passed through a convolution layer to obtain G1, and the two attention maps Z12 and Z13 are each passed through a convolution layer and normalized with softmax to obtain the attention information I12 and I13:

G1 = W1 * V1 + b1

I12 = softmax(W2 * Z12 + b2)

I13 = softmax(W2 * Z13 + b2)

As will be understood by those skilled in the art, a convolution layer can be expressed mathematically by a weight W and an offset b; here W1, b1 and W2, b2 denote the weights and offsets of the two convolution layers that process the target frame feature and the attention maps, respectively, * denotes the convolution operation, and K is the number of convolution kernels, i.e. the outputs G1, I12, and I13 each have K channels.
Since N = 1 in this training-phase example, no averaging is required; fusion is performed directly by multiplying G1 with I12 and G1 with I13 on corresponding channels and concatenating the products:

[G1 ⊙ I12, G1 ⊙ I13]

where ⊙ denotes channel-wise multiplication and K represents the number of convolution kernels in the convolution layers, so the fusion result has 2K channels. A sketch of this fusion step follows.
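A minimal sketch of the inter-frame attention and the N = 1 training-stage fusion is given below, assuming PyTorch. The 1 × 1 convolutions, the softmax over spatial positions, and the kernel count K = 128 are implementation assumptions not fixed by the text.

```python
# Illustrative sketch only: bilinear similarity A = V1^T W V2, attention map
# Z = V2 A, convolution + softmax to obtain attention information, then
# channel-wise multiplication and concatenation for the N = 1 case.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttention(nn.Module):
    def __init__(self, channels=256, num_kernels=128):
        super().__init__()
        self.W = nn.Parameter(torch.randn(channels, channels) * 0.001)  # C x C weight matrix
        self.conv_g = nn.Conv2d(channels, num_kernels, kernel_size=1)   # target-frame branch
        self.conv_i = nn.Conv2d(channels, num_kernels, kernel_size=1)   # attention-map branch

    def forward(self, v_t, v_r):
        # v_t, v_r: B x C x H x W features of the target frame and one reference frame
        b, c, h, w = v_t.shape
        vt, vr = v_t.flatten(2), v_r.flatten(2)        # B x C x HW
        a = vt.transpose(1, 2) @ self.W @ vr           # similarity matrix, B x HW x HW
        z = (vr @ a).view(b, c, h, w)                  # attention map Z = V_r A
        g = self.conv_g(v_t)                           # G, B x K x H x W
        i = F.softmax(self.conv_i(z).flatten(2), dim=-1).view(b, -1, h, w)  # I, B x K x H x W
        return g, i

# Training-stage fusion for N = 1: concatenate G1*I12 and G1*I13 channel-wise.
attn = CrossAttention()
v1, v2, v3 = (torch.randn(1, 256, 29, 29) for _ in range(3))
g1, i12 = attn(v1, v2)
_, i13 = attn(v1, v3)
fused = torch.cat([g1 * i12, g1 * i13], dim=1)         # B x 2K x H x W
```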
Test stage (information fusion of a variable number of related frames):
As shown in fig. 2, which is a workflow diagram of the test-stage information fusion (where Cross-attention denotes the inter-frame attention mechanism shown in fig. 3), the N reference frames selected before and after the target frame Ft are denoted {Fb1, Fb2, ..., FbN} and {Fa1, Fa2, ..., FaN}, respectively. Features are extracted from the face images of the 2N+1 frames and denoted Vt, {Vb1, Vb2, ..., VbN} and {Va1, Va2, ..., VaN}.
The similarity matrix between the image feature extracted from each reference frame and the image feature Vt extracted from the target frame Ft is calculated as:

Amn = Vt^T W Vmn

where Vmn denotes the image feature extracted from a reference frame, m ∈ {a, b}, n = 1, 2, ..., N, and W is the C × C weight parameter matrix described above.
Each reference frame uses its own image feature Vmn and the corresponding similarity matrix Amn to calculate an attention map Zmn:
Zmn = Vmn Amn
The target frame feature Vt is passed through a convolution layer to obtain Gt, and each attention map Zmn is passed through a convolution layer (with K convolution kernels) and normalized using softmax to obtain the inter-frame attention information (fig. 2 does not draw these operations because the convolution and softmax belong to the Cross-attention module of fig. 3; fig. 2 only shows the convolution operation on the target frame feature Vt):

Imn = softmax(W2 * Zmn + b2)

After being processed by their respective convolution layers, the target frame feature and the attention maps Zmn each have K channels; i denotes the channel index below.
In this way, the inter-frame attention information between the image feature of the target frame and the image features of the N reference frames selected before and after it is obtained: {Ib1, Ib2, ..., IbN} and {Ia1, Ia2, ..., IaN}.
The inter-frame attention information is averaged to obtain the pre-frame attention information Ib and the post-frame attention information Ia of the target frame:

Ib = (1/N) Σ_{n=1..N} Ibn,  Ia = (1/N) Σ_{n=1..N} Ian
Gt is multiplied on corresponding channels with the pre-frame attention information Ib and with the post-frame attention information Ia, and the two products are concatenated to obtain the fusion result It:

It = [Gt ⊙ Ib, Gt ⊙ Ia]

where ⊙ denotes channel-wise multiplication over the K channels; a code sketch of this variable-N fusion follows.
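A minimal sketch of this variable-N test-stage fusion, reusing the CrossAttention module sketched above, is given below; the function name and argument layout are illustrative.

```python
# Illustrative sketch only: compute attention information against N reference
# frames before and after the target frame, average each group, and fuse.
import torch

def fuse_variable_frames(attn, v_t, refs_before, refs_after):
    # v_t: B x C x H x W target-frame feature; refs_*: lists of reference features
    i_before, i_after = [], []
    g_t = None
    for v_r in refs_before:
        g_t, i = attn(v_t, v_r)
        i_before.append(i)
    for v_r in refs_after:
        g_t, i = attn(v_t, v_r)
        i_after.append(i)
    i_b = torch.stack(i_before).mean(dim=0)    # pre-frame attention information Ib
    i_a = torch.stack(i_after).mean(dim=0)     # post-frame attention information Ia
    return torch.cat([g_t * i_b, g_t * i_a], dim=1)    # fusion result It, B x 2K x H x W
```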
3. The calculated fusion result is input into a prediction module, and whether the video to be detected is a face-forged video is judged based on a prediction made from the perspective of the whole video.
In the embodiment of the present invention, a prediction module as shown in fig. 4 is used for prediction. The prediction module comprises a convolution layer (Convolutional), an average pooling layer (Average Pooling), three fully-connected layers (Fully-connected), and a softmax layer. The convolution layer output is followed by Batch Normalization and a ReLU activation function. To prevent overfitting, dropout is added after the first two fully-connected layers, with the drop probability illustratively set to 0.75, together with a ReLU activation function. A sketch of this module is given below.
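A minimal sketch of such a prediction module is given below, assuming PyTorch; the input channel count (2K = 256 under the earlier K = 128 assumption) and the hidden layer width are illustrative.

```python
# Illustrative sketch only: convolution + BN + ReLU, average pooling, three
# fully-connected layers with dropout 0.75 after the first two, then softmax.
import torch
import torch.nn as nn

class PredictionModule(nn.Module):
    def __init__(self, in_channels=256, hidden=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(in_channels),
            nn.ReLU(inplace=True),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(in_channels, hidden), nn.ReLU(inplace=True), nn.Dropout(0.75),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True), nn.Dropout(0.75),
            nn.Linear(hidden, 2),
        )

    def forward(self, x):
        x = self.pool(self.conv(x)).flatten(1)
        return torch.softmax(self.fc(x), dim=1)   # two-dimensional output [p0, p1]
```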
The prediction module outputs a two-dimensional vector [p0, p1]. The cross-entropy loss function used in the training phase is expressed as:

L = -(y0 log p0 + y1 log p1)
where [y0, y1] is the label of a video in the training set: for a real (unforged) video, [y0, y1] = [1.0, 0.0]; for a fake video, [y0, y1] = [0.0, 1.0]; and [p0, p1] is the two-dimensional vector output by the prediction module.
In the test phase, p1/(p0 + p1) is calculated as the prediction score. For example, a threshold of 0.5 may be set: when the prediction score is greater than 0.5 the video is determined to be a fake video, and when it is less than 0.5 the video is determined to be a real video.
In the embodiment of the present invention, the network model shown in fig. 1 may be implemented based on a PyTorch framework.
Illustratively, the training phase is optimized with a stochastic gradient descent (SGD) optimizer with an initial learning rate of 0.001, weight decay of 0.0001, and momentum of 0.95; the batch size is set to 12, and all parameters in the network are initialized from a Gaussian distribution with mean 0 and variance 0.001. The model was trained on a server equipped with 2 NVIDIA RTX 2080 Ti GPUs, Intel Xeon E5-2695 CPUs, and the Ubuntu 16.04 operating system. As described above, 3 frames per video are used in the training phase, and the parameters are updated by back-propagating the cross-entropy loss. A sketch of this training setup follows.
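A minimal sketch of this training setup is given below, assuming PyTorch; `model` and `loader` are placeholders for the full network of fig. 1 and a data loader yielding (frames, label) pairs, with label index 1 denoting a fake video.

```python
# Illustrative sketch only: Gaussian initialization (mean 0, variance 0.001),
# SGD with lr 0.001, momentum 0.95, weight decay 0.0001, and cross-entropy loss
# computed on the softmax outputs [p0, p1].
import torch
import torch.nn as nn
import torch.nn.functional as F

def init_weights(m):
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.normal_(m.weight, mean=0.0, std=0.001 ** 0.5)  # variance 0.001
        if m.bias is not None:
            nn.init.zeros_(m.bias)

def train(model, loader, epochs=1):
    model.apply(init_weights)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.001,
                                momentum=0.95, weight_decay=0.0001)
    for _ in range(epochs):
        for frames, labels in loader:          # batch size 12, 3 frames per video
            probs = model(frames)              # [p0, p1] from the softmax layer
            # cross-entropy L = -(y0*log p0 + y1*log p1) on softmax outputs
            loss = F.nll_loss(torch.log(probs + 1e-8), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```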
After model training is finished, testing is performed; in the test stage no loss function is computed and no back-propagation is performed, and the parameters remain fixed. Unlike the training phase, the number N of selected related frames need not be 1; that is, the amount of fused inter-frame related information is variable.
The effects of the above method of the embodiment of the present invention are explained as follows:
Experiments were performed starting from N = 1 related frames, which is equivalent to taking one frame before and one frame after the target frame (i.e., the example given above with reference to fig. 1), and N was then increased. The experiments show that for a 10-second video the detection effect is best when N = 4, which is equivalent to taking 4 frames before and 4 frames after the target frame, 9 frames in total including the target frame, i.e., roughly one related frame per second. Continuing to increase the number of reference frames does not obviously improve the detection effect. This shows that the inter-frame attention mechanism and the multi-frame structure designed by the method are effective: the dynamic changes between frames can be extracted and the dynamic patterns of the forgery traces in tampered videos can be learned, thereby improving the detection effect. In implementation, the program and the trained model can be installed on a social media website or the background server of a short-video application to detect videos uploaded by users, so that forged videos made by various mainstream face-tampering methods can be effectively detected, the authenticity of uploaded videos is ensured, and the spread of false information through forged videos, together with its adverse effects, is prevented.
For more realistic tampered videos, the face replacement effect is quite vivid and it is difficult to judge the authenticity of the video from a single still frame, yet some flaws appear during dynamic playback. Compared with the prior art, the present scheme discovers the dynamic change information between frames by calculating attention maps among multiple frames. In addition, a multi-stream structure is used: a frame sequence of the video is taken as input and an attention map is calculated between each pair of frames, so that the inter-frame relations of the video are modeled and the analysis is carried out from the perspective of the whole video. With this design, not only the static marks of tampering but also the dynamic change patterns produced by tampering are learned, which enhances the detection performance and addresses the problem that inter-frame information is not utilized in existing methods.
The method achieves state-of-the-art results in experiments on the two face-tampering video datasets FaceForensics++ and Celeb-DF (V2). For the mainstream tampering methods Deepfakes, Face2Face, and FaceSwap, the accuracy exceeds 98%, and it remains above 95% even at extremely low image quality (FaceForensics++ c40). A model trained on FaceForensics++ c40 and tested on Celeb-DF (V2) reaches an AUC of 70.4. The experiments also show that increasing the number N of related frames in the test stage can improve detection accuracy; the appropriate number is related to the length of the video to be detected, with the best effect obtained at roughly one related frame per second, while using more frames improves the effect only slightly and consumes more computing resources. As can be seen from the visualization in fig. 5 of the convolution layer output after the attention information is fused with the target frame feature, the network focuses on tampered parts with large dynamic changes, such as the eyes and mouth, which contributes to correct judgment.
Through the above description of the embodiments, it is clear to those skilled in the art that the above embodiments can be implemented by software, and can also be implemented by software plus a necessary general hardware platform. With this understanding, the technical solutions of the embodiments can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.), and includes several instructions for enabling a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the methods according to the embodiments of the present invention.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.