CN116091429A - A detection method for spliced tampered images - Google Patents

A detection method for spliced tampered images

Info

Publication number
CN116091429A
Authority
CN
China
Prior art keywords
feature map
convolution
size
shaped block
padding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211715677.4A
Other languages
Chinese (zh)
Other versions
CN116091429B (en)
Inventor
严彩萍
李树原
李红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Normal University
Original Assignee
Hangzhou Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Normal University filed Critical Hangzhou Normal University
Priority to CN202211715677.4A priority Critical patent/CN116091429B/en
Publication of CN116091429A publication Critical patent/CN116091429A/en
Application granted granted Critical
Publication of CN116091429B publication Critical patent/CN116091429B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/0002 Inspection of images, e.g. flaw detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/60 Protecting data
    • G06F 21/64 Protecting data integrity, e.g. using checksums, certificates or signatures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30204 Marker
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • Multimedia (AREA)
  • Computer Security & Cryptography (AREA)
  • Quality & Reliability (AREA)
  • Bioethics (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)
  • Editing Of Facsimile Originals (AREA)

Abstract

The invention discloses a method for detecting spliced tampered images. The method first collects and downloads public image tampering datasets, namely the CASIA and COLUMBIA datasets, using one part as a training set and the rest as a test set. A hybrid transformer neural network, comprising a feature extraction module, a self-attention U-shaped block, a cross-attention module and a feature decoding module, is then trained on the training set. Finally, the trained network model is applied to the test-set images to obtain the final detection results. The method integrates self-attention and cross-attention into U2-Net, capturing more contextual information and spatial correlation at different scales and improving the correctness of the predictions. The invention exploits the inductive bias of convolution to avoid large-scale pre-training, and can locate spliced and tampered regions of various scales.

Description

A method for detecting spliced tampered images

Technical Field

The present invention belongs to the field of computer technology, in particular computer vision and digital image processing, and relates to a method for detecting spliced tampered images, specifically an image tampering detection method based on a hybrid transformer neural network that can accurately locate tampered regions at different scales with good localization performance.

Background Art

With the rapid development of modern mobile devices, digital images have become very easy to generate and transmit. At the same time, image editing software is simple to operate, making it easy for anyone to modify an image. Spliced and tampered images may be maliciously abused and have a negative impact on society and the state. Detecting spliced tampered images has therefore become increasingly important.

At present, techniques for detecting tampered image regions fall into two main categories: methods based on traditional feature extraction and methods based on convolutional neural networks (CNNs). Most traditional methods detect a specific image fingerprint, such as color filter array interpolation artifacts, sensor pattern noise (PRNU) or illumination inconsistency. Such methods can only detect one particular fingerprint, so detection fails when that fingerprint is absent or weak. Moreover, a specific fingerprint is affected by post-processing such as image blurring, JPEG compression and downsampling, which degrades the results of traditional methods. In recent years, many CNN-based tampering detection methods have been proposed and have shown better performance than traditional methods. These methods can extract multiple image fingerprints simultaneously, making up for traditional methods' reliance on a single image attribute, their lack of generalization and their poor robustness. For example, Faster R-CNN uses an RGB stream and a noise stream to detect the tampered regions of a given image: an SRM filter layer extracts noise features from the tampered image, allowing the model to capture the noise inconsistency between tampered and authentic regions. However, this method can only produce region-level localization results. The RRU-Net architecture captures the features that distinguish tampered from non-tampered regions through residual propagation and residual feedback modules, enhancing the learning capacity of the CNN. However, RRU-Net is poor at extracting global features, leading to weak localization of large-scale tampered regions and a high false-detection rate. MWC-Net can learn more comprehensive and representative features, yet it still ignores global features, which may lead to poor performance when locating large-scale tampered regions.

Most previous work ignores the fact that tampered regions vary in size and struggles to locate regions of different sizes. Owing to the inherent locality of the convolution operation, CNN-based methods have difficulty learning explicit global semantic relationships and jointly exploiting local and global features. Consequently, most CNN-based detection methods can only handle a limited range of scale variation, and they may suffer from incomplete localization or high false-detection rates when locating large-scale tampered regions.

Summary of the Invention

To address the deficiencies of the prior art, the present invention provides an image tampering detection method based on a convolutional neural network.

The method of the present invention specifically comprises:

Step (1): Collect and download public image tampering datasets, including the CASIA and COLUMBIA datasets; take 80–90% of the images as the training set and the rest as the test set.

Step (2): Train a hybrid transformer neural network on the training set. The network comprises a feature extraction module, a self-attention U-shaped block, a cross-attention module and a feature decoding module.

Further, the feature extraction module uses the encoder part of U2-Net to extract the feature map X of the input image. The 288×288×3 input image is first fed into residual U-shaped block-7, which performs: two convolutions with 3×3 kernels, stride 1 and padding 1, each followed by batch normalization (norm) and a ReLU activation; five convolutions with 3×3 kernels, stride 1 and padding 1 plus max-pooling downsampling, each followed by batch normalization and ReLU; one convolution with a 3×3 kernel, stride 1, padding 2 and dilation rate 2, followed by batch normalization and ReLU; one convolution with a 3×3 kernel, stride 1 and padding 1, followed by batch normalization and ReLU; and five convolutions with 3×3 kernels, stride 1 and padding 1 plus upsampling, each followed by batch normalization and ReLU. The output feature map of the first convolution is then added to the feature map of the last convolution, finally yielding 64 feature maps; a max-pooling downsampling operation on the block-7 output halves the feature-map resolution. The feature maps are then fed in turn into residual U-shaped block-6, block-5, block-4 and block-4F, finally producing a 9×9×512 feature map.
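For concreteness, a minimal PyTorch sketch of one such residual U-shaped block follows. The class names (ConvBNReLU, RSU7) and channel widths are illustrative assumptions, not the patent's own code; only the overall pattern described above (conv-BN-ReLU stacks, max-pool downsampling, a dilated bottleneck, upsampling, and the residual addition of the first convolution's output) is reproduced.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvBNReLU(nn.Module):
    """3x3 convolution -> batch normalization -> ReLU, as used throughout the text."""
    def __init__(self, in_ch, out_ch, dilation=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, stride=1,
                              padding=dilation, dilation=dilation)
        self.bn = nn.BatchNorm2d(out_ch)
    def forward(self, x):
        return F.relu(self.bn(self.conv(x)))

class RSU7(nn.Module):
    """Residual U-shaped block-7 (sketch): conv encoder with max-pool
    downsampling, a dilated bottleneck, a conv decoder with upsampling,
    and a residual connection from the first conv to the output."""
    def __init__(self, in_ch=3, mid_ch=32, out_ch=64):
        super().__init__()
        self.conv_in = ConvBNReLU(in_ch, out_ch)   # first conv (residual source)
        self.enc = nn.ModuleList(
            [ConvBNReLU(out_ch, mid_ch)] +
            [ConvBNReLU(mid_ch, mid_ch) for _ in range(5)])   # encoder convs
        self.pool = nn.MaxPool2d(2, stride=2)
        self.bottleneck = ConvBNReLU(mid_ch, mid_ch, dilation=2)  # dilated conv
        self.dec = nn.ModuleList(
            [ConvBNReLU(mid_ch * 2, mid_ch) for _ in range(5)] +
            [ConvBNReLU(mid_ch * 2, out_ch)])                 # decoder convs
    def forward(self, x):
        hx = self.conv_in(x)
        skips, h = [], hx
        for i, enc in enumerate(self.enc):
            h = enc(h)
            skips.append(h)
            if i < len(self.enc) - 1:
                h = self.pool(h)                  # max-pool downsampling
        h = self.bottleneck(h)
        for dec in self.dec:
            h = dec(torch.cat([h, skips.pop()], dim=1))
            if skips:                             # upsample back toward input size
                h = F.interpolate(h, size=skips[-1].shape[2:],
                                  mode='bilinear', align_corners=False)
        return h + hx                             # residual addition
```

Instantiated as RSU7(3, 32, 64), a 1×3×288×288 input yields a 1×64×288×288 output, matching the 64 feature maps described above.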

Further, the self-attention U-shaped block uses self-attention to model the long-range dependencies of the input image and to extract its global information. The 9×9×512 feature map is taken as input and passed through: two convolutions with 3×3 kernels, stride 1 and padding 1, each followed by batch normalization and ReLU; one convolution with a 3×3 kernel, stride 1, padding 2 and dilation rate 2, followed by batch normalization and ReLU; one convolution with a 3×3 kernel, stride 1, padding 4 and dilation rate 4, followed by batch normalization and ReLU; and one convolution with a 3×3 kernel, stride 1, padding 8 and dilation rate 8, followed by batch normalization and ReLU.

The feature map is then fed into the self-attention module. A positional encoding is added to the input feature map X ∈ R^(d×H×W), where H and W are the height and width of the feature map, d is the number of channels and R denotes the real field. X is then flattened and transposed into a sequence of size n×d, with n = H×W. Three 1×1 convolutions project X into the query, key and value embeddings Q, K, V ∈ R^(d×n), where Q is the query matrix, K the key matrix and V the value matrix. The output of self-attention is a scaled dot product:

SA(Q, K, V) = V·Aᵀ, with A = softmax(QᵀK/√d) ∈ R^(n×n),

where A is the attention matrix and the superscript T denotes transposition.

One more convolution with a 3×3 kernel, stride 1, padding 4 and dilation rate 4 is then performed, followed by batch normalization and ReLU; then a convolution with a 3×3 kernel, stride 1, padding 2 and dilation rate 2, followed by batch normalization and ReLU; then a convolution with a 3×3 kernel, stride 1 and padding 1, followed by batch normalization and ReLU. Finally, the output feature map of the first convolution is added to the feature map of the last convolution, giving an output feature map of size 9×9×512.
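A minimal sketch of the scaled dot-product self-attention just described, assuming the shapes given in the text (d channels, n = H×W tokens) and a learned positional encoding; the class name SelfAttention2d and the parameterization are illustrative.

```python
import torch
import torch.nn as nn

class SelfAttention2d(nn.Module):
    """Self-attention over a d x H x W feature map: positional encoding,
    1x1-conv Q/K/V projections, scaled dot-product attention (sketch)."""
    def __init__(self, d, h, w):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(1, d, h, w))  # learned positional encoding (assumption)
        self.q = nn.Conv2d(d, d, 1)
        self.k = nn.Conv2d(d, d, 1)
        self.v = nn.Conv2d(d, d, 1)
        self.d = d

    def forward(self, x):                  # x: (B, d, H, W)
        b, d, h, w = x.shape
        x = x + self.pos
        q = self.q(x).flatten(2)           # (B, d, n), n = H*W
        k = self.k(x).flatten(2)
        v = self.v(x).flatten(2)
        # attention matrix A in R^(n x n): softmax(Q^T K / sqrt(d))
        attn = torch.softmax(q.transpose(1, 2) @ k / self.d ** 0.5, dim=-1)
        out = v @ attn.transpose(1, 2)     # V . A^T, back to (B, d, n)
        return out.view(b, d, h, w)

# e.g. SelfAttention2d(512, 9, 9)(torch.randn(1, 512, 9, 9)) -> (1, 512, 9, 9)
```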

Further, the cross-attention module uses cross-attention to filter non-semantic features between the feature extraction module and the feature decoding module. A positional encoding is first added to the low-level feature map U ∈ R^(d×2H×2W), where H and W are the height and width of the feature map, d is the number of channels and R denotes the real field. The low-level feature map U is then downsampled once and used as the value matrix of the cross-attention module. A positional encoding is added to the high-level feature map N ∈ R^(2d×H×W), whose channel count is changed to d by a 1×1 convolution before it is used as the query and key matrices.

The attention matrix A ∈ R^(n×n), with n = H×W, is then computed using matrix multiplication and the softmax function. The computed weight values are rescaled by a ReLU activation to obtain the result S, which acts as a filter whose low-magnitude elements indicate noise or irrelevant regions to be suppressed.

The Hadamard product of U and S yields a version of U with the non-semantic features filtered out.

Finally, the result of the filtering operation is concatenated with the high-level feature map N.
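The text leaves open exactly how the n×n attention result is turned into a spatial filter for the larger low-level map U; the sketch below assumes the ReLU-rescaled attended values are reshaped to H×W and upsampled to U's resolution before the Hadamard product. All names are illustrative, and the concatenation with the upsampled N happens later, in the decoder. Positional encodings are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttentionFilter(nn.Module):
    """Cross-attention filter between a low-level encoder feature U and a
    high-level decoder feature N (sketch under stated assumptions)."""
    def __init__(self, d):
        super().__init__()
        self.q = nn.Conv2d(2 * d, d, 1)        # 1x1 conv: channels 2d -> d (query)
        self.k = nn.Conv2d(2 * d, d, 1)        # 1x1 conv: channels 2d -> d (key)
        self.d = d

    def forward(self, u, n_hi):                # u: (B,d,2H,2W), n_hi: (B,2d,H,W)
        b, d, H2, W2 = u.shape
        _, _, H, W = n_hi.shape
        v = F.max_pool2d(u, 2).flatten(2)      # downsample U once -> value (B,d,n)
        q = self.q(n_hi).flatten(2)            # (B,d,n)
        k = self.k(n_hi).flatten(2)
        attn = torch.softmax(q.transpose(1, 2) @ k / self.d ** 0.5, dim=-1)
        s = F.relu(v @ attn.transpose(1, 2))   # ReLU rescaling -> filter S (B,d,n)
        s = s.view(b, d, H, W)
        s = F.interpolate(s, size=(H2, W2), mode='bilinear', align_corners=False)
        return u * s                           # Hadamard product: filtered U
```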

Further, the feature decoding module decodes the feature maps carrying both global and local information and predicts the tampered region at pixel level. The feature map produced by the self-attention U-shaped block is upsampled and concatenated with the output feature map of the cross-attention module, and the result is fed into residual U-shaped block-4F; the output of block-4F is upsampled, concatenated with the cross-attention output and fed into block-4; the output of block-4 is upsampled, concatenated with the cross-attention output and fed into block-5; the output of block-5 is upsampled, concatenated with the cross-attention output and fed into block-6; the output of block-6 is upsampled, concatenated with the cross-attention output and fed into block-7. The 288×288×64 feature map output by block-7 passes through a 3×3 convolution layer to give a 288×288×1 feature map, and then through a Sigmoid activation to obtain a single-channel tampering-probability mask of size 288×288×1. The whole network is optimized by minimizing the cross-entropy loss with the stochastic gradient descent algorithm:

L = −Σ_(r,c) [ y(r,c)·log p(r,c) + (1 − y(r,c))·log(1 − p(r,c)) ],

where (r,c) are pixel coordinates and p(r,c) and y(r,c) denote the predicted pixel value for the input image and the ground-truth pixel value, respectively.

Step (3): Test the test-set images with the trained network model to obtain the final detection results.

The ReLU activation is defined as f(z_i) = max(0, z_i), where z_i is the result of the convolution operation: f(z_i) = 0 if z_i ≤ 0, and f(z_i) = z_i if z_i > 0. The Sigmoid activation is defined as

f(z_i) = 1 / (1 + e^(−z_i)),

where z_i is the result of the convolution operation.

The present invention integrates self-attention and cross-attention into U2-Net, capturing more contextual information and spatial correlation at different scales. The cross-attention module filters out non-semantic features, enhances the low-level feature maps passed through the skip connections under the guidance of high-level semantic information, and enables fine spatial recovery in the decoder, ultimately improving the correctness of the predictions. Secondly, the last block of the encoder applies self-attention so as to combine the advantages of convolution and the self-attention mechanism. The invention can therefore exploit the inductive bias of convolution to avoid large-scale pre-training, together with the Transformer's ability to model explicit global and long-range semantic dependencies. In short, by fusing convolution and the Transformer, the invention can locate spliced tampered regions of various scales, achieving state-of-the-art performance on the two public image tampering datasets CASIA and COLUMBIA.

Detailed Description

The technical solution of the present invention is further described below through specific embodiments.

A method for detecting spliced tampered images proceeds as follows:

Step (1): Collect and download public image tampering datasets, including the CASIA and COLUMBIA datasets; take 80–90% of the images as the training set and the rest as the test set. In this embodiment, 85% is used as the training set.

Step (2): Train a hybrid transformer neural network on the training set. The network comprises a feature extraction module, a self-attention U-shaped block, a cross-attention module and a feature decoding module.

The feature extraction module uses the encoder part of U2-Net to extract the feature map X of the input image; specifically:

The 288×288×3 image is fed into residual U-shaped block-7: two convolutions with 3×3 kernels, stride 1 and padding 1, each followed by batch normalization (norm) and ReLU; then five convolutions with 3×3 kernels, stride 1 and padding 1 plus max-pooling downsampling, each followed by batch normalization and ReLU; then one convolution with a 3×3 kernel, stride 1, padding 2 and dilation rate 2, followed by batch normalization and ReLU; then one convolution with a 3×3 kernel, stride 1 and padding 1, followed by batch normalization and ReLU; then five convolutions with 3×3 kernels, stride 1 and padding 1 plus upsampling, each followed by batch normalization and ReLU. The output feature map of the first convolution is added to the feature map of the last convolution, giving a 288×288×64 output; after the max-pooling layer (pool), the final output is a 144×144×64 feature map.

The 144×144×64 feature map is then fed into residual U-shaped block-6: two convolutions with 3×3 kernels, stride 1 and padding 1, each followed by batch normalization and ReLU; then four convolutions with 3×3 kernels, stride 1 and padding 1 plus max-pooling downsampling, each followed by batch normalization and ReLU; then one convolution with a 3×3 kernel, stride 1, padding 2 and dilation rate 2, followed by batch normalization and ReLU; then one convolution with a 3×3 kernel, stride 1 and padding 1, followed by batch normalization and ReLU; then four convolutions with 3×3 kernels, stride 1 and padding 1 plus upsampling, each followed by batch normalization and ReLU. The output of the first convolution is added to the feature map of the last convolution, giving a 144×144×128 output; after the max-pooling layer, the final output is a 72×72×128 feature map.

The 72×72×128 feature map is then fed into residual U-shaped block-5: two convolutions with 3×3 kernels, stride 1 and padding 1, each followed by batch normalization and ReLU; then three convolutions with 3×3 kernels, stride 1 and padding 1 plus max-pooling downsampling, each followed by batch normalization and ReLU; then one convolution with a 3×3 kernel, stride 1, padding 2 and dilation rate 2, followed by batch normalization and ReLU; then one convolution with a 3×3 kernel, stride 1 and padding 1, followed by batch normalization and ReLU; then three convolutions with 3×3 kernels, stride 1 and padding 1 plus upsampling, each followed by batch normalization and ReLU. The output of the first convolution is added to the feature map of the last convolution, giving a 72×72×256 output; after the max-pooling layer, the final output is a 36×36×256 feature map.

The 36×36×256 feature map is then fed into residual U-shaped block-4: two convolutions with 3×3 kernels, stride 1 and padding 1, each followed by batch normalization and ReLU; then two convolutions with 3×3 kernels, stride 1 and padding 1 plus max-pooling downsampling, each followed by batch normalization and ReLU; then one convolution with a 3×3 kernel, stride 1, padding 2 and dilation rate 2, followed by batch normalization and ReLU; then one convolution with a 3×3 kernel, stride 1 and padding 1, followed by batch normalization and ReLU; then two convolutions with 3×3 kernels, stride 1 and padding 1 plus upsampling, each followed by batch normalization and ReLU. The output of the first convolution is added to the feature map of the last convolution, giving a 36×36×512 output; after the max-pooling layer, the final output is an 18×18×512 feature map.

The 18×18×512 feature map is then fed into residual U-shaped block-4F: two convolutions with 3×3 kernels, stride 1 and padding 1; then one convolution with a 3×3 kernel, stride 1, padding 2 and dilation rate 2; then one with padding 4 and dilation rate 4; then one with padding 8 and dilation rate 8; then one with padding 4 and dilation rate 4; then one with padding 2 and dilation rate 2; then one convolution with a 3×3 kernel, stride 1 and padding 1; each convolution is followed by batch normalization and ReLU. The output of the first convolution is added to the feature map of the last convolution, giving an 18×18×512 output; after the max-pooling layer, the final output is a 9×9×512 feature map.
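The shape bookkeeping of the whole encoder can be summarized in a short sketch; RSU6, RSU5, RSU4 and RSU4F are assumed to be defined analogously to the RSU7 sketch above, and the comments restate the sizes given in the text.

```python
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """U2-Net-style encoder (sketch). Blocks RSU6..RSU4F are assumed to be
    defined analogously to the RSU7 sketch above."""
    def __init__(self):
        super().__init__()
        self.stages = nn.ModuleList([
            RSU7(3, 32, 64),       # 288x288x3   -> 288x288x64,  pool -> 144x144x64
            RSU6(64, 32, 128),     # 144x144x64  -> 144x144x128, pool -> 72x72x128
            RSU5(128, 64, 256),    # 72x72x128   -> 72x72x256,   pool -> 36x36x256
            RSU4(256, 128, 512),   # 36x36x256   -> 36x36x512,   pool -> 18x18x512
            RSU4F(512, 256, 512),  # 18x18x512   -> 18x18x512,   pool -> 9x9x512
        ])

    def forward(self, x):
        skips = []
        for stage in self.stages:
            x = stage(x)
            skips.append(x)             # kept for the decoder's skip connections
            x = F.max_pool2d(x, 2)      # halve the resolution after each block
        return x, skips                 # x: (B, 512, 9, 9)
```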

In the batch-normalization layer (norm), the transformation of each x_i in a data batch B = {x_1, …, x_i, …, x_n} into y_i is defined as

y_i = γ·(x_i − μ_B)/√(σ_B² + ε) + β,

where μ_B and σ_B² denote the mean and variance of batch B, ε is a small positive constant used to avoid division by zero, and {γ, β} are learnable scale and shift parameters that restore the representational range of the normalized values and keep training from drifting.
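For concreteness, a minimal sketch of this normalization (training-mode statistics only, without the running averages a full BatchNorm layer keeps):

```python
import torch

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize a batch with its own mean and variance, then apply the
    learnable scale gamma and shift beta (training-mode sketch)."""
    mu = x.mean(dim=0)                  # mu_B: per-feature batch mean
    var = x.var(dim=0, unbiased=False)  # sigma_B^2: per-feature batch variance
    return gamma * (x - mu) / torch.sqrt(var + eps) + beta
```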

The self-attention U-shaped block uses self-attention to model the long-range dependencies of the input image and to extract its global information; specifically:

The 9×9×512 feature map is taken as input and passed through: two convolutions with 3×3 kernels, stride 1 and padding 1, each followed by batch normalization and ReLU; then one convolution with a 3×3 kernel, stride 1, padding 2 and dilation rate 2, followed by batch normalization and ReLU; then one convolution with a 3×3 kernel, stride 1, padding 4 and dilation rate 4, followed by batch normalization and ReLU; then one convolution with a 3×3 kernel, stride 1, padding 8 and dilation rate 8, followed by batch normalization and ReLU.

The feature map is then fed into the self-attention module. To take absolute semantic information into account, a positional encoding is first added to the input feature map X ∈ R^(d×H×W), where H and W are the height and width of the feature map, d is the number of channels and R denotes the real field; positional encoding is particularly suited to capturing the absolute and relative positions between tampered regions in self-attention. X is then flattened and transposed into a sequence of size n×d, with n = H×W. Three 1×1 convolutions project X into the query, key and value embeddings Q, K, V ∈ R^(d×n), where Q is the query matrix, K the key matrix and V the value matrix. The output of self-attention is a scaled dot product:

SA(Q, K, V) = V·Aᵀ, with A = softmax(QᵀK/√d) ∈ R^(n×n),

where the superscript T denotes transposition; the attention matrix A serves as weights accounting for all interactions between queries and keys.

One more convolution with a 3×3 kernel, stride 1, padding 4 and dilation rate 4 is then performed, followed by batch normalization and ReLU; then a convolution with a 3×3 kernel, stride 1, padding 2 and dilation rate 2, followed by batch normalization and ReLU; then a convolution with a 3×3 kernel, stride 1 and padding 1, followed by batch normalization and ReLU. The output feature map of the first convolution is added to the feature map of the last convolution, giving a final output of size 9×9×512.

The cross-attention module uses cross-attention to filter the non-semantic features between the feature extraction module and the feature decoding module, enabling finer spatial recovery in the feature decoding module and improving the correctness of the predictions; specifically:

A positional encoding is first added to the low-level feature map U ∈ R^(d×2H×2W), where H and W are the height and width of the feature map, d is the number of channels and R denotes the real field. The low-level feature map U is then downsampled once and used as the value matrix of the cross-attention module. A positional encoding is added to the high-level feature map N ∈ R^(2d×H×W), whose channel count is changed to d by a 1×1 convolution before it is used as the query and key matrices.

The attention matrix A ∈ R^(n×n), with n = H×W, is then computed using matrix multiplication and the softmax function. The computed weight values are rescaled by a ReLU activation to obtain the result S, which acts as a filter whose low-magnitude elements indicate noise or irrelevant regions to be suppressed.

The Hadamard product of U and S yields a version of U with the non-semantic features filtered out.

Finally, the result of the filtering operation is concatenated with the high-level feature map N. Through the cross-attention module, more detailed information is retained than with an ordinary skip connection, improving detection performance.

The feature decoding module decodes the feature maps carrying both global and local information and predicts the tampered region at pixel level. Its structure mirrors that of the symmetric feature extraction module; specifically:

The 9×9×512 feature map produced by the self-attention U-shaped block is upsampled and concatenated with the output feature map of the cross-attention module to give an 18×18×1024 feature map, which passes through residual U-shaped block-4F: two convolutions with 3×3 kernels, stride 1 and padding 1; then one convolution with a 3×3 kernel, stride 1, padding 2 and dilation rate 2; then one with padding 4 and dilation rate 4; then one with padding 8 and dilation rate 8; then one with padding 4 and dilation rate 4; then one with padding 2 and dilation rate 2; then one convolution with a 3×3 kernel, stride 1 and padding 1; each convolution is followed by batch normalization and ReLU. The output of the first convolution is added to the feature map of the last convolution, giving a final output of size 18×18×512.

The 18×18×512 feature map output by block-4F is upsampled and concatenated with the cross-attention output to give a 36×36×1024 feature map, which passes through residual U-shaped block-4: two convolutions with 3×3 kernels, stride 1 and padding 1, each followed by batch normalization and ReLU; then two convolutions with 3×3 kernels, stride 1 and padding 1 plus max-pooling downsampling, each followed by batch normalization and ReLU; then one convolution with a 3×3 kernel, stride 1, padding 2 and dilation rate 2, followed by batch normalization and ReLU; then one convolution with a 3×3 kernel, stride 1 and padding 1, followed by batch normalization and ReLU; then two convolutions with 3×3 kernels, stride 1 and padding 1 plus upsampling, each followed by batch normalization and ReLU. The output of the first convolution is added to the feature map of the last convolution, giving a final output of size 36×36×256.

The 36×36×256 feature map output by block-4 is upsampled and concatenated with the cross-attention output to give a 72×72×512 feature map, which passes through residual U-shaped block-5: two convolutions with 3×3 kernels, stride 1 and padding 1, each followed by batch normalization and ReLU; then three convolutions with 3×3 kernels, stride 1 and padding 1 plus max-pooling downsampling, each followed by batch normalization and ReLU; then one convolution with a 3×3 kernel, stride 1, padding 2 and dilation rate 2, followed by batch normalization and ReLU; then one convolution with a 3×3 kernel, stride 1 and padding 1, followed by batch normalization and ReLU; then three convolutions with 3×3 kernels, stride 1 and padding 1 plus upsampling, each followed by batch normalization and ReLU. The output of the first convolution is added to the feature map of the last convolution, giving a final output of size 72×72×128.

The 72×72×128 feature map output by block-5 is upsampled and concatenated with the cross-attention output to give a 144×144×256 feature map, which passes through residual U-shaped block-6: two convolutions with 3×3 kernels, stride 1 and padding 1, each followed by batch normalization and ReLU; then four convolutions with 3×3 kernels, stride 1 and padding 1 plus max-pooling downsampling, each followed by batch normalization and ReLU; then one convolution with a 3×3 kernel, stride 1, padding 2 and dilation rate 2, followed by batch normalization and ReLU; then one convolution with a 3×3 kernel, stride 1 and padding 1, followed by batch normalization and ReLU; then four convolutions with 3×3 kernels, stride 1 and padding 1 plus upsampling, each followed by batch normalization and ReLU. The output of the first convolution is added to the feature map of the last convolution, giving a final output of size 144×144×64.

The 144×144×64 feature map output by block-6 is upsampled and concatenated with the cross-attention output to give a 288×288×128 feature map, which passes through residual U-shaped block-7: two convolutions with 3×3 kernels, stride 1 and padding 1, each followed by batch normalization and ReLU; then five convolutions with 3×3 kernels, stride 1 and padding 1 plus max-pooling downsampling, each followed by batch normalization and ReLU; then one convolution with a 3×3 kernel, stride 1, padding 2 and dilation rate 2, followed by batch normalization and ReLU; then one convolution with a 3×3 kernel, stride 1 and padding 1, followed by batch normalization and ReLU; then five convolutions with 3×3 kernels, stride 1 and padding 1 plus upsampling, each followed by batch normalization and ReLU. The output of the first convolution is added to the feature map of the last convolution, giving a final output of size 288×288×64.

The 288×288×64 feature map output by residual U-shaped block-7 passes through a 3×3 convolution layer to give a 288×288×1 feature map, and then through a Sigmoid activation to obtain a single-channel tampering-probability mask of size 288×288×1. The whole network is optimized by minimizing the cross-entropy loss with the stochastic gradient descent algorithm:

L = −Σ_(r,c) [ y(r,c)·log p(r,c) + (1 − y(r,c))·log(1 − p(r,c)) ],

where (r,c) are pixel coordinates and p(r,c) and y(r,c) denote the predicted pixel value for the input image and the ground-truth pixel value, respectively.
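The decoder's cascade of upsample-concatenate-RSU steps can be sketched as follows, reusing the hypothetical RSU classes from the encoder sketch above; the channel widths in the comments restate the sizes given in the text, and filtered_skips stands for the encoder features after the cross-attention filter.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    """Decoder cascade (sketch): at each stage the deeper feature map is
    upsampled, concatenated with the cross-attention-filtered skip feature,
    and passed through the matching residual U-shaped block."""
    def __init__(self):
        super().__init__()
        self.blocks = nn.ModuleList([
            RSU4F(1024, 256, 512),  # 18x18x1024  -> 18x18x512
            RSU4(1024, 128, 256),   # 36x36x1024  -> 36x36x256
            RSU5(512, 64, 128),     # 72x72x512   -> 72x72x128
            RSU6(256, 32, 64),      # 144x144x256 -> 144x144x64
            RSU7(128, 16, 64),      # 288x288x128 -> 288x288x64
        ])
        self.head = nn.Conv2d(64, 1, 3, padding=1)  # 3x3 conv -> 1 channel

    def forward(self, x, filtered_skips):
        # filtered_skips: cross-attention-filtered encoder features, deepest first
        for block, skip in zip(self.blocks, filtered_skips):
            x = F.interpolate(x, scale_factor=2, mode='bilinear',
                              align_corners=False)   # upsample
            x = block(torch.cat([x, skip], dim=1))   # concatenate + RSU block
        return torch.sigmoid(self.head(x))           # 288x288x1 probability mask
```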

Step (3): Test the test-set images with the trained network model to obtain the final detection results; the accuracy reaches about 81%.

The tests used an Intel i7-9700K CPU, 32 GB of memory and a GTX 3060 GPU, with Python 3.8 and PyTorch 1.8.1. During training, the learning rate was 0.0001 and the weight decay was set to 0; the final trained model was obtained after 150 training iterations. 85% of the CASIA and COLUMBIA datasets was used as the training set to train the hybrid-transformer-based neural network, and the remaining 15% as the test set to measure its detection accuracy. The method of the invention was compared with the existing methods ADQ, CFA, ELA, NOI, RRU-Net, U-Net, ManTra-Net and U2-Net. The evaluation metrics were Precision, Recall and F1 (the balance of precision and recall); for all metrics, higher is better, with values closer to 1 being better. The table below compares the tampered-region localization results of the different methods on the different datasets.
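A minimal sketch of how these pixel-level metrics can be computed from a predicted mask; the 0.5 binarization threshold is an assumption.

```python
import torch

def precision_recall_f1(pred, gt, thresh=0.5):
    """Pixel-level Precision/Recall/F1 between a predicted probability
    mask and a binary ground-truth mask (sketch; 0.5 threshold assumed)."""
    p = (pred >= thresh).float()
    tp = (p * gt).sum()                        # true-positive pixels
    precision = tp / p.sum().clamp(min=1)
    recall = tp / gt.sum().clamp(min=1)
    f1 = 2 * precision * recall / (precision + recall).clamp(min=1e-8)
    return precision.item(), recall.item(), f1.item()
```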

[Table omitted: tampered-region localization results of the compared methods on the CASIA and COLUMBIA datasets]

It can be seen that the method achieves the highest precision and F1 value on both the CASIA and COLUMBIA datasets. A further subjective comparison shows that the localized regions of the invention are more complete and the localized edges more accurate, demonstrating the invention's effectiveness against splicing-type tampering.

On the CASIA dataset, as the variance of the added Gaussian noise increases, the precision and F1 of the invention remain far better than those of the other six detection methods; its recall is slightly lower than RRU-Net's, and its detection metrics are the least affected. The experiments show that the invention is robust to noise attacks on the dataset. Moreover, as the JPEG quality factor decreases from 100 to 50, the precision and F1 of the invention remain better than those of the other methods, showing robustness to compression attacks as well.

The above embodiments should be understood as merely illustrating the present invention, not limiting its scope of protection. Any simple modification, equivalent change or refinement made to the above embodiments according to the technical essence of the present invention, without departing from the content of its technical solution, still falls within the scope of the technical solution of the present invention.

Claims (5)

1.一种拼接篡改图像的检测方法,其特征在于:1. A method for detecting spliced tampered images, characterized in that: 步骤(1)步骤(1)搜集并下载公共图像篡改数据集,包括CASIA数据集和COLUMBIA数据集;取其中的80~90﹪作为训练集,其他作为测试集;Step (1) Collect and download public image tampering datasets, including the CASIA dataset and the COLUMBIA dataset; take 80% to 90% of them as training sets and the rest as test sets; 步骤(2)使用训练集对混合transformer神经网络进行模型训练,包括:特征提取模块、自注意力U形块、交叉注意力模块、特征解码模块;Step (2) using the training set to perform model training on the hybrid transformer neural network, including: feature extraction module, self-attention U-shaped block, cross attention module, feature decoding module; 所述的特征提取模块,使用U2-Net的编码器部分提取输入图像的特征图X;The feature extraction module uses the encoder part of U 2 -Net to extract the feature map X of the input image; 所述的自注意力U形块,使用self-attention建模输入图像的长程依赖性以及提取输入图像的全局信息;The self-attention U-shaped block uses self-attention to model the long-range dependency of the input image and extract the global information of the input image; 所述的交叉注意力模块,使用交叉注意力过滤特征提取模块和特征解码模块之间的非语义特征;The cross attention module uses cross attention to filter non-semantic features between the feature extraction module and the feature decoding module; 所述的特征解码模块,将具有全局信息以及局部信息的的特征图解码,并在像素级预测篡改区域;The feature decoding module decodes the feature map with global information and local information, and predicts the tampered area at the pixel level; 步骤(3)使用训练好的网络模型对测试集图像进行测试,得到最终的检测效果。Step (3) uses the trained network model to test the test set images to obtain the final detection effect. 2.如权利要求1所述的一种拼接篡改图像的检测方法,其特征在于,所述的特征提取模块首先将尺寸为288×288×3的输入图像送入残差U形块-7,进行两次卷积核大小为3×3、步长为1、填充为1的卷积,批归一化数据流层norm,激活函数Relu;再进行五次卷积核大小为3×3、步长为1、填充为1的卷积加最大值池化下采样,批归一化数据流层norm,激活函数Relu;再进行一次卷积核大小为3×3、步长为1、填充为2、空洞率为2的卷积,批归一化数据流层norm,激活函数Relu;再进行一次卷积核大小为3×3、步长为1、填充为1的卷积,批归一化数据流层norm,激活函数Relu;再进行五次卷积核大小为3×3、步长为1、填充为1的卷积加上采样,批归一化数据流层norm,激活函数Relu;再把经过第一次卷积的输出特征图和经过最后一次卷积的特征图相加,最终得到64幅特征图,从残差U形块-7输出的特征图进行一次最大值池化下采样的操作将特征图分辨率降为原来的一半;2. A detection method for spliced tampered images as described in claim 1, characterized in that the feature extraction module first feeds the input image of size 288×288×3 into the residual U-shaped block-7, performs two convolutions with a convolution kernel size of 3×3, a step size of 1, and a padding of 1, batch normalization of the data flow layer norm, and activation function Relu; then performs five convolutions with a convolution kernel size of 3×3, a step size of 1, and a padding of 1 plus maximum pooling downsampling, batch normalization of the data flow layer norm, and activation function Relu; then performs one convolution with a convolution kernel size of 3×3, a step size of 1, a padding of 2, and a void rate of 2, Batch normalize the data flow layer norm, activation function Relu; perform another convolution with a kernel size of 3×3, a stride of 1, and a padding of 1, batch normalize the data flow layer norm, and activate the function Relu; perform five more convolutions with a kernel size of 3×3, a stride of 1, and a padding of 1 plus sampling, batch normalize the data flow layer norm, and activate the function Relu; then add the output feature map of the first convolution and the feature map of the last convolution, and finally get 64 feature maps. 
Perform a maximum pooling downsampling operation on the feature map output by the residual U-shaped block-7 to reduce the feature map resolution to half of the original; 然后将特征图依次送入残差U形块-6、残差U形块-5、残差U形块-4、残差U形块-4F,最终输出尺寸为9×9×512特征图。The feature map is then sent to residual U-shaped block-6, residual U-shaped block-5, residual U-shaped block-4, and residual U-shaped block-4F in sequence, and the final output size is a 9×9×512 feature map. 3.如权利要求1所述的一种拼接篡改图像的检测方法,其特征在于,所述的自注意力U形块具体是:将尺寸为9×9×512特征图作为输入进行两次卷积核大小为3×3、步长为1、填充为1的卷积,批归一化数据流层norm,激活函数Relu;再进行一次卷积核大小为3×3、步长为1、填充为2、空洞率为2的卷积,批归一化数据流层norm,激活函数Relu;再经过一次卷积核大小为3×3、步长为1、填充为4、空洞率为4的卷积,批归一化数据流层norm,激活函数Relu;再进行一次卷积核大小为3×3、步长为1、填充为8、空洞率为8的卷积,批归一化数据流层norm,激活函数Relu;3. A detection method for splicing tampered images as described in claim 1, characterized in that the self-attention U-shaped block specifically comprises: taking a feature map of size 9×9×512 as input, performing two convolutions with a convolution kernel size of 3×3, a step size of 1, and a padding of 1, batch normalization of the data flow layer norm, and activation function Relu; performing another convolution with a convolution kernel size of 3×3, a step size of 1, a padding of 2, and a void rate of 2, batch normalization of the data flow layer norm, and activation function Relu; performing another convolution with a convolution kernel size of 3×3, a step size of 1, a padding of 4, and a void rate of 4, batch normalization of the data flow layer norm, and activation function Relu; performing another convolution with a convolution kernel size of 3×3, a step size of 1, a padding of 8, and a void rate of 8, batch normalization of the data flow layer norm, and activation function Relu; 然后将特征图送入自注意力模块,在输入特征图X∈Rd×H×W上添加位置编码,H和分别W是特征图的高度和宽度,d是通道数,R表示实数域;然后将X展平并转置为大小为n×d的序列,参数n=H×W;使用三个1×1卷积投影X进行查询、密钥和值嵌入:The feature map is then fed into the self-attention module, which adds positional encoding to the input feature map X∈Rd ×H×W , where H and W are the height and width of the feature map, d is the number of channels, and R represents the real domain; X is then flattened and transposed into a sequence of size n×d, with parameter n=H×W; three 1×1 convolutions are used to project X for query, key, and value embedding: Q,K,V∈Rd×n,Q表示查询矩阵,K表示键矩阵,V表示值矩阵;自注意力的输出为一个缩放点积:
3. The method for detecting spliced tampered images according to claim 1, characterized in that the self-attention U-shaped block specifically: takes the 9×9×512 feature map as input and performs two convolutions with kernel size 3×3, stride 1 and padding 1, each with batch normalization and ReLU; then one convolution with kernel size 3×3, stride 1, padding 2 and dilation rate 2, with batch normalization and ReLU; then one convolution with kernel size 3×3, stride 1, padding 4 and dilation rate 4, with batch normalization and ReLU; then one convolution with kernel size 3×3, stride 1, padding 8 and dilation rate 8, with batch normalization and ReLU;
the feature map is then fed into the self-attention module: positional encoding is added to the input feature map X∈R^(d×H×W), where H and W are the height and width of the feature map, d is the number of channels and R denotes the real field; X is flattened and transposed into a sequence of size n×d, with n = H×W; three 1×1 convolutions project X into the query, key and value embeddings Q, K, V∈R^(d×n), where Q is the query matrix, K the key matrix and V the value matrix; the output of self-attention is the scaled dot product:
SA(X) = V·Aᵀ,  A = softmax(QᵀK / √d)
where A∈R^(n×n) is the attention matrix and the superscript T denotes transposition;
then one convolution with kernel size 3×3, stride 1, padding 4 and dilation rate 4, with batch normalization and ReLU; then one convolution with kernel size 3×3, stride 1, padding 2 and dilation rate 2, with batch normalization and ReLU; then one convolution with kernel size 3×3, stride 1 and padding 1, with batch normalization and ReLU; finally, the output feature map of the first convolution is added to the feature map of the last convolution, producing an output feature map of size 9×9×512.
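A sketch of the self-attention step of claim 3: 1×1-convolution projections of the flattened feature map followed by the scaled dot product above. The learned positional encoding is an assumed implementation detail; the claim does not specify its form.

```python
import torch
import torch.nn as nn

class SelfAttention2d(nn.Module):
    def __init__(self, d=512, h=9, w=9):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(1, d, h, w))  # positional encoding
        self.q = nn.Conv2d(d, d, 1)                       # 1x1 query projection
        self.k = nn.Conv2d(d, d, 1)                       # 1x1 key projection
        self.v = nn.Conv2d(d, d, 1)                       # 1x1 value projection

    def forward(self, x):                                 # x: B x d x H x W
        b, d, hgt, wid = x.shape
        x = x + self.pos
        q = self.q(x).flatten(2)                          # B x d x n, n = H*W
        k = self.k(x).flatten(2)
        v = self.v(x).flatten(2)
        # A = softmax(Q^T K / sqrt(d)), shape B x n x n
        attn = torch.softmax(q.transpose(1, 2) @ k / d ** 0.5, dim=-1)
        out = v @ attn.transpose(1, 2)                    # SA(X) = V A^T, B x d x n
        return out.view(b, d, hgt, wid)
```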
4. The method for detecting spliced tampered images according to claim 1, characterized in that the cross-attention module specifically: first adds positional encoding to the low-level feature map U∈R^(d×2H×2W), where H and W are the height and width of the high-level feature map, d is the number of channels and R denotes the real field; the low-level feature map U is then downsampled once and used as the value matrix of the cross-attention module; positional encoding is added to the high-level feature map N∈R^(2d×H×W), and after a 1×1 convolution changes its number of channels to d it serves as the query matrix and the key matrix; the attention matrix A∈R^(n×n), with n = H×W, is then computed using matrix multiplication and the softmax function; the computed weights are rescaled by a ReLU activation to obtain the result S, used as a filter in which low-magnitude elements indicate noise or irrelevant regions to be suppressed; the Hadamard product of U and S yields a version of U with the non-semantic features filtered out; finally, the result of the filtering operation is concatenated with the high-level feature map N.
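A sketch of the cross-attention filtering of claim 4. The use of average pooling for the one-step downsampling of U, and of bilinear interpolation to bring the filter S back to U's resolution before the Hadamard product, are assumptions; the claim fixes only the query/key/value roles, the ReLU rescaling, the Hadamard product and the final concatenation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttentionFilter(nn.Module):
    def __init__(self, d=64):
        super().__init__()
        self.to_q = nn.Conv2d(2 * d, d, 1)  # 1x1 conv: high-level N -> query
        self.to_k = nn.Conv2d(2 * d, d, 1)  # 1x1 conv: high-level N -> key
        self.d = d

    def forward(self, u, n_feat):
        # u: B x d x 2H x 2W (low-level); n_feat: B x 2d x H x W (high-level).
        # Positional encodings are omitted here for brevity.
        b, d, H2, W2 = u.shape
        H, W = H2 // 2, W2 // 2
        v = F.avg_pool2d(u, 2).flatten(2)                 # value: B x d x n
        q = self.to_q(n_feat).flatten(2)                  # B x d x n
        k = self.to_k(n_feat).flatten(2)                  # B x d x n
        a = torch.softmax(q.transpose(1, 2) @ k / self.d ** 0.5, dim=-1)
        s = F.relu(v @ a.transpose(1, 2))                 # filter S: B x d x n
        s = F.interpolate(s.view(b, d, H, W), scale_factor=2,
                          mode='bilinear', align_corners=False)
        filtered = u * s                                  # Hadamard product on U
        n_up = F.interpolate(n_feat, scale_factor=2,
                             mode='bilinear', align_corners=False)
        return torch.cat([filtered, n_up], dim=1)         # concatenate with N
```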
5. The method for detecting spliced tampered images according to claim 1, characterized in that the feature decoding module upsamples the feature map obtained from the self-attention U-shaped block and concatenates it with the output feature map of the cross-attention module, and the resulting feature map is fed into residual U-shaped block-4F; the output feature map of residual U-shaped block-4F is upsampled and concatenated with the output feature map of the cross-attention module, and the result is fed into residual U-shaped block-4; the output feature map of residual U-shaped block-4 is upsampled and concatenated with the output feature map of the cross-attention module, and the result is fed into residual U-shaped block-5; the output feature map of residual U-shaped block-5 is upsampled and concatenated with the output feature map of the cross-attention module, and the result is fed into residual U-shaped block-6; the output feature map of residual U-shaped block-6 is upsampled and concatenated with the output feature map of the cross-attention module, and the result is fed into residual U-shaped block-7; the 288×288×64 feature map output by residual U-shaped block-7 passes through a 3×3 convolution layer to give a feature map of size 288×288×1, and then through a Sigmoid activation function to obtain a single-channel tampering-probability mask of size 288×288×1; the entire neural network optimizes the predicted result by minimizing a cross-entropy loss function with the stochastic gradient descent optimization algorithm, the cross-entropy loss function being:
L = −Σ_(r,c) [ y(r,c)·log p(r,c) + (1 − y(r,c))·log(1 − p(r,c)) ]
where (r,c) are pixel coordinates, and p(r,c) and y(r,c) denote the predicted pixel value of the input image and the ground-truth pixel value, respectively.
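The loss above is the per-pixel binary cross entropy; a minimal sketch, assuming summation over all pixel coordinates (r, c):

```python
import torch

def splice_bce(pred, target, eps=1e-7):
    # pred: predicted tampering-probability mask in (0, 1); target: 0/1 mask.
    pred = pred.clamp(eps, 1 - eps)  # guard the logarithms
    return -(target * pred.log() + (1 - target) * (1 - pred).log()).sum()

pred = torch.rand(1, 1, 288, 288)                   # e.g. a Sigmoid output
mask = (torch.rand(1, 1, 288, 288) > 0.5).float()
loss = splice_bce(pred, mask)
# Matches torch.nn.functional.binary_cross_entropy(pred, mask, reduction='sum').
```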
CN202211715677.4A 2022-12-29 2022-12-29 Detection method for spliced tampered images Active CN116091429B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211715677.4A CN116091429B (en) 2022-12-29 2022-12-29 Detection method for spliced tampered images


Publications (2)

Publication Number Publication Date
CN116091429A true CN116091429A (en) 2023-05-09
CN116091429B CN116091429B (en) 2025-06-27

Family

ID=86201898

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211715677.4A Active CN116091429B (en) 2022-12-29 2022-12-29 Detection method for spliced tampered images

Country Status (1)

Country Link
CN (1) CN116091429B (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113129261A (en) * 2021-03-11 2021-07-16 Chongqing University of Posts and Telecommunications Image tampering detection method based on dual-stream convolutional neural network
CN114998261A (en) * 2022-06-02 2022-09-02 South China Agricultural University A dual-stream U-Net image forgery detection network system and its image forgery detection method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
CAIPING YAN et al.: "TransU2-Net: A Hybrid Transformer Architecture for Image Splicing Forgery Detection", IEEE Access, vol. 11, 3 April 2023, pages 33313-33323 *
LI SHUYUAN: "Research on Image Splicing Tampering Detection Based on Deep Learning", CNKI China Excellent Master's Theses Full-text Database (Information Science and Technology), no. 07, 15 July 2024, pages 138-163 *
LI SHUYUAN et al.: "Hybrid Transformer Network for Image Tampering Detection", Journal of Computer-Aided Design & Computer Graphics, vol. 36, no. 12, 19 November 2024, pages 2010-2019 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116778307A (en) * 2023-08-18 2023-09-19 Beihang University Image tampering detection method based on uncertainty guidance
CN116778307B (en) * 2023-08-18 2023-12-12 Beihang University An image tampering detection method based on uncertainty guidance
CN117689987A (en) * 2023-12-18 2024-03-12 Southwest Petroleum University An image tampering detection method that combines dynamic attention mechanism and residual noise analysis
CN118172787A (en) * 2024-05-09 2024-06-11 Nanchang Hangkong University Lightweight document layout analysis method
CN118941445A (en) * 2024-07-24 2024-11-12 Beijing Electronic Science and Technology Institute A constrained image stitching detection method, device and medium based on key depth feature matching
CN118941445B (en) * 2024-07-24 2025-04-18 Beijing Electronic Science and Technology Institute A constrained image stitching detection method, device and medium based on key depth feature matching

Also Published As

Publication number Publication date
CN116091429B (en) 2025-06-27

Similar Documents

Publication Publication Date Title
CN110276316B (en) A human keypoint detection method based on deep learning
CN116091429A (en) A detection method for mosaic tampered images
CN113780149B (en) An efficient method for extracting building targets from remote sensing images based on attention mechanism
CN115496928B (en) Multi-modal image feature matching method based on multi-feature matching
CN110929080B (en) An Optical Remote Sensing Image Retrieval Method Based on Attention and Generative Adversarial Networks
CN112819011B (en) Method and device for identifying relationship between objects and electronic system
CN104067272A (en) Method and device for image processing
CN114387641A (en) False video detection method and system based on multi-scale convolutional network and ViT
CN114926742B (en) A loop detection and optimization method based on second-order attention mechanism
CN112818790A (en) Pedestrian re-identification method based on attention mechanism and space geometric constraint
CN116363526B (en) MROCNet model construction and multi-source remote sensing image change detection method and system
CN117078942B (en) Context-aware refereed image segmentation method, system, device and storage medium
CN114842351A (en) Remote sensing image semantic change detection method based on twin transforms
CN112580480A (en) Hyperspectral remote sensing image classification method and device
Yan et al. Transu 2-net: A hybrid transformer architecture for image splicing forgery detection
CN115984700A (en) Remote sensing image change detection method based on improved Transformer twin network
CN116595551A (en) Bank transaction data management method and system
CN114419406A (en) Image change detection method, training method, device and computer equipment
CN113112464B (en) RGBD salient object detection method and system based on cross-modal AC encoder
CN115984714A (en) Cloud detection method based on double-branch network model
CN115564721A (en) Hyperspectral Image Change Detection Method Based on Local Information Enhancement
CN117876842A (en) A method and system for detecting anomalies of industrial products based on generative adversarial networks
CN116665027A (en) High-robustness image tampering detection method
Li et al. Ms-former: Memory-supported transformer for weakly supervised change detection with patch-level annotations
CN119988664A (en) Cross-modal image and text retrieval processing method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant