
CN116664633B - Printed image registration method based on convolutional cross-attention mechanism - Google Patents

Printed image registration method based on convolutional cross-attention mechanism

Info

Publication number
CN116664633B
CN116664633B (granted; application CN202310624605.7A)
Authority
CN
China
Prior art keywords
feature map
size
attention
cross
convolutional
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310624605.7A
Other languages
Chinese (zh)
Other versions
CN116664633A (en)
Inventor
陈亚军
杨茜
余璐
蔺广逢
张二虎
Current Assignee
Xian University of Technology
Original Assignee
Xian University of Technology
Priority date
Filing date
Publication date
Application filed by Xian University of Technology
Priority to CN202310624605.7A
Publication of CN116664633A
Application granted
Publication of CN116664633B
Status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/30 Determination of transform parameters for the alignment of images, i.e. image registration
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 Complex mathematical operations
    • G06F 17/15 Correlation function computation including computation of convolution operations
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/0002 Inspection of images, e.g. flaw detection
    • G06T 7/0004 Industrial image inspection
    • G06T 7/0006 Industrial image inspection using a design-rule based approach
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10004 Still image; Photographic image


Abstract


This invention discloses a printed image registration method based on a convolutional cross-attention mechanism. First, a deep learning registration network is constructed, comprising a convolutional cross-attention mechanism and an upsampling-based deep homography estimation registration network. Then, the reference printed image and the printed image to be registered are input into the network to obtain the registered printed image. Finally, the network parameters are optimized by computing a loss function between the registered image and the reference image, yielding a more accurately registered output. The invention completes the printed image registration task, providing a basis for subsequent defect detection of printed matter and improving defect detection efficiency.

Description

Printed matter image registration method based on convolution cross attention mechanism
Technical Field
The invention belongs to the technical field of deep neural networks and image analysis, and particularly relates to a printed matter image registration method based on a convolution cross attention mechanism.
Background
The printing industry is an important industrial pillar of China's national economy. Books, periodicals, magazines, newspapers, gift boxes, business cards, and the like all belong to the category of printed matter, which is closely tied to daily life. However, during production, defects such as ink splatter, offset, color difference, and cutting deviation are unavoidable due to external factors or internal equipment. In printed-matter defect detection, the foremost task is registration: two printed images are aligned geometrically so that the reference printed image and the printed image to be registered coincide in spatial position, and defect detection is then performed. The quality of the registration determines the accuracy of the defect detection, so the problem has significant research value and practical importance.
The continuing development of deep learning provides new ideas for printed image registration. Deep-learning-based registration is not limited to feature extraction; a neural network can also estimate the geometric transformation between images for alignment. An unsupervised deep homography estimation model does not depend on ground-truth labels: the network is trained by optimizing a similarity measure between the registered image and the reference image. Such a method both learns features and estimates the homography, and registers well under large displacement and illumination variation. Registering two printed images effectively in spatial position provides a basis for subsequent defect detection of printed matter and improves defect detection efficiency.
Disclosure of Invention
The invention aims to provide a printed image registration method based on a convolutional cross-attention mechanism that completes the printed image registration task, provides a basis for subsequent defect detection of printed matter, and improves defect detection efficiency.
The technical scheme adopted by the invention is a printed image registration method based on a convolutional cross-attention mechanism, implemented according to the following steps:
Step 1, constructing a deep learning registration network, comprising a convolutional cross-attention mechanism and an upsampling-based deep homography estimation registration network;
Step 2, inputting a patch p_B of the reference printed image and a patch p_A of the printed image to be registered into the deep learning registration network, predicting the four corner offsets H'_4pt of p_A relative to the four corners on the reference printed image p_B, and obtaining the transformation matrix H' by direct linear transformation (DLT);
Step 3, applying a spatial transformation with the matrix H' to the printed image A to be registered, obtaining the registered printed image;
Step 4, optimizing the network parameters by computing a loss function between the registered printed image and the reference printed image, and outputting a more accurately registered printed image.
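The DLT stage of step 2 can be sketched in a few lines of NumPy. This is a minimal illustration, not the patent's implementation: the 128×128 patch size, corner coordinates, and offset values below are hypothetical stand-ins for the predicted H'_4pt.

```python
import numpy as np

def dlt_homography(src_pts, dst_pts):
    """Solve the 3x3 homography H mapping src_pts -> dst_pts from exactly
    four point correspondences via direct linear transformation (DLT)."""
    A, b = [], []
    for (x, y), (u, v) in zip(src_pts, dst_pts):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y]); b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y]); b.append(v)
    h = np.linalg.solve(np.array(A, float), np.array(b, float))
    return np.append(h, 1.0).reshape(3, 3)   # fix h33 = 1

def warp_point(H, p):
    """Apply homography H to a 2-D point (homogeneous divide)."""
    q = H @ np.array([p[0], p[1], 1.0])
    return q[:2] / q[2]

# Hypothetical patch corners and predicted per-corner offsets (8 values).
corners = np.array([[0, 0], [127, 0], [127, 127], [0, 127]], float)
offsets = np.array([[3, -2], [-1, 4], [2, 2], [0, -3]], float)
H = dlt_homography(corners, corners + offsets)
```

Warping the four corners with the recovered H reproduces the corner offsets, which is exactly the consistency the network's 8-value output relies on.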
The present invention is also characterized in that,
The convolutional cross-attention mechanism in step 1 is implemented according to the following steps:
Step 1.1, two tensors X_1, X_2 ∈ R^(H×W×C) are input, where H is the height of the input feature map, W the width, and C the number of channels; both X_1 and X_2 have size 64×64×32;
Step 1.2, to preserve translation equivariance in the image processing, existing relative position encodings are extended to two dimensions: width and height information are embedded into the relative positions of the cross-attention, realizing two-dimensional relative cross-attention. The attention of pixel i = (i_x, i_y) to pixel j = (j_x, j_y) is computed as formula (1):
l_{i,j} = (q_i^T / sqrt(d_k)) (k_j + r^W_{j_x - i_x} + r^H_{j_y - i_y})   (1)
where l_{i,j} denotes the attention of pixel i = (i_x, i_y) to pixel j = (j_x, j_y), q_i^T is the transpose of the query vector of pixel i, d_k is the depth of the keys, k_j is the key vector of pixel j, and r^W_{j_x - i_x} and r^H_{j_y - i_y} are learned embeddings of the relative width j_x - i_x and relative height j_y - i_y;
Step 1.3, the output of the two-dimensional single-head cross-attention is expressed as formula (2):
O_h = Softmax( ((X_1 W_Q)(X_2 W_K)^T + S_rel^H + S_rel^W) / sqrt(d_k) ) (X_2 W_V)   (2)
where O_h is the output of the two-dimensional single-head cross-attention, Softmax(·) denotes normalization, W_Q is the query weight matrix, W_K the key weight matrix, and W_V the value weight matrix, S_rel^H and S_rel^W are the logit matrices of relative positions along height and width, X_1 and X_2 are the tensor forms of feature maps 1 and 2, and d_k is the depth of the keys.
Step 1.4, multi-head attention is formed by concatenating the single-head outputs, as shown in formula (3):
MHA(X) = Concat[O_1, ..., O_{N_h}] W^O   (3)
where MHA(X) is the multi-head attention tensor of shape (H, W, d_v), Concat[·] denotes concatenation, O_1, ..., O_{N_h} are the single-head attention outputs, and W^O is a learned output projection matrix;
Step 1.5, the convolutional feature map and the multi-head cross-attention feature map are concatenated to obtain the convolutional cross-attention, which can be written as formula (4):
AAConv(X) = Concat[Conv(X), MHA(X)]   (4)
where AAConv(X) denotes the convolutional cross-attention, Concat[·] denotes concatenation, Conv(X) denotes the convolution output, and MHA(X) is the multi-head attention tensor of shape (H, W, d_v);
Step 1.6, batch normalization is applied to the convolutional cross-attention output, yielding a feature map X'_1 of size 64×64×32 that fuses the features of X_2, and X_1 is replaced by X'_1.
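Formulas (1) and (2) can be exercised with a toy single-head example. This is a sketch under stated assumptions, not the patent's network: sizes are small stand-ins for the 64×64×32 maps, all weights and embeddings are random rather than learned, and the batch normalization and convolution branch of AAConv are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C = 4, 4, 8          # toy feature-map height, width, channels
d_k = d_v = 8              # key/value depths
N = H * W

X1 = rng.standard_normal((N, C))   # flattened feature map 1 (queries)
X2 = rng.standard_normal((N, C))   # flattened feature map 2 (keys, values)
W_Q, W_K, W_V = (rng.standard_normal((C, d)) for d in (d_k, d_k, d_v))
r_w = rng.standard_normal((2 * W - 1, d_k))  # relative-width embeddings r^W
r_h = rng.standard_normal((2 * H - 1, d_k))  # relative-height embeddings r^H

Q, K, V = X1 @ W_Q, X2 @ W_K, X2 @ W_V
ys, xs = np.divmod(np.arange(N), W)          # 2-D coordinates of each pixel

# Formula (1): l_{i,j} = q_i . (k_j + r^W_{jx-ix} + r^H_{jy-iy}) / sqrt(d_k)
logits = Q @ K.T
for i in range(N):
    for j in range(N):
        logits[i, j] += Q[i] @ r_w[xs[j] - xs[i] + W - 1]
        logits[i, j] += Q[i] @ r_h[ys[j] - ys[i] + H - 1]
logits /= np.sqrt(d_k)

# Formula (2): softmax over j, then weight the values of feature map 2.
attn = np.exp(logits - logits.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)
O_h = attn @ V             # single-head cross-attention output, shape (N, d_v)
```

The key-value query runs from X_1 into X_2, which is how the mechanism fuses the features of the second map into the first.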
In step 1, the upsampling-based deep homography estimation registration network is implemented according to the following steps:
Step a, the feature map tensors X_1 and X_2 obtained in step 1 are concatenated to obtain a single feature map of size 64×64×64;
Step b, a transformation operation is performed on the 64×64×64 feature map;
Step c, a transformation operation is performed on the 32×32×128 feature map;
Step d, the feature map obtained in step c is fed to fully connected layer Linear1, with an input feature vector of size 16×16×256 and an output feature of size 1024;
Step e, the 1024-dimensional feature from step d is fed to fully connected layer Linear2, with input size 1024 and an output feature vector of size 8.
Step b is specifically performed according to the following steps:
Step b1, the 64×64×64 feature map is passed through a 3×3 convolution with output channels 96 and padding 1, giving a feature map X_a of size 64×64×96;
Step b2, the 64×64×64 feature map is passed through 3×3 upsampling with output channels 32 and padding 1, giving a feature map X_b of size 64×64×32;
Step b3, X_a and X_b are concatenated and activated with LeakyReLU (negative slope 0.2), giving a feature map of size 64×64×128;
Step b4, the feature map from step b3 is passed through a 3×3 convolution with output channels 128 and padding 1, activated with LeakyReLU (negative slope 0.2), giving a feature map of size 64×64×128;
Step b5, the feature map from step b4 is max-pooled with kernel 2, finally giving a feature map of size 32×32×128.
Step c is specifically performed according to the following steps:
Step c1, the 32×32×128 feature map is passed through a 3×3 convolution with output channels 192 and padding 1, giving a feature map X_a' of size 32×32×192;
Step c2, the 32×32×128 feature map is passed through 3×3 upsampling with output channels 64 and padding 1, giving a feature map X_b' of size 32×32×64;
Step c3, X_a' and X_b' are concatenated and activated with LeakyReLU (negative slope 0.2), giving a feature map of size 32×32×256;
Step c4, the feature map from step c3 is passed through a 3×3 convolution with output channels 256 and padding 1, activated with LeakyReLU (negative slope 0.2), giving a feature map of size 32×32×256;
Step c5, the feature map from step c4 is max-pooled with kernel 2, finally giving a feature map of size 16×16×256.
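The shape arithmetic of steps a through e can be verified with a small trace. Only the sizes are checked here; the channel counts (96+32, 192+64) and pooling factors are taken from the steps above, while everything else about the layers is abstracted away.

```python
def registration_head_shapes(h=64, w=64, c=64):
    """Trace feature-map sizes through steps a-e: two conv/upsample
    blocks (steps b and c), then Linear1 and Linear2."""
    shapes = [(h, w, c)]                      # step a: concatenated X_1, X_2
    for conv_ch, up_ch in [(96, 32), (192, 64)]:
        c = conv_ch + up_ch                   # concat conv and upsample branches
        h, w = h // 2, w // 2                 # max pooling with kernel 2
        shapes.append((h, w, c))
    shapes.append((1024,))                    # Linear1 output
    shapes.append((8,))                       # Linear2: four corner offsets
    return shapes

trace = registration_head_shapes()
```

The trace confirms that the 16×16×256 map entering Linear1 flattens to 65536 values and the final 8-vector matches the four 2-D corner offsets predicted in step 2.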
The beneficial effect of the method is that the printed image registration method based on the convolutional cross-attention mechanism can effectively register two printed images in spatial position, providing a basis for subsequent defect detection of printed matter and improving defect detection efficiency. The invention proposes a convolutional cross-attention mechanism whose key-value query mechanism effectively fuses the feature information of the two images: it matches the salient-region features of the two images and fuses them into a single feature map, consistent with the general flow of image registration. In the proposed method, a parallel network with the cross-attention mechanism first processes the input reference printed image and the printed image to be registered and fuses the features of the two images; the feature maps are then concatenated and passed through convolution and upsampling, where the designed upsampling scheme reduces the loss of features.
Drawings
FIG. 1 is a flow chart of a method of print image registration based on a convolution cross-attention mechanism of the present invention;
FIG. 2 is a schematic diagram of the overall architecture of the printed image registration network based on the convolutional cross-attention mechanism of the present invention;
FIG. 3 shows the printed image to be registered, the reference printed image, and the registration result in an embodiment of the present invention.
Detailed Description
The invention will be described in detail below with reference to the drawings and the detailed description.
Example 1
Referring to FIG. 1, the printed image registration method based on the convolutional cross-attention mechanism specifically comprises the following steps:
Step 1, constructing a deep learning registration network, comprising a convolutional cross-attention mechanism and an upsampling-based deep homography estimation registration network;
Step 1.1, the tensors X_1, X_2 ∈ R^(H×W×C) of the printed image to be registered and the reference printed image are input into a parallel network, where H is the height of the input feature map, W the width, and C the number of channels; X_1 and X_2 have size 128×128×1.
Referring to FIG. 2 and FIG. 3, step 1.2, in the parallel network, two 3×3 convolutions are applied to X_1 and X_2, with output channels 32, stride 1, padding 1, and LeakyReLU activation (negative slope 0.2), giving feature maps of size 128×128×32; max pooling with kernel 2 then gives feature maps of size 64×64×32.
Step 1.3, the 64×64×32 feature maps are input into the convolutional cross-attention module, with key depth d_k = 10, value depth d_v = 1, and number of heads N_h = 1. The module comprises a convolution module with output channels d_v, kernel size 3, stride 1, and padding 1, and an attention module with input and output channels d_v, kernel size 1, and stride 1. The outputs of the convolution module and the attention module are concatenated to obtain a feature map X'_1 of size 64×64×32 fused with the features of X_2, and X_1 is replaced by X'_1;
Step 1.4, the feature map tensor X_1 obtained in step 1.3 and the feature map tensor X_2 input in step 1.1 are concatenated to obtain a single feature map of size 64×64×64.
Step 1.5, the following operations are performed on the 64×64×64 feature map:
(1) a 3×3 convolution with output channels 96 and padding 1 is applied, giving a feature map X_a of size 64×64×96;
(2) 3×3 upsampling with output channels 32 and padding 1 is applied, giving a feature map X_b of size 64×64×32;
(3) X_a and X_b are concatenated and activated with LeakyReLU (negative slope 0.2), giving a feature map of size 64×64×128;
(4) a 3×3 convolution with output channels 128 and padding 1 is applied to the feature map from (3), activated with LeakyReLU (negative slope 0.2), giving a feature map of size 64×64×128;
(5) the feature map from (4) is max-pooled with kernel 2, finally giving a feature map of size 32×32×128.
Step 1.6, the following operations are performed on the 32×32×128 feature map:
(1) a 3×3 convolution with output channels 192 and padding 1 is applied, giving a feature map X_a' of size 32×32×192;
(2) 3×3 upsampling with output channels 64 and padding 1 is applied, giving a feature map X_b' of size 32×32×64;
(3) X_a' and X_b' are concatenated and activated with LeakyReLU (negative slope 0.2), giving a feature map of size 32×32×256;
(4) a 3×3 convolution with output channels 256 and padding 1 is applied to the feature map from (3), activated with LeakyReLU (negative slope 0.2), giving a feature map of size 32×32×256;
(5) the feature map from (4) is max-pooled with kernel 2, finally giving a feature map of size 16×16×256.
Step 1.7, the feature map obtained in step 1.6 is fed to fully connected layer Linear1, with an input feature vector of size 16×16×256 and an output feature of size 1024.
Step 1.8, the 1024-dimensional feature from step 1.7 is fed to fully connected layer Linear2, with input size 1024 and an output feature vector of size 8.
Step 2, the size-8 feature vector obtained in step 1.8 is the output of the deep learning registration network: the four corner offsets H'_4pt of p_A relative to the four corners on the reference printed image p_B. The transformation matrix H' is obtained by direct linear transformation (DLT);
Step 3, a spatial transformation with the matrix H' is applied to the printed image A to be registered to obtain the registered printed image p'_B;
Step 4, the network parameters are optimized by computing a loss function between the registered printed image and the reference printed image, and a more accurately registered printed image is output, as shown in FIG. 3, where red denotes the true perspective transformation and yellow the perspective transformation estimated by the model; the more closely the two coincide, the higher the registration accuracy.
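The patent does not restate its exact loss function. A common choice for unsupervised homography estimation of this kind is a photometric L1 loss between the registered image and the reference; the sketch below illustrates that assumed choice, with the image contents entirely hypothetical.

```python
import numpy as np

def photometric_l1_loss(registered, reference):
    """Mean absolute pixel difference between the registered printed image
    and the reference printed image (an assumed loss, not one specified
    by the patent)."""
    return float(np.mean(np.abs(registered.astype(float)
                                - reference.astype(float))))

ref = np.full((128, 128), 0.5)          # hypothetical reference patch
identical = photometric_l1_loss(ref, ref)
shifted = photometric_l1_loss(ref + 0.1, ref)
```

The loss vanishes for a perfect registration and grows with the residual misalignment, which is the property step 4 relies on when optimizing the network parameters.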
By improving an unsupervised deep homography estimation model for printed image registration, the proposed method based on the convolutional cross-attention mechanism completes the registration task well and obtains registered images, which is significant for printed-matter defect detection and improves defect detection efficiency.
Example 2
The invention discloses a printed matter image registration method based on a convolution cross attention mechanism, wherein a flow chart is shown in fig. 1, and the method is implemented specifically according to the following steps:
Step1, constructing a deep learning registration network, wherein the deep learning registration network comprises a convolution cross attention mechanism and an up-sampling-based depth homography estimation registration network;
the convolution cross-attention mechanism in the step 1 is implemented specifically according to the following steps:
Step 1.1, inputting tensors X 1,X2∈RH×W×C of two given shapes, wherein H represents the height of an input feature diagram, W represents the width of the input feature diagram, C represents the channel number of the input feature diagram, and the sizes of X 1 and X 2 are 64 multiplied by 32;
Step 1.2, in order to ensure that the image processing contains translational isomorphism attributes, the existing relative position codes are expanded to two dimensions, width information and height information are embedded in the relative positions of the cross attention, so that the two-dimensional relative cross attention is realized, and the attention degree of a pixel i= (i x,iy) to a pixel j= (j x,jy) is calculated as a formula (1):
Where l i,j denotes the degree of attention of pixel i= (i x,iy) to pixel j= (j x,jy), Representing a transpose of the pixel i query vector,Representing the depth of key k, k j is the key vector for pixel j,AndRepresenting a relative width j x-ix and a relative height j y-iy;
Step 1.3, the output of the two-dimensional single-head cross-attention is expressed as formula (2):

O_h = Softmax((Q K^T + S^rel_H + S^rel_W) / √d_k) V    (2)

where O_h denotes the output of the two-dimensional single-head cross-attention, Softmax(·) denotes normalization, Q = X1 W_Q, K = X2 W_K and V = X2 W_V, W_Q denotes the query weight, W_K the key weight and W_V the value weight, S^rel_H and S^rel_W are logit matrices encoding the relative positions in height and width, X1 and X2 denote the tensor forms of feature map 1 and feature map 2, and d_k denotes the depth of the keys.
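As an illustrative sketch only (not the patent's implementation), formulas (1) and (2) for a single head can be written in NumPy with explicit loops over pixel pairs. The projection outputs q, k, v, the embedding tables r_w and r_h, and the tiny 4×4 feature maps are random stand-ins for the learned quantities:

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, d_k = 4, 4, 8  # tiny stand-in sizes (the patent uses 64x64 maps)

# Queries come from feature map X1, keys/values from X2 (cross-attention).
q = rng.standard_normal((H, W, d_k))
k = rng.standard_normal((H, W, d_k))
v = rng.standard_normal((H, W, d_k))

# Learned embeddings for every possible relative width/height offset.
r_w = rng.standard_normal((2 * W - 1, d_k))  # indexed by j_x - i_x + W - 1
r_h = rng.standard_normal((2 * H - 1, d_k))  # indexed by j_y - i_y + H - 1

# Formula (1): l_ij = q_i^T (k_j + r^W_{jx-ix} + r^H_{jy-iy}) / sqrt(d_k)
logits = np.empty((H * W, H * W))
for iy in range(H):
    for ix in range(W):
        for jy in range(H):
            for jx in range(W):
                rel = k[jy, jx] + r_w[jx - ix + W - 1] + r_h[jy - iy + H - 1]
                logits[iy * W + ix, jy * W + jx] = q[iy, ix] @ rel / np.sqrt(d_k)

# Formula (2): softmax over all key pixels j, then a weighted sum of values.
weights = np.exp(logits - logits.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)
O_h = (weights @ v.reshape(H * W, d_k)).reshape(H, W, d_k)
```

The quadratic loops are written for clarity; a practical implementation would vectorize the relative-logit computation.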
Step 1.4, multi-head attention is formed by concatenating single-head attentions, as in formula (3):

MHA(X) = Concat[O_1, ..., O_{N_h}] W^O    (3)

where MHA(X) denotes the multi-head attention tensor of shape (H, W, d_v), Concat[·] denotes concatenation, O_1, ..., O_{N_h} denote the single-head attention outputs, and W^O denotes the output weight;
Step 1.5, mapping and concatenating the convolution and multi-head cross-attention feature maps yields the convolutional cross-attention, which can be written as formula (4):

AAConv(X) = Concat[Conv(X), MHA(X)]    (4)

where AAConv(X) denotes the convolutional cross-attention, Concat[·] denotes concatenation, Conv(X) denotes convolution, and MHA(X) denotes the multi-head attention tensor of shape (H, W, d_v);
Step 1.6, batch normalization is applied to the convolutional cross-attention to obtain a feature map X′1 that fuses the features of X2; X′1 has size 64×64×32 and replaces X1.
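A minimal PyTorch sketch of formula (4) plus the batch normalization of step 1.6 is shown below. It is an assumption-laden stand-in, not the patented module: torch's standard `nn.MultiheadAttention` (which has no 2-D relative position encoding) replaces the relative cross-attention branch, and the 16+16 channel split between the two branches is invented for illustration:

```python
import torch
import torch.nn as nn

class AAConvSketch(nn.Module):
    """Formula (4): Concat[Conv(X), MHA(X)] followed by batch normalization.
    torch's MultiheadAttention (absolute positions only) stands in for the
    2-D relative cross-attention branch; the 16+16 channel split is assumed."""
    def __init__(self, c_in=32, c_conv=16, c_attn=16, heads=4):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_conv, 3, padding=1)
        self.proj = nn.Linear(c_in, c_attn)  # project inputs to attention width
        self.attn = nn.MultiheadAttention(c_attn, heads, batch_first=True)
        self.bn = nn.BatchNorm2d(c_conv + c_attn)

    def forward(self, x1, x2):
        b, _, h, w = x1.shape
        q = self.proj(x1.flatten(2).transpose(1, 2))   # queries from X1
        kv = self.proj(x2.flatten(2).transpose(1, 2))  # keys/values from X2
        a, _ = self.attn(q, kv, kv)
        a = a.transpose(1, 2).reshape(b, -1, h, w)
        return self.bn(torch.cat([self.conv(x1), a], dim=1))  # fused X1'

x1 = torch.randn(2, 32, 16, 16)  # 16x16 stand-in for the patent's 64x64 maps
x2 = torch.randn(2, 32, 16, 16)
y = AAConvSketch()(x1, x2)       # same size as X1
```

The output has the same shape as X1, matching step 1.6 where the fused X′1 replaces X1.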
The upsampling-based deep homography estimation registration network in step 1 is implemented according to the following steps:
Step a, the feature map tensors X1 and X2 obtained in step 1 are concatenated to obtain a single feature map of size 64×64×64;
Step b, a transformation operation is performed on the feature map of size 64×64×64;
Step c, a transformation operation is performed on the feature map of size 32×32×128;
Step d, the feature map obtained in step c is input to the fully connected layer Linear1, which takes a feature vector of size 16×16×256 and outputs a feature of size 1024;
Step e, the feature of size 1024 obtained in step d is input to the fully connected layer Linear2, which takes a feature vector of size 1024 and outputs a feature vector of size 8.
Step 2, a patch p_B of the reference printed matter image and a patch p_A of the printed matter image to be registered are input to the deep learning registration network, which predicts the offsets H′_4pt of the four corners of p_A relative to the corresponding corners on the reference patch p_B; a direct linear transformation (DLT) then yields the transformation matrix H′;
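The DLT of step 2 solves a small homogeneous linear system: four corner correspondences give eight equations in the nine entries of H′, whose null space is recovered with an SVD. A NumPy sketch, with made-up corner offsets standing in for the network's prediction:

```python
import numpy as np

def dlt_homography(src, dst):
    """Solve for a 3x3 matrix H with dst ~ H @ src (direct linear transform)
    from 4 point pairs, as used to turn corner offsets into the full H'."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
    # Null space of the 8x9 system via SVD: the last right-singular vector
    # is the solution up to scale.
    _, _, vt = np.linalg.svd(np.asarray(A))
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]

corners = np.array([[0., 0.], [63., 0.], [63., 63.], [0., 63.]])
offsets = np.array([[2., 1.], [-1., 3.], [0., -2.], [1., 1.]])  # stand-in H'_4pt
H = dlt_homography(corners, corners + offsets)
```

Applying H to each source corner (in homogeneous coordinates) reproduces the offset corners exactly, which is a useful sanity check on the solver.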
Step 3, the transformation matrix H′ is used to spatially transform the printed matter image A to be registered, obtaining the registered printed matter image;
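The spatial transformation of step 3 is a backward warp: each output pixel is mapped through the inverse homography into the source image and sampled. A NumPy sketch with nearest-neighbor sampling (a real pipeline would use differentiable bilinear sampling, e.g. a spatial transformer):

```python
import numpy as np

def warp_homography(img, H):
    """Backward-warp img by homography H: for every output pixel, map
    through H^-1 into the source image and sample (nearest neighbor)."""
    h, w = img.shape[:2]
    Hinv = np.linalg.inv(H)
    ys, xs = np.mgrid[0:h, 0:w]
    pts = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])
    src = Hinv @ pts
    # Dehomogenize, round to the nearest source pixel, clamp to the image.
    sx = np.rint(src[0] / src[2]).astype(int).clip(0, w - 1)
    sy = np.rint(src[1] / src[2]).astype(int).clip(0, h - 1)
    return img[sy, sx].reshape(img.shape)

img = np.arange(64 * 64, dtype=float).reshape(64, 64)
same = warp_homography(img, np.eye(3))  # identity H leaves the image unchanged
```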
Step 4, the network parameters are optimized by computing a loss function between the registered printed matter image and the reference printed matter image, and a more accurately registered printed matter image is output.
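Step 4 is an unsupervised objective: the loss compares the warped patch to the reference patch and its gradient is backpropagated through the network. The patent does not name the loss; the sketch below assumes a photometric L1 loss, a common choice in unsupervised homography estimation, with random stand-in tensors:

```python
import torch
import torch.nn.functional as F

# Stand-in tensors: in the real pipeline `warped` is the registered patch
# produced by a differentiable warp, so gradients flow back to the network.
warped = torch.rand(1, 1, 64, 64, requires_grad=True)
reference = torch.rand(1, 1, 64, 64)

loss = F.l1_loss(warped, reference)  # assumed photometric L1 objective
loss.backward()                      # gradients reach the warped patch
```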
Example 3
The invention discloses a printed matter image registration method based on a convolutional cross-attention mechanism, whose flow chart is shown in Fig. 1; the method is implemented according to the following steps:
Step 1, constructing a deep learning registration network, where the deep learning registration network comprises a convolutional cross-attention mechanism and an upsampling-based deep homography estimation registration network;
The upsampling-based deep homography estimation registration network in step 1 is implemented according to the following steps:
Step a, the feature map tensors X1 and X2 obtained in step 1 are concatenated to obtain a single feature map of size 64×64×64;
Step b, a transformation operation is performed on the feature map of size 64×64×64;
Step c, a transformation operation is performed on the feature map of size 32×32×128;
Step d, the feature map obtained in step c is input to the fully connected layer Linear1, which takes a feature vector of size 16×16×256 and outputs a feature of size 1024;
Step e, the feature of size 1024 obtained in step d is input to the fully connected layer Linear2, which takes a feature vector of size 1024 and outputs a feature vector of size 8.
Step b is implemented according to the following steps:
Step b1, a 3×3 convolution with 96 output channels and padding 1 is applied to the feature map of size 64×64×64, giving a feature map X_a of size 64×64×96;
Step b2, 3×3 up-sampling with 32 output channels and padding 1 is applied to the feature map of size 64×64×64, giving a feature map X_b of size 64×64×32;
Step b3, the feature maps X_a and X_b are concatenated and activated with a LeakyReLU of negative slope 0.2, giving a feature map of size 64×64×128;
Step b4, a 3×3 convolution with 128 output channels and padding 1 is applied to the feature map obtained in step b3 and activated with a LeakyReLU of negative slope 0.2, giving a feature map of size 64×64×128;
Step b5, max pooling with kernel 2 is applied to the feature map obtained in step b4, finally giving a feature map of size 32×32×128.
Step c is implemented according to the following steps:
Step c1, a 3×3 convolution with 192 output channels and padding 1 is applied to the feature map of size 32×32×128, giving a feature map X_a′ of size 32×32×192;
Step c2, 3×3 up-sampling with 64 output channels and padding 1 is applied to the feature map of size 32×32×128, giving a feature map X_b′ of size 32×32×64;
Step c3, the feature maps X_a′ and X_b′ are concatenated and activated with a LeakyReLU of negative slope 0.2, giving a feature map of size 32×32×256;
Step c4, a 3×3 convolution with 256 output channels and padding 1 is applied to the feature map obtained in step c3 and activated with a LeakyReLU of negative slope 0.2, giving a feature map of size 32×32×256;
Step c5, max pooling with kernel 2 is applied to the feature map obtained in step c4, finally giving a feature map of size 16×16×256.
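Steps b1–b5 and c1–c5 share one pattern, so the whole registration head (steps a–e) can be sketched in PyTorch with a single reusable block. This is a sketch under one stated assumption: the "3×3 up-sampling" operator is read here as a stride-1 transposed convolution, which preserves the spatial size the patent reports:

```python
import torch
import torch.nn as nn

class TransformBlock(nn.Module):
    """Shared pattern of steps b1-b5 / c1-c5: parallel 3x3 convolution and
    3x3 'up-sampling' (assumed: stride-1 transposed convolution, which keeps
    the spatial size), concat + LeakyReLU(0.2), 3x3 conv + LeakyReLU(0.2),
    then kernel-2 max pooling."""
    def __init__(self, c_in, c_conv, c_up, c_out):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_conv, 3, padding=1)
        self.up = nn.ConvTranspose2d(c_in, c_up, 3, padding=1)
        self.fuse = nn.Conv2d(c_conv + c_up, c_out, 3, padding=1)
        self.act = nn.LeakyReLU(0.2)
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        x = self.act(torch.cat([self.conv(x), self.up(x)], dim=1))  # b3 / c3
        return self.pool(self.act(self.fuse(x)))                    # b4-b5 / c4-c5

head = nn.Sequential(
    TransformBlock(64, 96, 32, 128),    # step b: 64x64x64  -> 32x32x128
    TransformBlock(128, 192, 64, 256),  # step c: 32x32x128 -> 16x16x256
    nn.Flatten(),
    nn.Linear(16 * 16 * 256, 1024),     # Linear1
    nn.Linear(1024, 8),                 # Linear2: the eight corner offsets
)

# Step a: concatenate the two 64x64x32 feature maps along channels.
x = torch.cat([torch.randn(1, 32, 64, 64), torch.randn(1, 32, 64, 64)], 1)
offsets = head(x)  # one (x, y) offset per corner -> 8 values
```

The channel counts at each stage (96+32=128, 192+64=256) reproduce the sizes stated in steps b1–c5.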
Step 2, a patch p_B of the reference printed matter image and a patch p_A of the printed matter image to be registered are input to the deep learning registration network, which predicts the offsets H′_4pt of the four corners of p_A relative to the corresponding corners on the reference patch p_B; a direct linear transformation (DLT) then yields the transformation matrix H′;
Step 3, the transformation matrix H′ is used to spatially transform the printed matter image A to be registered, obtaining the registered printed matter image;
Step 4, the network parameters are optimized by computing a loss function between the registered printed matter image and the reference printed matter image, and a more accurately registered printed matter image is output.

Claims (5)

1. A printed matter image registration method based on a convolutional cross-attention mechanism, characterized in that it is implemented according to the following steps:

Step 1, constructing a deep learning registration network;

the convolutional cross-attention mechanism in step 1 is implemented according to the following steps:

Step 1.1, inputting two tensors of given shape, X1, X2 ∈ R^(H×W×C), where H denotes the height of the input feature map, W denotes its width, and C denotes its number of channels; both X1 and X2 have size 64×64×32;

Step 1.2, extending the existing relative position encoding to two dimensions and embedding width and height information in the relative positions of the cross-attention to realize two-dimensional relative cross-attention; the degree of attention of pixel i = (i_x, i_y) to pixel j = (j_x, j_y) is calculated as formula (1):

l_{i,j} = q_i^T (k_j + r^W_{j_x−i_x} + r^H_{j_y−i_y}) / √d_k    (1)

where l_{i,j} denotes the degree of attention of pixel i = (i_x, i_y) to pixel j = (j_x, j_y), q_i^T is the transpose of the query vector of pixel i, d_k is the depth of the keys, k_j is the key vector of pixel j, and r^W_{j_x−i_x} and r^H_{j_y−i_y} are learned embeddings for the relative width j_x−i_x and the relative height j_y−i_y;

Step 1.3, the output of the two-dimensional single-head cross-attention is given by formula (2):

O_h = Softmax((Q K^T + S^rel_H + S^rel_W) / √d_k) V    (2)

where O_h denotes the output of the two-dimensional single-head cross-attention, Softmax(·) denotes normalization, Q = X1 W_Q, K = X2 W_K and V = X2 W_V, W_Q denotes the query weight, W_K the key weight and W_V the value weight, S^rel_H and S^rel_W are logit matrices encoding the relative positions in height and width, X1 and X2 denote the tensor forms of feature map 1 and feature map 2, and d_k denotes the depth of the keys;

Step 1.4, multi-head attention is formed by concatenating single-head attentions, as in formula (3):

MHA(X) = Concat[O_1, ..., O_{N_h}] W^O    (3)

where MHA(X) denotes the multi-head attention tensor of shape (H, W, d_v), Concat[·] denotes concatenation, O_1, ..., O_{N_h} denote the single-head attention outputs, and W^O denotes the output weight;

Step 1.5, the convolution and multi-head cross-attention feature maps are mapped and concatenated to obtain the convolutional cross-attention, which can be written as formula (4):

AAConv(X) = Concat[Conv(X), MHA(X)]    (4)

where AAConv(X) denotes the convolutional cross-attention, Concat[·] denotes concatenation, Conv(X) denotes convolution, and MHA(X) denotes the multi-head attention tensor of shape (H, W, d_v);

Step 1.6, applying batch normalization to the convolutional cross-attention to obtain a feature map X′1 that fuses the features of X2; X′1 has size 64×64×32 and replaces X1;

Step 2, inputting a patch of the reference printed matter image and a patch of the printed matter image to be registered into the deep learning registration network, and obtaining the transformation matrix H′ through direct linear transformation (DLT);

Step 3, spatially transforming the printed matter image A to be registered with the transformation matrix H′ to obtain the registered printed matter image;

Step 4, optimizing the network parameters by computing a loss function between the registered printed matter image and the reference printed matter image, and outputting a more accurately registered printed matter image.

2. The printed matter image registration method based on a convolutional cross-attention mechanism according to claim 1, characterized in that the deep learning registration network in step 1 comprises a convolutional cross-attention mechanism and an upsampling-based deep homography estimation registration network.

3. The printed matter image registration method based on a convolutional cross-attention mechanism according to claim 2, characterized in that the upsampling-based deep homography estimation registration network in step 1 is implemented according to the following steps:

Step a, concatenating the feature map tensors X1 and X2 obtained in step 1 to obtain a single feature map of size 64×64×64;

Step b, performing a transformation operation on the feature map of size 64×64×64;

Step c, performing a transformation operation on the feature map of size 32×32×128;

Step d, inputting the feature map obtained in step c into the fully connected layer Linear1, which takes a feature vector of size 16×16×256 and outputs a feature of size 1024;

Step e, inputting the feature of size 1024 obtained in step d into the fully connected layer Linear2, which takes a feature vector of size 1024 and outputs a feature vector of size 8.

4. The printed matter image registration method based on a convolutional cross-attention mechanism according to claim 3, characterized in that step b is implemented according to the following steps:

Step b1, applying a 3×3 convolution with 96 output channels and padding 1 to the feature map of size 64×64×64 to obtain a feature map X_a of size 64×64×96;

Step b2, applying 3×3 up-sampling with 32 output channels and padding 1 to the feature map of size 64×64×64 to obtain a feature map X_b of size 64×64×32;

Step b3, concatenating the feature maps X_a and X_b and activating with a LeakyReLU of negative slope 0.2 to obtain a feature map of size 64×64×128;

Step b4, applying a 3×3 convolution with 128 output channels and padding 1 to the feature map obtained in step b3 and activating with a LeakyReLU of negative slope 0.2 to obtain a feature map of size 64×64×128;

Step b5, applying max pooling with kernel 2 to the feature map obtained in step b4 to finally obtain a feature map of size 32×32×128.

5. The printed matter image registration method based on a convolutional cross-attention mechanism according to claim 4, characterized in that step c is implemented according to the following steps:

Step c1, applying a 3×3 convolution with 192 output channels and padding 1 to the feature map of size 32×32×128 to obtain a feature map X_a′ of size 32×32×192;

Step c2, applying 3×3 up-sampling with 64 output channels and padding 1 to the feature map of size 32×32×128 to obtain a feature map X_b′ of size 32×32×64;

Step c3, concatenating the feature maps X_a′ and X_b′ and activating with a LeakyReLU of negative slope 0.2 to obtain a feature map of size 32×32×256;

Step c4, applying a 3×3 convolution with 256 output channels and padding 1 to the feature map obtained in step c3 and activating with a LeakyReLU of negative slope 0.2 to obtain a feature map of size 32×32×256;

Step c5, applying max pooling with kernel 2 to the feature map obtained in step c4 to finally obtain a feature map of size 16×16×256.
CN202310624605.7A 2023-05-30 2023-05-30 Printed image registration method based on convolutional cross-attention mechanism Active CN116664633B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310624605.7A CN116664633B (en) 2023-05-30 2023-05-30 Printed image registration method based on convolutional cross-attention mechanism

Publications (2)

Publication Number Publication Date
CN116664633A CN116664633A (en) 2023-08-29
CN116664633B true CN116664633B (en) 2025-11-18

Family

ID=87725501

Country Status (1)

Country Link
CN (1) CN116664633B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113160289A (en) * 2021-03-31 2021-07-23 哈尔滨工业大学(深圳) Industrial printed matter image registration method and device based on deep learning
CN116071410A (en) * 2023-03-14 2023-05-05 南京大学 A method, system, device and medium for point cloud registration based on deep learning

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11321028B2 (en) * 2019-12-11 2022-05-03 Landa Corporation Ltd. Correcting registration errors in digital printing
CN111709909B (en) * 2020-05-12 2024-02-20 苏州科亿信息科技有限公司 Universal printing defect detection method and model based on deep learning

Also Published As

Publication number Publication date
CN116664633A (en) 2023-08-29

Similar Documents

Publication Publication Date Title
Berg et al. Shape matching and object recognition using low distortion correspondences
CN113989340A (en) Point cloud registration method based on distribution
CN113159043A (en) Feature point matching method and system based on semantic information
CN113239954A (en) Attention mechanism-based image semantic segmentation feature fusion method
CN110321830A (en) A kind of Chinese character string picture OCR recognition methods neural network based
CN107506765B (en) License plate inclination correction method based on neural network
CN101706965A (en) Method for colorizing regional image on basis of Gaussian mixture model
CN114612660A (en) Three-dimensional modeling method based on multi-feature fusion point cloud segmentation
CN111652273B (en) Deep learning-based RGB-D image classification method
CN110443261A (en) A kind of more figure matching process restored based on low-rank tensor
CN111178451A (en) A license plate detection method based on YOLOv3 network
CN113837263A (en) Gesture image classification method based on feature fusion attention module and feature selection
CN112633088B (en) A Power Station Capacity Estimation Method Based on Photovoltaic Module Recognition in Aerial Images
CN113160291A (en) Change detection method based on image registration
CN110648276A (en) Dimensionality reduction method for high-dimensional image data based on manifold map and dictionary learning
CN118279907B (en) Chinese herbal medicine image recognition system based on Transformer and CNN
CN114581903A (en) License plate character recognition method based on convolutional neural network
CN118229638A (en) A method for detecting surface defects of continuous strip bamboo strips based on machine vision
CN118298399A (en) A nighttime vehicle target detection method based on YOLOv8 model optimization
CN109766748B (en) Pedestrian re-recognition method based on projection transformation and dictionary learning
CN109325407B (en) Optical remote sensing video target detection method based on F-SSD network filtering
CN116664633B (en) Printed image registration method based on convolutional cross-attention mechanism
CN113869396A (en) PC screen semantic segmentation method based on efficient attention mechanism
CN117422998A (en) Improved river float identification algorithm based on YOLOv5s
CN116385335A (en) A Target Detection Method for Insulator Defects Based on Improved YOLOv4

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant