CN120726659B - Passport background texture image recognition method, system, storage medium and device - Google Patents
Passport background texture image recognition method, system, storage medium and device

Info
- Publication number
- CN120726659B (application number CN202511199053.5A)
- Authority
- CN
- China
- Prior art keywords
- passport
- branch
- background texture
- spectrum
- cnn
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
- G06V30/42—Document-oriented image-based pattern recognition based on the type of document
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/0442—Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0499—Feedforward networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/09—Supervised learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/54—Extraction of image or video features relating to texture
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Software Systems (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Mathematical Physics (AREA)
- Biophysics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Multimedia (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a passport background texture image recognition method, system, storage medium and device, belonging to the field of image processing. The method comprises: inputting a passport background texture image to be recognized, and recognizing it through a double-branch coupled time convolution network TCN. The double-branch coupled time convolution network TCN comprises a CNN branch for extracting local texture features of the passport background texture image to be recognized, a Transformer branch for extracting global texture features of the image, and a context interaction enhancement fusion module for fusing the features extracted by the CNN branch and the Transformer branch, constructing bidirectional attention interaction between the two branches, and introducing a gating mechanism to adaptively fuse the context-enhanced features. The method and device can simultaneously exploit the local fine features and the global structural information of the passport background texture, improving the accuracy and robustness of passport background texture classification.
Description
Technical Field
The present invention relates to the field of image processing, and in particular, to a method, a system, a storage medium, and an apparatus for recognizing a passport background texture image.
Background
The passport background texture is a key security feature: its fine printed design and special microscopic characteristics carry unique security information that is difficult to reproduce. However, as forgery technology continues to advance, forged passports imitate genuine ones ever more closely, which significantly increases the difficulty of traditional manual inspection or simple machine identification and greatly compromises both efficiency and accuracy.
Existing automatic identification methods still lack accuracy when processing highly simulated forged passports, and false positives or missed detections occur easily. In addition, most existing methods rely on simple combinations of global or local features; the lack of deep mining and efficient integration of the micro-texture features of the passport background leads to unstable recognition of weak texture features, limiting practical application. Convolutional neural networks (CNN) excel at extracting local image features, while the Transformer architecture is adept at capturing global dependencies in an image. How to effectively combine the advantages of the two and fully exploit both the local detail and the global structural information of the passport background texture is an important research direction for improving classification accuracy. Although prior studies have begun to combine CNN and Transformer, there is still room for optimization in the specific application scenario of passport background texture. The main defects are: (1) spectral features are under-utilized, and image frequency-domain information is not modeled explicitly; (2) local and global modeling are split, so the CNN and the Transformer cannot model cooperatively; (3) dominant-frequency regions receive no strong focus, and feature extraction lacks a spectrum-guidance mechanism; (4) a spectrum-level supervision mechanism is missing, so the model struggles to learn the frequency structure of a genuine passport. These defects result in low recognition accuracy and poor robustness for weak-texture backgrounds and highly simulated counterfeit images.
Therefore, how to design a deep learning model capable of effectively utilizing local fine features and global structural information simultaneously according to the characteristics of passport background textures and achieving optimization fusion of the two features so as to improve the accuracy and the robustness of passport background texture classification is a key technical problem to be solved in the current passport authenticity identification field.
Disclosure of Invention
The invention aims to overcome the technical problems in the prior art and provides a method, a system, a storage medium and equipment for recognizing a passport background texture image.
The aim of the invention is realized by the following technical scheme:
in a first aspect, a method for recognizing a passport background texture image is provided, including:
Inputting a passport background texture image to be identified;
identifying the passport background texture image to be identified through a double-branch coupled time convolution network TCN, and outputting a classification result of the passport background texture image, wherein the double-branch coupled time convolution network TCN comprises:
a CNN branch for extracting local texture features of the passport background texture image to be identified;
A Transformer branch for extracting global texture features of the passport background texture image to be identified;
The context interaction enhancement fusion module is used for fusing the features extracted by the CNN branch and the Transformer branch, constructing bidirectional attention interaction between the CNN branch and the Transformer branch to realize cross-branch feature enhancement, and introducing a gating mechanism to adaptively fuse the context-enhanced features, so as to improve the accuracy and robustness of feature expression.
In some embodiments, the CNN branch adopts a spectrum-guided convolutional network structure. A spectrum guidance module is added before each convolution module; it explicitly models directional and periodic characteristics through Gabor convolution kernels of different directions and scales to enhance the sensitivity of the CNN to weak texture features. A multi-scale frequency channel module is added after each convolution module; it computes the two-dimensional spectral energy of each channel in each layer's output feature map, generates frequency attention weights through normalization and a multi-layer perceptron, and applies weighted enhancement to channels dense in spectral information, thereby improving the modeling capacity of the CNN in the frequency domain.
In some embodiments, a spectral consistency regularization mechanism (CNN spectral consistency regularization) is introduced in the CNN branch, which guides the CNN to learn feature representations close to the genuine spectrum structure through KL-divergence matching with a predefined genuine-passport spectrum template.
In some embodiments, the Transformer branch includes a spectrum-aware Transformer module, in which a global spectral position encoding mechanism and a spectral aggregation attention mechanism are embedded to enhance the Transformer model's perception of image periodicity and directional texture.
In some embodiments, a spectral consistency regularization mechanism (Transformer spectral consistency regularization) is introduced in the Transformer branch: by estimating the intermediate spectral distribution and aligning it via KL divergence with the average spectrum template of genuine passports, the sensitivity of the Transformer to the dominant frequency structure is enhanced.
In some embodiments, a spectrum consistency regularization mechanism (fusion level spectrum consistency regularization) is introduced into the context interaction enhanced fusion module, and the global feature expression is guided to be close to the real texture structure by supervising the consistency of the spectrum distribution of the final fusion feature map and the real passport spectrum template.
In some embodiments, the dual-branch coupled time convolution network TCN employs a multi-objective joint loss function framework including a main classification loss and a plurality of spectral guide loss terms to achieve high precision classification of passport background texture images consistent with spectral structure.
In a second aspect, there is provided a passport background texture image identification system comprising:
The image input module is used for inputting a passport background texture image to be identified;
The image recognition module is used for recognizing the passport background texture image to be recognized through a double-branch coupled time convolution network TCN and outputting a classification result of the passport background texture image, wherein the double-branch coupled time convolution network TCN comprises:
a CNN branch for extracting local texture features of the passport background texture image to be identified;
A Transformer branch for extracting global texture features of the passport background texture image to be identified;
The context interaction enhancement fusion module is used for fusing the features extracted by the CNN branch and the Transformer branch, constructing bidirectional attention interaction between the CNN branch and the Transformer branch, and introducing a gating mechanism to adaptively fuse the context-enhanced features.
In a third aspect, a computer readable storage medium is provided, where a computer program is stored, where the computer program, when executed by a processor, implements a method for identifying a passport background texture image according to the first aspect.
In a fourth aspect, an electronic device is provided, including a memory and a processor, the memory storing computer instructions executable on the processor, the processor executing the method for identifying a passport background texture image according to the first aspect when the computer instructions are executed.
It should be further noted that the technical features corresponding to the above embodiments may be combined with each other or replaced to form a new technical solution without collision.
Compared with the prior art, the invention has the beneficial effects that:
1. The invention utilizes a double-branch coupled time convolution network TCN to recognize the passport background texture image: local texture features are extracted by the CNN branch, global texture features are extracted by the Transformer branch, and the features extracted by the two branches are fused by the context interaction enhancement fusion module. Bidirectional attention interaction between the CNN branch and the Transformer branch realizes cross-branch feature enhancement and builds a unified frequency-domain semantic representation, which improves feature expression capability and thus the classification accuracy of passport background texture images.
2. The invention integrates a spectrum guidance module and a multi-scale frequency channel module in the CNN branch. The spectrum guidance module enhances the sensitivity of the CNN to weak texture features by explicitly modeling directional and periodic features with Gabor convolution kernels of different directions and scales; the multi-scale frequency channel module applies weighted enhancement to channels dense in spectral information, improving the CNN's modeling capability in the frequency domain. The method remains stable in complex scenarios such as high-fidelity forgeries and weak textures.
3. The invention introduces a global spectral position encoding mechanism and a spectral aggregation attention mechanism into the Transformer branch, improving the Transformer model's perception of image periodicity and directional texture structure.
4. The invention enforces consistency of the spectral structure: through multi-level spectral consistency regularization constraints, the model's learned representations better fit the frequency characteristics of a genuine passport;
5. The double-branch coupled time convolution network TCN adopts a multi-objective joint loss function framework comprising a main classification loss and several spectral guidance loss terms, ensuring that high-precision classification of the passport background texture image remains consistent with the spectral structure, and improving robustness and accuracy simultaneously.
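As an illustrative sketch of the multi-objective joint loss described in point 5 (the loss-weight values and the use of plain softmax cross-entropy are assumptions; the patent does not specify them), the total objective can be formed as the classification loss plus weighted spectral guidance terms:

```python
import numpy as np

def cross_entropy(logits, label):
    # Softmax cross-entropy for a single sample, computed in a numerically stable way.
    z = logits - logits.max()
    logp = z - np.log(np.exp(z).sum())
    return -logp[label]

def joint_loss(logits, label, spec_losses, lambdas=(0.1, 0.1, 0.1)):
    """Main classification loss plus weighted spectral guidance terms.

    spec_losses: spectral consistency losses of the CNN branch, Transformer
    branch, and fusion level. lambdas are illustrative weights.
    """
    return cross_entropy(logits, label) + sum(l * s for l, s in zip(lambdas, spec_losses))
```

With all spectral terms at zero the objective reduces to the plain classification loss, which makes the weighting behavior easy to check in isolation.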
Drawings
FIG. 1 is a schematic diagram of the overall framework of a TCN model according to the present invention;
FIG. 2 is a schematic diagram of CNN block in a CNN branch of the present invention;
FIG. 3 is a schematic view of Transblock in a Transformer branch of the present invention;
fig. 4 is a schematic structural diagram of a context interaction enhanced fusion module according to the present invention.
Detailed Description
The following describes the embodiments of the present application clearly and completely with reference to the accompanying drawings, in which some, but not all, embodiments of the application are shown. The components of the embodiments of the present application, as generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations. All other embodiments obtained by those skilled in the art based on the embodiments of the application without inventive effort fall within the scope of the application.
It should be noted that the drawbacks of the prior-art solutions above were identified by the inventors after careful practice and study. Therefore, the process of discovering these problems, and the solutions proposed below to address them, should be regarded as the inventors' contribution to the application during the inventive process, and should not be construed as common knowledge of those skilled in the art.
Aiming at the technical problems pointed out in the background art, the embodiment provided by the invention is as follows:
In an exemplary embodiment, a method for identifying a passport background texture image is provided, comprising:
Inputting a passport background texture image to be identified;
identifying the passport background texture image to be identified through a double-branch coupled time convolution network TCN, and outputting a classification result of the passport background texture image, wherein the double-branch coupled time convolution network TCN comprises:
a CNN branch for extracting local texture features of the passport background texture image to be identified;
A Transformer branch for extracting global texture features of the passport background texture image to be identified;
The context interaction enhancement fusion module is used for fusing the features extracted by the CNN branch and the Transformer branch, constructing bidirectional attention interaction between the CNN branch and the Transformer branch to realize cross-branch feature enhancement, and introducing a gating mechanism to adaptively fuse the context-enhanced features.
Specifically, the passport background texture recognition model provided by the invention adopts a double-branch coupled structure, as shown in fig. 1. In the CNN branch, the input image first undergoes preliminary channel mapping and downsampling through a two-dimensional convolution layer, and multi-scale local texture features are then extracted sequentially through four residual convolution modules (CNN Block 1 to CNN Block 4). In the Transformer branch, the input image is first divided into image blocks by a Patch & Embedding module: Patch denotes splitting the input passport background texture image into small blocks, and Embedding denotes the linear projection that maps each patch to a high-dimensional vector. The patches are then fed sequentially into four Trans Block layers (Trans Block 1 to Trans Block 4); each Trans Block extracts global image features through a self-attention mechanism and a multi-layer perceptron, models the global dependencies among image blocks through multi-head self-attention, and strengthens the modeling of texture periodicity and directionality by incorporating a spectrum-aware mechanism. To effectively fuse the local features of the CNN branch with the global structural information of the Transformer branch, the invention provides a context interaction enhancement fusion module (CCAF); the spatial dimensions (height and width) of the fused feature map are compressed by global average pooling (GAP), and the classification of the passport background texture is output by a fully connected layer (FC). In fig. 1, Lithography printing denotes lithographic printing, LASER PRINTING denotes laser printing, INK PRINTING denotes inkjet printing, and Transformed design denotes a transformed design.
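The gating step of the fusion module described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the bidirectional attention interaction is omitted, and the gate is modeled as a single sigmoid projection over the concatenated channels (`w_gate` is an assumed, illustrative parameter):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(f_cnn, f_trans, w_gate):
    """Adaptively fuse two aligned feature maps with a learned gate.

    f_cnn, f_trans: (C, H, W) feature maps from the two branches.
    w_gate: (2C,) weights of an illustrative 1x1 gating projection.
    """
    stacked = np.concatenate([f_cnn, f_trans], axis=0)     # (2C, H, W)
    g = sigmoid(np.tensordot(w_gate, stacked, axes=1))     # (H, W) gate in (0, 1)
    fused = g * f_cnn + (1.0 - g) * f_trans                # convex combination per pixel
    # Global average pooling over the spatial dims, as done before the FC classifier.
    return fused, fused.mean(axis=(1, 2))
```

Because the gate is a convex combination, each fused value stays between the corresponding CNN and Transformer feature values, so neither branch can be drowned out by the other.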
1. CNN branch
The invention provides a spectrum-guided convolutional network structure (Spectral-Guided CNN), shown in fig. 2, which explicitly models the texture response characteristics of the passport background texture image in the spatial-frequency domain, enhancing the sensitivity of the CNN to orientation, periodicity, weak textures, and similar characteristics. Passport background textures have distinct periodicity, directionality, and fine-granularity characteristics, and can therefore be modeled as a two-dimensional finite-support function $f(x, y)$ in the spatial domain, whose spectrum is defined by the two-dimensional Fourier transform:

$F(u, v) = \iint f(x, y)\, e^{-j 2\pi (ux + vy)}\, dx\, dy$

where $f(x, y)$ denotes the two-dimensional finite-support function in the spatial domain, $F(u, v)$ its spectral distribution, $u$ the spatial frequency component in the horizontal direction of the frequency domain (transverse frequency), $v$ the spatial frequency component in the vertical direction (longitudinal frequency), and $j$ the imaginary unit.
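The discrete analogue of this transform makes the point concrete: for a synthetic periodic "background texture", the spectral energy concentrates at the stripe frequency. A minimal numpy sketch (the stripe pattern and image size are illustrative):

```python
import numpy as np

# A synthetic "background texture": vertical stripes, 8 cycles across the image.
H = W = 64
x = np.arange(W)
img = np.tile(np.sin(2 * np.pi * 8 * x / W), (H, 1))

# Discrete analogue of F(u, v): 2D FFT, shifted so zero frequency sits at the center.
spec = np.fft.fftshift(np.fft.fft2(img))
mag = np.abs(spec)
mag[H // 2, W // 2] = 0.0   # suppress the DC term before locating the peak

peak = np.unravel_index(np.argmax(mag), mag.shape)
# The energy concentrates at +/-8 cycles along the horizontal frequency axis.
```

This is exactly the structure the spectrum-guided modules below are designed to expose: a periodic, directional texture shows up as a sharp, localized peak in $|F(u, v)|$.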
This spectral distribution reveals the energy distribution of the passport background at different spatial frequencies, whereas conventional CNNs have difficulty in explicitly modeling this spectrum under receptive field limitations. Therefore, the invention designs a spectrum-guided convolutional network structure, which realizes the organic combination of CNN and texture spectrum modeling through the following three aspects:
1.1. Spectrum guide structure design
The CNN backbone adopts ResNet modules, and a Gabor spectrum guidance module (Gabor Module) is added before each ResNet block. The Gabor spectrum guidance module uses Gabor kernels of different directions and scales (e.g., 0°, 45°, 90°, 135°) to obtain multiple directional responses by convolution with the input image or feature map:

$R_{\theta_k} = X * G_{\theta_k}$

where $R_{\theta_k}$ denotes the response after filtering with the Gabor kernel in direction $\theta_k$, $X$ denotes the input image or feature map, and $G_{\theta_k}$ denotes the Gabor filter kernel in direction $\theta_k$. All directional responses are concatenated and fed into a 1×1 convolution for channel compression, and the final output serves as the input to the backbone ResNet block:

$X' = \mathrm{Conv}_{1\times 1}(\mathrm{Concat}(R_{\theta_1}, \ldots, R_{\theta_K}))$

where $\mathrm{Conv}_{1\times 1}$ denotes the 1×1 convolution, $\mathrm{Concat}$ denotes concatenation along the channel dimension, and $X'$ denotes the final convolution result used as the input feature of the subsequent backbone ResNet block (residual convolution module).
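A minimal sketch of the four-orientation Gabor bank (kernel size, sigma, and wavelength are illustrative assumptions, and circular FFT convolution stands in for padded spatial convolution for brevity):

```python
import numpy as np

def gabor_kernel(theta, ksize=15, sigma=3.0, lam=6.0):
    """Real-valued Gabor kernel oriented at angle theta (illustrative parameters)."""
    half = ksize // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)   # coordinate along the carrier direction
    return np.exp(-(x**2 + y**2) / (2 * sigma**2)) * np.cos(2 * np.pi * xr / lam)

def gabor_responses(img, thetas=(0.0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)):
    """Filter img with the four oriented kernels; one response channel per direction."""
    out = []
    for t in thetas:
        K = np.fft.fft2(gabor_kernel(t), s=img.shape)
        out.append(np.real(np.fft.ifft2(np.fft.fft2(img) * K)))
    return np.stack(out)   # (4, H, W)
```

In the patent the four responses are then concatenated and compressed by the 1×1 convolution; the sketch stops at the stacked responses. A texture whose stripes vary along x should excite the 0° kernel far more than the 90° one.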
This structure provides the CNN front layer with direction sensing capabilities, directing it to focus on the salient structure direction in the real passport texture.
Further, BatchNorm2d + ReLU in the residual convolution module is a batch normalization plus linear rectification unit: BatchNorm2d performs channel-level normalization on the feature maps output by each layer, accelerating training convergence and reducing internal covariate shift (it can also be used independently), while ReLU performs linear rectification and enhances feature expression capability.
1.2. ResNet module
ResNet18, one of the outstanding network structures in the computer vision field, has strong feature extraction capability, so the structural design of ResNet is referenced and its related structures are used. Specifically, each ResNet block consists of two basic blocks in series. Each basic block contains two convolutional layers and batch normalization (BN) operations. In the first basic block, the input feature map is first convolved, then batch-normalized and ReLU-activated, and finally processed by the second convolutional layer. The second basic block has a similar structure, also containing two convolution operations and batch normalization. In addition, to avoid the vanishing-gradient problem during deep network training, a residual connection is introduced inside each basic block to realize effective feature transfer and stable propagation of the gradient flow. The structure of a single basic block can be described by the following formulas:

$y = \mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}(x)))$
$F_{\mathrm{CNN}} = \mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}(y)) + x)$

where $y$ is the intermediate feature of the block, $\mathrm{Conv}$ is a convolution operation, $\mathrm{BN}$ is a batch normalization operation, and $F_{\mathrm{CNN}}$ is the final output of the CNN branch block.
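The two formulas above can be traced step by step in a minimal numpy sketch of one basic block (a naive 3×3 convolution and per-channel normalization stand in for trained Conv/BatchNorm layers; the weights are illustrative):

```python
import numpy as np

def conv3x3(x, w):
    """Naive 3x3 'same' convolution. x: (C_in, H, W), w: (C_out, C_in, 3, 3)."""
    c_out = w.shape[0]
    c_in, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((c_out, h, wd))
    for co in range(c_out):
        for ci in range(c_in):
            for dy in range(3):
                for dx in range(3):
                    out[co] += w[co, ci, dy, dx] * xp[ci, dy:dy + h, dx:dx + wd]
    return out

def bn(x, eps=1e-5):
    """Per-channel normalization standing in for BatchNorm at inference."""
    mean = x.mean(axis=(1, 2), keepdims=True)
    std = x.std(axis=(1, 2), keepdims=True)
    return (x - mean) / (std + eps)

def basic_block(x, w1, w2):
    """conv -> BN -> ReLU -> conv -> BN, plus the identity shortcut, then ReLU."""
    y = np.maximum(bn(conv3x3(x, w1)), 0.0)
    y = bn(conv3x3(y, w2))
    return np.maximum(y + x, 0.0)   # residual connection stabilizes gradient flow
```

The shortcut `+ x` is the residual connection that lets gradients bypass the two convolutions, which is what mitigates vanishing gradients in deep stacks of such blocks.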
1.3. Multi-scale frequency channel module (MS-FCEM)
To capture the frequency-domain response characteristics of textures at different scales, a frequency-energy modeling mechanism is designed. For the output feature map of each layer $X^{(l)} \in \mathbb{R}^{C \times H \times W}$, the two-dimensional spectral energy of each channel is computed:

$E_c^{(l)} = \sum_{u,v} \left|\mathcal{F}\{X_c^{(l)}\}(u, v)\right|^2$

where $C$ denotes the number of channels, $H$ the height, $W$ the width, $X_c^{(l)}$ the feature map of the $c$-th channel at the $l$-th layer, $\mathcal{F}$ the two-dimensional discrete Fourier transform, $E_c^{(l)}$ the frequency intensity of the channel, and $(u, v)$ the frequency-domain coordinates. This energy serves as a frequency attention weight for steering the network toward channels dense in spectral information while avoiding interference from high-frequency noise. The per-layer energy vector $E^{(l)}$ is normalized, and channel weighting coefficients are generated with an MLP:

$w^{(l)} = \sigma(\mathrm{MLP}(\mathrm{Norm}(E^{(l)})))$

The spectral weighting of each channel feature is then performed:

$\tilde{X}_c^{(l)} = w_c^{(l)} \cdot X_c^{(l)}$
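The three steps above (per-channel spectral energy, normalization plus a nonlinearity, channel reweighting) can be sketched as follows; a sigmoid stands in for the trained two-layer MLP, so the exact weight values are illustrative:

```python
import numpy as np

def frequency_channel_weights(feat):
    """Per-channel spectral energy -> normalized channel weights -> reweighted map.

    feat: (C, H, W) feature map of one layer. A sigmoid over the normalized
    energies stands in for the MLP of the patent's MS-FCEM.
    """
    F = np.fft.fft2(feat, axes=(1, 2))
    energy = (np.abs(F) ** 2).sum(axis=(1, 2))             # E_c: one scalar per channel
    e = (energy - energy.mean()) / (energy.std() + 1e-8)   # normalize across channels
    w = 1.0 / (1.0 + np.exp(-e))                           # stand-in for the MLP + sigmoid
    return w[:, None, None] * feat, w
```

A channel carrying a strong periodic pattern has high spectral energy and therefore receives a larger weight than a flat (information-poor) channel.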
CNN spectral consistency regularization
In order to improve the frequency domain discrimination capability of the network to the true and false differences of the passports, a frequency spectrum consistency regularization term (Spectral Consistency Regularization) is introduced. The mechanism is implemented by constructing a spectral template of the real passport backgroundAs a reference for the learning of the network spectral response. The template is constructed as follows:
wherein Is the firstA real passport background image (removing noise and stamp interference). Since the passport background texture is highly standardized in design and the directional texture distribution is stable, the template has good repeatability and generalization. We introduce KL divergence as a regularization term, the objective function is:
wherein, the Representing a loss of spectral consistency of the CNN branches,Represents the degree of divergence of KL,Indicating the number of total convolutions of the layer,AndIs a normalization factor, and has the function of converting a spectrum energy diagram into a probability distribution form so as to enable the spectrum energy diagram to meet the definition requirement of KL divergence. The mechanism guides the network spectrum structure to be close to the real template, and improves the sensitivity of the model to the spectrum distortion of the fake texture.
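The role of the normalization factors and the KL term can be sketched in a few lines (a small epsilon is added for numerical safety; it is an implementation detail, not part of the patent's formula):

```python
import numpy as np

def spectral_kl(template_energy, feat_energy, eps=1e-12):
    """KL(P_ref || Q_feat) between two spectral energy maps.

    Each map is first normalized to a probability distribution, which is the
    role of the Z normalization factors in the regularization term.
    """
    p = template_energy / (template_energy.sum() + eps)
    q = feat_energy / (feat_energy.sum() + eps)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))
```

The divergence is zero when the network's spectral response matches the template exactly and strictly positive otherwise, so minimizing it pulls the learned spectrum toward the genuine-passport reference.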
2. Spectrum-aware Transformer branch
To model the cross-region periodic structure, dominant frequency distribution, and directional texture in the passport image, the global branch adopts a structured semantic modeling path composed of multiple Transformer layers. Unlike conventional vision Transformers, which perform attention computation relying only on spatial position information, the invention proposes a spectrum-driven global modeling module, as shown in fig. 3: in each Transformer Block, spectral position encoding, dominant-frequency region sensing, and a spectral aggregation attention mechanism are embedded to form a global semantic expression path with frequency-domain perception capability.
2.1. Global spectrum position code (Global Spectral Position Encoding, GSPE)
In order to improve the perception capability of a transducer to an image periodicity and directional texture structure, a spectrum position coding mechanism is provided, and a two-dimensional spectrogram is used for replacing the traditional fixed position coding, so that Token embedding with frequency perception capability is formed. First, for an input imagePerforming two-dimensional Fourier transform to obtain frequency spectrum amplitude map, down-sampling to fixed size by global average pooling, capturing frequency main distribution of the whole map, and mapping to frequency spectrum position coding vector by two-layer MLPAnd adding each Patch in an embedded way to form a Token sequence after spectrum modulation:
z_i = x_i + e_i, wherein z_i represents the i-th Token vector after spectral modulation, x_i represents the original embedding vector of the i-th image block (Patch), and e_i represents the spectral position code corresponding to the i-th Token.
This encoding injects the frequency distribution information into the Transformer as a prior, so that the direction of the dominant frequency axes and the periodic characteristics of the global structure are preserved during attention computation.
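The GSPE pipeline above (2-D FFT magnitude map → global average pooling to a fixed grid → two-layer MLP → a spectral code added to every Patch embedding) can be sketched as follows; the pooling grid, MLP widths, and random weights are illustrative assumptions, not values from the patent.

```python
import numpy as np

def global_spectral_position_encoding(image, embed_dim, pool=4, seed=0):
    """GSPE sketch: FFT magnitude map -> average pooling to a pool x pool
    grid -> 2-layer MLP -> one spectral code vector shared by all Patches.
    MLP weights are random stand-ins for learned parameters."""
    rng = np.random.default_rng(seed)
    mag = np.abs(np.fft.fft2(image))                     # spectrum magnitude map
    h, w = mag.shape
    pooled = mag.reshape(pool, h // pool, pool, w // pool).mean(axis=(1, 3))
    x = pooled.ravel()                                   # fixed-size frequency summary
    W1 = rng.standard_normal((x.size, 32)) * 0.1         # hidden width 32 (assumed)
    W2 = rng.standard_normal((32, embed_dim)) * 0.1
    return np.tanh(x @ W1) @ W2                          # spectral position code

rng = np.random.default_rng(1)
image = rng.standard_normal((32, 32))                    # toy grayscale input
patches = rng.standard_normal((64, 16))                  # 64 Patch embeddings, dim 16
e_spec = global_spectral_position_encoding(image, embed_dim=16)
tokens = patches + e_spec                                # spectrum-modulated Tokens
```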
2.2. Self-Attention mechanism (Self-Attention in FIG. 3)
The self-attention structure captures long-range dependencies by computing correlations between different positions in the input sequence, allowing each position to be represented as a weighted sum of information from the other positions. Specifically, the input is first reshaped so that the pixels of the partitioned image are flattened into a Token sequence; Q, K, and V are generated by linear transformation (Linear in FIG. 3); the similarity matrix of Q and K is then computed by matrix multiplication (MatMul in FIG. 3) and normalized by SoftMax to obtain the attention weights; finally, the output at each position is obtained by weighting V with these weights. Since the input and output dimensions of the self-attention mechanism are usually kept unchanged, an average pooling layer is introduced after self-attention to halve the feature map size, facilitating fusion with the features from the CNN branch in the context interaction enhancement fusion module (CCAF).
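The sequence described above — linear projections to Q, K, and V, scaled MatMul similarity, SoftMax normalization, weighting of V, then an average pooling layer that halves the feature map — can be sketched in NumPy (single head; all dimensions and weights are toy assumptions):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_with_pool(tokens, Wq, Wk, Wv, grid):
    """Single-head self-attention (Linear -> MatMul -> SoftMax -> weight V)
    followed by 2x2 average pooling that halves the token grid so the
    result can be fused with the CNN-branch features."""
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv      # linear projections
    attn = softmax(Q @ K.T / np.sqrt(K.shape[-1]))       # scaled similarity + SoftMax
    out = attn @ V                                       # weighted sum over V
    fmap = out.reshape(grid, grid, -1)                   # back to a spatial map
    return fmap.reshape(grid // 2, 2, grid // 2, 2, -1).mean(axis=(1, 3))

rng = np.random.default_rng(0)
d = 8
tokens = rng.standard_normal((16, d))                    # 4 x 4 grid of tokens
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
pooled = self_attention_with_pool(tokens, Wq, Wk, Wv, grid=4)
```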
2.3. Spectrum aggregation attention mechanism (Spectral Aggregation Attention, SAA)
In order to further improve the Transformer's discrimination of dominant-frequency energy regions, the invention designs a spectral-energy-guided attention aggregation module (SEAM) at the end of each Transformer Block, serving as a weighted feature fusion unit before the output. First, the output of each patch is spectrally transformed, and its importance score is estimated by a 1-layer MLP:
These weights are then used to perform spectrum-weighted aggregation over all tokens, generating the final output feature of the Transformer branch:
This aggregation mechanism strengthens the output contribution of dominant-frequency-band tokens, forming an enhanced response to the main texture bands before the classification head and thereby improving the perception of periodic texture differences.
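A minimal sketch of this spectrum-weighted aggregation: each token is spectrally transformed, scored by a 1-layer projection, and the softmax-normalized scores weight the sum over all tokens. The scoring weights here are random stand-ins for the learned MLP.

```python
import numpy as np

def spectral_aggregation_attention(tokens, w_score):
    """SAA sketch: spectrally transform each token, score it with a
    1-layer projection (random stand-in for the learned MLP), softmax
    the scores over tokens, and return the weighted aggregate."""
    spec = np.abs(np.fft.fft(tokens, axis=-1))           # per-token spectrum
    scores = spec @ w_score                              # importance score per token
    a = np.exp(scores - scores.max())
    a = a / a.sum()                                      # softmax over tokens
    return (a[:, None] * tokens).sum(axis=0)             # spectrum-weighted output

rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 8))                    # 16 tokens, dim 8
w_score = rng.standard_normal(8) * 0.1                   # assumed scoring weights
f_out = spectral_aggregation_attention(tokens, w_score)
```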
Further, the output processed by the spectrum aggregation attention mechanism is reshaped and then subjected to two-dimensional average pooling (AvgPool2d in FIG. 3).
2.4. Transformer spectral consistency regularization
In order to guide the Transformer to focus on the dominant-frequency energy regions during attention construction, the invention provides a spectral energy regularization module that supervises whether the overall frequency response extracted by the Transformer is consistent with the spectrum of a real passport. First, in each Transformer layer, a linear projection is applied to all tokens of the intermediate feature to obtain a spectrum estimate. All patch spectra are then aggregated to form the predicted full-image spectrum distribution. Finally, a supervision constraint is imposed against a predefined average spectrum template of real passports via a KL divergence loss:
3. Context interaction enhancement fusion module (Contextual Cross-Attentive Fusion, CCAF)
In order to further improve the quality of feature fusion between the CNN branch and the Transformer branch, the invention provides a context interaction enhancement fusion module (Contextual Cross-Attentive Fusion Module, abbreviated as CCAF), as shown in FIG. 4. Instead of the conventional simple weighting based on global average pooling and a fully connected layer, the module introduces a cross-branch attention mechanism, feature interaction modeling, and a gating adjustment mechanism to construct a nonlinear, dynamic fusion structure, thereby achieving a higher-quality joint representation.
First, a cross-branch interactive attention mechanism (Cross-Attention) is introduced. Let the CNN feature be Q and the Transformer feature be K and V, and construct the cross-attention output (i.e., CNN attending to the Transformer):
wherein the output term represents the cross-attention output of CNN to the Transformer, Q represents the local features of the CNN branch, K and V represent the global features of the Transformer branch, W_Q, W_K and W_V are learnable weight matrices, d_k represents the dimension of the K vectors, and SoftMax(·) represents the normalized exponential function. Symmetrically, a Transformer-to-CNN attention output is also constructed to enhance the CNN features. After cross-attention enhancement, two context-aware feature representations are obtained:
In the fusion stage, instead of directly using linear weighting, a gating mechanism is constructed to adaptively control the contributions of the two branches to the final output. The gating is determined by the global context of the fused features, as follows:
The fusion output is:
wherein z represents the global context vector obtained by pooling (implemented by the adaptive two-dimensional average pooling AdaptiveAvgPool2d in FIG. 4), g represents the gating vector/weights, and ⊙ denotes element-wise multiplication. Compared with a single-weight weighting approach, this structure can adaptively control the contribution of the different feature sources for each channel.
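Under the same reading, the CCAF fusion step — cross-attention of CNN features over Transformer features, a pooled global context vector z, and a sigmoid gate g mixing the two branches channel-wise — might look like this (one attention direction shown; all weight matrices and dimensions are toy assumptions):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def ccaf_fuse(F_cnn, F_tr, Wq, Wk, Wv, Wg):
    """CCAF sketch (one attention direction): CNN tokens attend to
    Transformer tokens; a gate computed from the pooled global context
    then mixes the context-enhanced CNN feature with the Transformer
    feature channel-wise. The symmetric direction would mirror this."""
    Q, K, V = F_cnn @ Wq, F_tr @ Wk, F_tr @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))          # CNN -> Transformer attention
    F_cnn_ctx = F_cnn + A @ V                            # context-aware CNN feature
    z = np.concatenate([F_cnn_ctx.mean(0), F_tr.mean(0)])  # global context (pooling)
    g = 1.0 / (1.0 + np.exp(-(z @ Wg)))                  # sigmoid gating vector
    return g * F_cnn_ctx + (1.0 - g) * F_tr              # gated channel-wise fusion

rng = np.random.default_rng(3)
n, d = 16, 8
F_cnn, F_tr = rng.standard_normal((n, d)), rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
Wg = rng.standard_normal((2 * d, d)) * 0.1
fused = ccaf_fuse(F_cnn, F_tr, Wq, Wk, Wv, Wg)
```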
In order to further improve the frequency-domain expressive capability of the joint features after CNN-Transformer fusion, the invention introduces a fusion-level spectral consistency regularization loss function in the context interaction enhancement fusion module (CCAF), which supervises the spectral structure of the final fused feature map so that it remains consistent with the frequency distribution of real passport background images. The fusion-level spectral consistency regularization loss function is defined as follows:
4. Final classification and loss function
In order to ensure that the model achieves high accuracy, strong robustness, and spectral structure consistency in the passport background texture image classification task, the invention designs a multi-objective joint loss function framework for model training, which combines the main classification loss with several spectrum-guided loss terms and guides the network to learn the real frequency distribution structure of passport images from both global and local perspectives. The overall loss function consists of the following parts:
wherein λ1, λ2 and λ3 are hyper-parameter weights, and B represents the number of corresponding blocks in FIG. 1 (B = 4 in this embodiment). The remaining terms represent, respectively, the spectral consistency loss of the CNN branch, the spectral consistency loss of the Transformer branch, the spectral consistency loss of the fused features, and the main classification loss used to complete the final classification task for the passport background texture image. The main classification loss adopts Focal Loss to reduce the loss contribution of easily classified samples and increase attention to hard samples, and is defined as follows:
wherein p_t is the probability with which the model predicts the i-th sample as the correct class, y is the ground-truth label, γ is the modulation factor, and CE is the cross-entropy loss.
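The Focal Loss term and the joint objective can be checked numerically; the λ weights chosen here are arbitrary placeholders, not the patent's hyper-parameters:

```python
import numpy as np

def focal_loss(p_t, gamma=2.0):
    """Focal Loss for one sample: -(1 - p_t)^gamma * log(p_t), where p_t
    is the predicted probability of the correct class and gamma is the
    modulation factor that down-weights easy samples."""
    return -((1.0 - p_t) ** gamma) * np.log(p_t)

def total_loss(l_cls, l_cnn, l_tr, l_fuse, lam1=0.1, lam2=0.1, lam3=0.1):
    """Joint objective: main classification loss plus the three weighted
    spectral-consistency terms; the lambda weights are placeholders."""
    return l_cls + lam1 * l_cnn + lam2 * l_tr + lam3 * l_fuse

easy = focal_loss(0.9)    # well-classified sample: small contribution
hard = focal_loss(0.3)    # hard sample: much larger contribution
```

With gamma = 2 the easy sample's loss is roughly 0.001 versus about 0.59 for the hard one, which is exactly the down-weighting effect described above.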
In another exemplary embodiment, based on the same inventive concept as the method embodiment, there is provided a passport background texture image identification system, comprising:
The image input module is used for inputting a passport background texture image to be identified;
The image recognition module is used for recognizing the passport background texture image to be recognized through a double-branch coupled time convolution network TCN and outputting a classification result of the passport background texture image, wherein the double-branch coupled time convolution network TCN comprises:
a CNN branch for extracting local texture features of the passport background texture image to be identified;
A Transformer branch for extracting global texture features of the passport background texture image to be identified;
The context interaction enhancement fusion module is used for fusing the features extracted by the CNN branch and the Transformer branch, constructing bidirectional attention interaction between the CNN branch and the Transformer branch, and introducing a gating mechanism to adaptively fuse the context-enhanced features.
In another exemplary embodiment, based on the same inventive concept as the method embodiment, a computer readable storage medium is provided, which stores a computer program that, when executed by a processor, implements the passport background texture image identification method provided by the embodiment of the present invention. Based on such understanding, the technical solution of this embodiment, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods of the embodiments of the present invention. The storage medium includes any medium that can store program code, such as a USB flash disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disc.
In another exemplary embodiment, based on the same inventive concept as the method embodiment, an electronic device is provided, including a memory and a processor, where the memory stores computer instructions executable on the processor, and the processor executes the computer instructions to perform a method for identifying a passport background texture image provided by the embodiment of the present invention.
The processor may be a single-core or multi-core central processing unit, an application-specific integrated circuit, or one or more integrated circuits configured to implement the invention.
Embodiments of the subject matter and the functional operations described in this specification can be implemented in tangible computer software or firmware, in computer hardware including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible, non-transitory program carrier for execution by, or to control the operation of, data processing apparatus. Alternatively or additionally, the program instructions may be encoded on a manually-generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode and transmit information to suitable receiver apparatus for execution by data processing apparatus.
The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform corresponding functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
Processors suitable for the execution of a computer program include, by way of example, general and/or special purpose microprocessors, or any other type of central processing unit. Typically, the central processing unit will receive instructions and data from a read only memory and/or a random access memory. The essential elements of a computer include a central processing unit for carrying out or executing instructions and one or more memory devices for storing instructions and data. Typically, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks, etc. However, a computer does not have to have such a device. Furthermore, the computer may be embedded in another device, such as a mobile phone, a Personal Digital Assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device such as a Universal Serial Bus (USB) flash drive, to name a few.
It should be understood that each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The foregoing detailed description of the invention is provided for illustration, and it is not to be construed that the detailed description of the invention is limited to only those illustration, but that several simple deductions and substitutions can be made by those skilled in the art without departing from the spirit of the invention, and are to be considered as falling within the scope of the invention.
Claims (6)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202511199053.5A CN120726659B (en) | 2025-08-26 | 2025-08-26 | Passport background texture image recognition method, system, storage medium and device |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN120726659A CN120726659A (en) | 2025-09-30 |
| CN120726659B (en) | 2025-12-09 |
Family
ID=97170506
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202511199053.5A Active CN120726659B (en) | 2025-08-26 | 2025-08-26 | Passport background texture image recognition method, system, storage medium and device |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN120726659B (en) |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN112819039A (en) * | 2021-01-14 | 2021-05-18 | 华中科技大学 | Texture recognition model establishing method based on multi-scale integrated feature coding and application |
| CN117593798A (en) * | 2023-11-28 | 2024-02-23 | 齐鲁工业大学(山东省科学院) | A method for detecting fake faces based on dual-branch feature enhancement |
Family Cites Families (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN115713776B (en) * | 2022-11-10 | 2025-10-31 | 华中科技大学 | Universal certificate structured recognition method and system based on deep learning |
| CN115797931B (en) * | 2023-02-13 | 2023-05-30 | 山东锋士信息技术有限公司 | Remote sensing image semantic segmentation method and device based on double-branch feature fusion |
| CN119027796B (en) * | 2024-10-28 | 2025-02-25 | 河南科技学院 | A double-branch network method for underwater dam crack detection |
| CN120259657A (en) * | 2025-03-20 | 2025-07-04 | 河南科技大学 | A real-time semantic segmentation method based on dual-branch encoding-decoding network |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||