
CN119399068B - Double-branch multi-scale image defogging method based on high-quality codebook - Google Patents

Double-branch multi-scale image defogging method based on high-quality codebook

Info

Publication number
CN119399068B
CN119399068B (application CN202411445292.XA)
Authority
CN
China
Prior art keywords
image
codebook
encoder
branch
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202411445292.XA
Other languages
Chinese (zh)
Other versions
CN119399068A (en)
Inventor
尹学辉
武沛鑫
李泽宏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202411445292.XA
Publication of CN119399068A
Application granted
Publication of CN119399068B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00 Image enhancement or restoration
    • G06T 5/73 Deblurring; Sharpening
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00 Geometric image transformations in the plane of the image
    • G06T 3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T 3/4053 Scaling of whole images or parts thereof based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V 10/75 Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 Fusion of extracted features
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10004 Still image; Photographic image

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Processing (AREA)

Abstract

The invention relates to the technical field of image processing, and in particular to a double-branch multi-scale image defogging method based on a high-quality codebook. The method comprises: obtaining an original-image super-resolution reconstruction data set; training a VQGAN network model on this data set to obtain the codebook, the network structure of the VQ decoder, and their corresponding parameters; obtaining an original-image defogging data set; training a double-branch multi-scale image defogging network model on the defogging data set; and inputting a foggy image into the trained defogging network model to obtain a clear, fog-free image.

Description

Double-branch multi-scale image defogging method based on high-quality codebook
Technical Field
The invention relates to the technical field of computer image restoration, in particular to a double-branch multi-scale image defogging method based on a high-quality codebook.
Background
Mist is an aerosol system composed of a large number of tiny water droplets suspended in near-ground air, and is one of the main causes of image blurring, color distortion, and contrast reduction.
With extreme weather becoming more frequent in recent years, many cities are affected by fog on a seasonal, recurring basis. Fog reduces the accuracy of the image information collected by outdoor vision systems such as autonomous driving, video surveillance, military reconnaissance, and remote sensing. As fog thickens or becomes uneven, image quality deteriorates rapidly: color distortion, feature blurring, contrast reduction, and other degradations of visual quality make objects and backgrounds in the images unrecognizable, seriously affecting the execution of subsequent vision tasks such as semantic segmentation and object detection in computer vision. Images therefore need preprocessing to reduce the influence of fog on imaging quality.
The current mainstream deep-learning-based image defogging technical schemes include the following:
One line of work combines physical models and prior knowledge with deep learning, assisted by image enhancement preprocessing. Many studies adopt convolutional neural networks to estimate the parameters of the atmospheric scattering model; to avoid accumulated errors in the parameter-estimation process, end-to-end networks have been proposed that directly estimate and generate a fog-free image from the foggy input. The foggy training images are usually obtained by degrading clear fog-free images with the atmospheric scattering model, but that physical model cannot perfectly describe the formation process of all fog, so artificially synthesized images cannot fully substitute for real foggy images. Models trained this way generalize poorly and often fail when processing real images; as a result, these methods achieve excellent performance on synthetic data sets, but their performance on real data sets still needs improvement.
Disclosure of Invention
In order to solve the technical problems, the invention provides a double-branch multi-scale image defogging method based on a high-quality codebook, which comprises the following steps:
S1, acquiring an original image super-resolution reconstruction data set, wherein the original image super-resolution reconstruction data set comprises an original clear image;
S2, training a VQGAN network model by using the original image super-resolution reconstruction data set to obtain the codebook, the network structure of the VQ decoder, and their corresponding parameters;
the VQGAN network model comprises a VQ encoder, a codebook and a VQ decoder;
S3, acquiring an original image defogging data set, wherein the original image defogging data set comprises original foggy images and the corresponding clear fog-free images;
S4, training a double-branch multi-scale image defogging network model by using an original image defogging data set;
The dual-branch multi-scale image defogging network model is divided into a prior matching branch and a channel attention branch. The prior matching branch comprises a fixed-parameter VQGAN network model, a pyramid hole neighborhood attention encoder, and an enhancement decoder; its structure consists, in order, of a VQ encoder, a pyramid hole neighborhood attention encoder, a fixed-parameter codebook matching module, a fixed-parameter VQ decoder, and an enhancement decoder. The channel attention branch comprises a 3×3 convolution and 4 residual channel attention layers;
S5, inputting the foggy image into the trained dual-branch multi-scale image defogging network model to obtain a generated clear, fog-free image.
Compared with the prior art, the invention trains a discrete codebook on clear, haze-free images, encapsulating high-quality prior knowledge of the original images' colors and structures. A dual-branch neural network, comprising a prior matching branch and a channel attention branch, is then constructed: Transformer-based neighborhood attention and convolution-based channel attention extract the global features of hazy images and learn the complex interaction features between haze regions and the underlying scene, and a feature fusion module fuses the features extracted by the two branches. During the matching of the high-quality-prior-constrained codebook with the hazy-image features, a controllable distance-recalculation operation replaces the haze-affected regions of the image, and the original haze-free image is reconstructed, realizing an end-to-end image defogging process that improves the clarity and recognizability of hazy images.
Drawings
FIG. 1 is a flow chart of a dual-branch multi-scale image defogging method based on a high-quality codebook;
FIG. 2 is a schematic diagram of a dual-branch multi-scale defogging network structure based on a high quality codebook according to the present invention;
FIG. 3 is a schematic diagram of the structure of a VQ encoder and a VQ decoder according to the present invention;
FIG. 4 is a diagram of a pyramid hole neighborhood attention encoder of the present invention;
Fig. 5 is a block diagram of an enhancement decoder according to the present invention.
Detailed Description
The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, but not all embodiments, and all other embodiments obtained by those skilled in the art without making creative efforts based on the embodiments of the present invention are included in the protection scope of the present invention.
The invention provides a double-branch multi-scale image defogging method based on a high-quality codebook, as shown in FIG. 1; the whole model is trained in two stages;
In the first stage, VQGAN is used to pre-train a discrete codebook that reflects the detail-texture characteristics common to clear images. At this stage the network consists of a VQ encoder, a codebook, and a VQ decoder; the training objective is to obtain a codebook storing high-quality clear-image features, together with the corresponding VQ decoder;
S1, acquiring an original image super-resolution reconstruction data set;
The original image super-resolution reconstruction data set comprises an original clear image;
In a specific implementation, the data set is Flickr2K, which contains 1000 clear pictures covering people, buildings, animals, objects, and other content. A low-resolution subset downsampled 4× by the bicubic method is adopted, with a resolution of roughly 512×349; 800 pictures are used for training, 100 for validation, and 100 for testing;
S2, training a VQGAN network model by using the original image super-resolution reconstruction data set to obtain the codebook, the network structure of the VQ decoder, and their corresponding parameters;
Wherein, as shown in fig. 2, VQGAN network model includes VQ encoder, codebook, VQ decoder, wherein VQ encoder and VQ decoder structure is shown in fig. 3;
Preferably, training VQGAN the network model in S2 includes:
S21, inputting an original clear image x into the VQ encoder to obtain a latent feature map z;
Specifically, the VQ encoder and decoder adopt a UNet-based network architecture: the first half performs feature extraction and the second half performs upsampling. UNet has been shown to perform well in fields such as image classification and segmentation. Since some edge features are lost while the encoder downsamples and refines features, several residual structures are adopted, and the edge features are retrieved through feature concatenation;
S22, matching the latent feature map z to the nearest elements in the codebook to obtain the discrete feature map z_q passed through the codebook;
The codebook is a discrete codebook. It compresses the detail-texture features of an image and plays a key role in image reconstruction. The latest research result in the field of image generation, VQGAN, is adopted with certain modifications; for the specific technical details see Esser P., Rombach R., Ommer B., "Taming transformers for high-resolution image synthesis," IEEE Conference on Computer Vision and Pattern Recognition, 2021: 12873-12883. By training on fog-free images, the semantics contained in fog-free images are compressed, and a semantically rich codebook is built through self-supervised learning of the model; the compression rate is improved while good perceptual quality is kept. The codebook thus serves as prior knowledge of fog-free images and as a constraint condition when the subsequent defogging model generates a fog-free image;
The codebook is mathematically represented as a set of codes Z = {z_k}, k = 1, …, K. Given a high-quality image x as input to the VQ encoder E_vq, a latent feature map z is output, and then each pixel z_ij of z is matched to the nearest element in the codebook, thereby obtaining the discrete feature map z_q passed through the codebook. The discretized features are then input to the VQ decoder D_vq, which obtains the processed image y. The whole process can be expressed as follows:
z_ij = E_vq(x_ij)
z_q_ij = M(z_ij) = argmin_{z_k ∈ Z} ||z_ij − z_k||
y = D_vq(z_q)
where z_ij denotes the pixel at position ij of the latent feature map z; E_vq(·) denotes the VQ encoder; x_ij denotes the pixel at position ij of the high-quality image x; z_q denotes the discrete feature map passed through the codebook; M(·) denotes the codebook matching operation; ||z − z_k|| denotes the distance between the discretized feature z and the codebook code z_k; argmin(·) takes the minimizer of the distance; y_ij denotes the pixel at position ij of the generated image; and D_vq(·) denotes the VQ decoder;
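The encode, match, decode chain above hinges on the nearest-element matching step, the argmin over codebook codes. The following is a minimal numpy sketch of that step, not the patent's actual network; the array shapes and the function name are our own illustration:

```python
import numpy as np

def quantize(z, codebook):
    """Match each latent vector to its nearest codebook entry (L2 distance).

    z        : (H*W, d) latent feature vectors from the VQ encoder
    codebook : (K, d)   discrete codebook Z = {z_k}
    Returns the quantized features z_q and the chosen code indices.
    """
    # Pairwise squared distances: ||z - z_k||^2 = ||z||^2 - 2 z.z_k + ||z_k||^2
    d2 = (np.sum(z ** 2, axis=1, keepdims=True)
          - 2.0 * z @ codebook.T
          + np.sum(codebook ** 2, axis=1))
    idx = np.argmin(d2, axis=1)   # argmin_k ||z_ij - z_k||
    return codebook[idx], idx
```

The quantized map `z_q` is what the VQ decoder receives in place of the continuous encoder output.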
Notably, the discrete operation between the encoder and decoder is not differentiable, so gradients cannot back-propagate through it during training of the network; instead, the gradient is simply copied from the decoder to the encoder during back-propagation, so that the model can still be trained end-to-end through the loss function;
Through the discretization operation, the features of clear images are compressed into short vectors and stored in the codebook, so the discrete codebook compresses the detailed texture features of images and plays a key role in image reconstruction. A discriminator judges the authenticity of the reconstructed images to further assist network learning; training is complete when the discriminator can no longer distinguish real images from reconstructed ones;
S23, constructing a codebook discretization loss function from the latent feature map z and the discrete feature map z_q, and updating the parameters of the model's VQ encoder and codebook with minimization of the loss function as the optimization target;
Wherein the codebook discretization loss function is defined as follows:
L_Z = ||sg[z] − z_q||² + β ||z − sg[z_q]||²
where L_Z denotes the codebook discretization loss function; z_q denotes the discrete feature map; z denotes the latent feature map; β denotes a weight factor; and sg[·] denotes the stop-gradient operation. The loss function is mainly used to measure the discretization loss generated between the encoder output z and the discrete vector z_q;
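The two terms of this loss mirror the two gradient paths: the sg[z] term updates the codebook entries toward the encoder output, while the sg[z_q] "commitment" term pulls the encoder output toward its chosen code. A numpy sketch of the loss value follows (the stop-gradient only matters under autodiff; β = 0.25 is a conventional choice, not stated in the patent text):

```python
import numpy as np

def codebook_loss(z, z_q, beta=0.25):
    """Discretization loss L_Z = ||sg[z] - z_q||^2 + beta * ||z - sg[z_q]||^2.

    With plain numpy both terms evaluate to the same number; in a real
    framework sg[.] would be a detach(), routing gradients either to the
    codebook (first term) or to the encoder (second, commitment term).
    """
    codebook_term = np.mean((z - z_q) ** 2)   # would update codebook entries
    commit_term = np.mean((z - z_q) ** 2)     # would update the encoder
    return codebook_term + beta * commit_term
```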
S24, sending the discrete feature map z_q to the VQ decoder for decoding to obtain a reconstructed clear image y;
S25, constructing a loss function of the VQ decoder from the reconstructed clear image y and the original clear image x, and updating the parameters of the model's VQ decoder with minimization of the loss function as the optimization target;
Details of this step are given in S46;
In the second stage, a dual-branch multi-scale neural network is constructed. Transformer-based neighborhood attention and convolution-based channel attention are used to extract the global features of the foggy image and to learn the complex interaction features between fog regions and the underlying scene. The reconstruction of the fog-free image is assisted by the discrete codebook obtained from clear images in the first stage: specifically, a controllable distance-recalculation operation matches the corresponding discrete codes and replaces the fog-affected regions of the image, thereby achieving the defogging effect;
S3, acquiring an original image defogging data set;
The original image defogging data set comprises original foggy images and the corresponding clear fog-free images;
In practice, the data sets used are O-HAZE, I-HAZE, DENSE-HAZE, NH-HAZE-20, NH-HAZE-21, and NH-HAZE-23. O-HAZE and I-HAZE comprise 45 pairs of outdoor scenes and 35 pairs of indoor scenes, respectively, each pair consisting of a hazy image and the corresponding clear image. DENSE-HAZE is characterized by dense, uniform haze scenes: it contains 55 pairs of real dense-fog images and corresponding fog-free images of various outdoor scenes, with the dense fog generated by a professional haze machine; the objects originally present are nearly indistinguishable in the generated dense-fog images, so the defogging difficulty is much greater than on conventional data sets. The NH-HAZE data sets are sets of real non-uniform hazy images and corresponding fog-free image pairs, in which the non-uniform haze is introduced by a fog generator simulating real hazy-day conditions; they are divided by year into NH-HAZE-20, NH-HAZE-21, and NH-HAZE-23. The specific details of the data sets are shown in Table 1:
Table 1:
S4, training a double-branch multi-scale image defogging network model by using an original image defogging data set;
The defogging network model of the double-branch multi-scale image is divided into a priori matching branch and a channel attention branch;
The prior matching branch comprises a VQ encoder, a pyramid hole neighborhood attention encoder, a fixed parameter codebook matching module, a fixed parameter VQ decoder and an enhancement decoder, wherein the pyramid hole neighborhood attention encoder is shown in figure 4, and the enhancement decoder is shown in figure 5;
The channel attention branch includes a 3×3 convolution and 4 residual channel attention layers;
Finally, the results of the two branches are fused through a feature fusion structure;
Preferably, training the dual-branch multi-scale image defogging network model in S4 includes:
S41, inputting an original foggy image x into the VQ encoder to roughly extract features and obtain a preliminary feature F_1, then inputting the preliminary feature F_1 into the pyramid hole neighborhood attention encoder to obtain an advanced feature F_2;
The VQ encoder performs well when encoding clear images, but encodes dense or non-uniform fog poorly. This is mainly because the defogging task requires the encoder both to extract the general structural and texture features of the image and to distinguish the fog regions within it; the VQ encoder's network architecture is relatively shallow and cannot accomplish this task well;
In order to fully extract global features such as the texture and structure of a foggy image, the invention designs an encoder based on hole neighborhood attention in the prior matching branch. Hole (dilated) neighborhood attention is a self-attention variant from Vision Transformer research: an effective and scalable sliding-window visual attention mechanism whose downstream visual performance exceeds both Vision Transformer and Swin Transformer;
The neighborhood attention Transformer consists of a multi-layer perceptron (MLP), a normalization layer (LayerNorm, LN), residual connections, and multi-head neighborhood attention (NA). When the neighborhood size is at its minimum, each pixel attends only to the 1-pixel neighborhood around itself; when the neighborhood size reaches its maximum, the neighborhood attention output equals self-attention;
Compared with self-attention, neighborhood attention not only reduces computational cost but also introduces a convolution-like local inductive bias. Specifically, NA is a pixel-wise operation that localizes self-attention (SA) to the nearest neighboring pixels; neighborhood attention therefore has linear time and space complexity, compared with the quadratic complexity of self-attention;
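The linear-complexity claim is easy to see in code: each pixel's softmax runs over at most 2r+1 neighbors rather than all n positions. Below is a 1-D, single-head numpy sketch with identity Q/K/V projections, a deliberate simplification of the real multi-head NA layer:

```python
import numpy as np

def neighborhood_attention_1d(x, r=1):
    """Single-head neighborhood attention over a 1-D token sequence.

    Each token attends only to tokens within radius r of itself, so the
    cost is O(n * (2r+1)) rather than the O(n^2) of full self-attention.
    Q, K, V are identity projections here to keep the sketch short.
    """
    n, d = x.shape
    out = np.zeros_like(x)
    for i in range(n):
        lo, hi = max(0, i - r), min(n, i + r + 1)
        scores = x[i] @ x[lo:hi].T / np.sqrt(d)   # q.k / sqrt(d)
        w = np.exp(scores - scores.max())
        w /= w.sum()                              # softmax over the window
        out[i] = w @ x[lo:hi]                     # weighted sum of neighbors
    return out
```

With r large enough to cover the whole sequence, the output coincides with ordinary self-attention, matching the "maximum neighborhood" case described above.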
In the hole neighborhood attention encoder designed by the invention, four feature maps of different resolutions are obtained through a tokenizer and two downsampling steps. Using a pyramid structure, the feature information of each preceding layer is fed into the next layer through cascade operations, aggregating features of different levels and achieving feature reuse across scales;
Therefore, the invention adopts a neighborhood attention mechanism, focuses on the global features of the image, and utilizes high-quality prior to perform feature matching, thereby improving the generalization of the network;
First, the VQ encoder roughly extracts image position and structure information. An overlapping tokenizer (Overlapping Tokenizer) then serializes the shallow features into the input of a neighborhood attention Transformer block. A downsampler is connected after the second neighborhood attention Transformer block; it halves the spatial size and doubles the number of channels, thereby generating feature maps of different scales;
In order to fuse feature maps of different scales, the invention designs a pyramid-shaped feature aggregation scheme with several dense connection operations. Direct addition loses some original feature information during fusion, whereas the cascade operation is, strictly speaking, lossless, so dense connections are adopted in the progressive processing of the feature maps: through addition and cascade operations, the feature information of each preceding layer serves as input to the next layer. Features of different levels are thereby aggregated, feature reuse across scales is realized, and the global information in the features is optimized; multiple residual connections allow the fusion of features of different levels, so that the multi-scale features of the image's fog distribution are extracted, facilitating the subsequent high-quality prior matching;
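The cascade-style aggregation can be sketched as upsample-then-concatenate from coarse to fine, which, unlike addition, discards no original feature information. A numpy sketch follows; the nearest-neighbour upsampling and the two-level shapes are illustrative, not the patent's exact configuration:

```python
import numpy as np

def pyramid_aggregate(feats):
    """Cascade pyramid features from coarse to fine by concatenation.

    feats: list of (C_i, H_i, W_i) maps, finest first; the coarsest map
    is repeatedly upsampled and concatenated channel-wise into the next
    finer level, so every level's information reaches the output.
    """
    agg = feats[-1]                       # start from the coarsest map
    for f in reversed(feats[:-1]):
        c, h, w = f.shape
        sh, sw = h // agg.shape[1], w // agg.shape[2]
        up = agg.repeat(sh, axis=1).repeat(sw, axis=2)  # nearest-neighbour upsample
        agg = np.concatenate([f, up], axis=0)           # lossless channel cascade
    return agg
```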
S42, inputting the advanced feature F_2 into the fixed-parameter codebook matching module for matching to obtain a matched feature F_3;
In the process of reconstructing an image with the high-quality codebook, the discrete codes output by the encoder are difficult to match to the corresponding high-quality codes, mainly because the image is severely degraded; a foggy image may also suffer a domain-gap problem that makes the data distributions inconsistent. The distance between the encoder output and the codebook therefore needs adjustment during matching: a matching operation based on a controllable distance-recalculation method reduces the problems caused by the domain gap, achieving a better reconstruction effect;
The distance between each discrete code of the foggy image and every code in the codebook is calculated to find the codebook code with the minimum distance, and the finally calculated distance is adjusted through a weight function F, thereby obtaining the matching formula:
M(z) = argmin_{z_k ∈ Z} ||z − z_k|| · F(f_k, α)
F(f_k, α) = f_k × e^α
where M(z) denotes the matching process of the codebook matching module; F(·) denotes the weight function generated from the frequency difference; f_k denotes the difference in codebook-activation frequency between the foggy and clear images; α denotes the parameter adjusting the degree of defogging; ||z − z_k|| denotes the distance between the foggy-image discretized feature z and the codebook code z_k; and argmin(·) takes the minimizer of the distance.
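Read literally, the recalculation multiplies the raw distance ||z − z_k|| by the weight F(f_k, α) = f_k · e^α before taking the argmin; exactly how the weight combines with the distance is our reading of the text, so treat this numpy sketch as an assumption-laden illustration:

```python
import numpy as np

def recalc_match(z, codebook, f, alpha):
    """Controllable distance recalculation (one plausible reading):
    M(z) = argmin_k ||z - z_k|| * F(f_k, alpha), F(f_k, alpha) = f_k * e^alpha.

    z        : (d,)   one discretized feature of the foggy image
    codebook : (K, d) codebook codes z_k
    f        : (K,)   activation-frequency differences (hazy vs. clear)
    alpha    : scalar controlling the degree of defogging
    """
    dist = np.linalg.norm(z[None, :] - codebook, axis=1)  # ||z - z_k||
    weight = f * np.exp(alpha)                            # F(f_k, alpha)
    return int(np.argmin(dist * weight))
```

With uniform weights the operation reduces to plain nearest-code matching; non-uniform f_k biases the match toward codes that clear images activate.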
The matching formula requires the codebook codes' activation-frequency difference f_k and the parameter α. At the start of defogging-network training, the activation-frequency difference of every code in the codebook is set to 0; if the current foggy image does not match a codebook code while the clear image does, the frequency difference on that code is updated, and after many training iterations the network learns an optimal value. For the value of α, the difference between the codes of the encoder and of the clear fog-free image is represented as the difference between two probability distributions. The Kullback-Leibler divergence (KL divergence) is an index measuring the similarity of two probability distributions: the greater the similarity, the smaller the KL divergence. Let P_c be the probability distribution of codebook activations for clear images and P_h the corresponding distribution for foggy images; P_h can be adjusted through α, and the optimal α for the two distributions from different domains is the one that minimizes the KL divergence between them;
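The α-selection criterion can be illustrated with a direct KL-divergence computation between the two activation distributions. In practice P_c and P_h would come from counting codebook activations over clear and hazy images; here they are plain arrays:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL divergence D_KL(P || Q) = sum_k P(k) * log(P(k) / Q(k)).

    P plays the role of the clear images' codebook-activation
    distribution P_c, Q the hazy images' distribution P_h; alpha would
    be tuned so this value is smallest (distributions most similar).
    """
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()       # renormalize after smoothing
    return float(np.sum(p * np.log(p / q)))
```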
S43, inputting an original clear fog-free image into the VQGAN network model pre-trained in S2, passing sequentially through the VQ encoder and the codebook to obtain an intermediate feature F_4;
details of this step are given in S2;
S44, constructing an encoder loss function of the dual-branch multi-scale image defogging network model from the preliminary feature F_1, the advanced feature F_2, the matched feature F_3, and the intermediate feature F_4, and updating the parameters of the model's VQ encoder and pyramid hole neighborhood attention encoder with minimization of the loss function as the optimization target;
To help the encoder output match the correct high-quality codebook prior in a later step, the encoder output features must be purposefully made to follow a standard normal distribution consistent with the one used when training the high-quality prior;
Assume the foggy image input is x_h and the fog-free image input is x_gt; let the defogging-network encoder be E and the encoder used for codebook training be E_vq. We obtain the intermediate feature z_h = E(x_h) of the foggy image processed by encoder E, and the intermediate feature ẑ_h = E_vq(x_gt) of the fog-free image processed by encoder E_vq;
In controlling image generation, we also need to control the style difference between the generated image and the fog-free image, so the style loss is measured with ψ, i.e. the Gram matrix, and the discriminator D used when training the codebook judges whether the generated features are real. The final encoder loss is therefore:
L_VQ = ||z_h − ẑ_h||² + λ_style Σ_i ||ψ(z_h^i) − ψ(ẑ_h^i)||² − λ_adv E[log D(z_h)]
where L_VQ denotes the encoder loss function of the dual-branch multi-scale image defogging network model; z_h denotes the intermediate feature of the foggy image; ẑ_h denotes the intermediate feature of the fog-free image; z_h^i denotes the i-th intermediate feature of the foggy image; λ_style and λ_adv denote the first and second hyper-parameters adjusting the weights of the different losses; ψ(·) denotes the Gram matrix used to measure style loss; E[·] denotes the expectation; and D(·) denotes the discriminator.
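The style term ψ is the Gram matrix of a feature map. Below is a numpy sketch of the Gram computation and the resulting style loss; the 1/(C·H·W) normalization is a common convention and is our assumption, not specified in the text:

```python
import numpy as np

def gram(feat):
    """Gram matrix psi(F) = F F^T / (C*H*W) over a (C, H, W) feature map."""
    c, h, w = feat.shape
    f = feat.reshape(c, h * w)
    return f @ f.T / (c * h * w)

def style_loss(feat_a, feat_b):
    """Squared Frobenius distance between Gram matrices, measuring the
    style difference between two feature maps (e.g. hazy vs. haze-free)."""
    return float(np.sum((gram(feat_a) - gram(feat_b)) ** 2))
```

Because the Gram matrix correlates channels while discarding spatial layout, this term penalizes differences in texture statistics rather than pixel positions.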
S45, sending the feature F_3 sequentially to the fixed-parameter VQ decoder and the enhancement decoder for decoding to obtain an intermediate feature F_5;
Specifically, the output obtained by the VQ decoder tends to lack detail in regions of deeper haze, where the image structure and texture are blurrier. To improve the decoding of the detail features of a foggy image, the invention designs a multi-attention enhancement decoder in the prior matching branch, combining channel attention and pixel attention; finally, an enhancement block based on pyramid pooling ensures that feature details of different scales are embedded into the final result;
S46, inputting an original foggy image into the channel attention branch to obtain an intermediate feature F_6;
The channel attention branch is added to attend to non-uniform haze and dense-fog regions with obvious brightness changes, avoiding the over-enhancement problem and improving the overall reconstruction performance of images. The attention mechanism lets the network flexibly focus on haze features so as to reconstruct high-quality haze-free images; it can significantly raise the brightness of regions occluded by non-uniform haze and dense fog, and it further attends to the restoration of regions with obvious brightness changes such as sky and snow;
S47, adding the intermediate features F 5 and F 6 channel-wise, then passing the result through a feature fusion module to obtain the generated defogged image y;
The final feature fusion stage adopts a feature fusion module consisting of a reflection padding layer, a convolution layer and a Tanh activation function, which fuses the outputs of the two branches;
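The fusion steps above (channel-wise addition, reflection padding, convolution, Tanh) can be sketched in one dimension. The function names, the fixed kernel and the 1-D setting are illustrative assumptions, since the real module operates on 2-D feature maps with learned convolution weights:

```python
import math

def reflect_pad(signal, pad):
    """Reflection padding: mirror the signal at each border without
    repeating the edge sample (PyTorch ReflectionPad semantics)."""
    left = signal[1:pad + 1][::-1]
    right = signal[-pad - 1:-1][::-1]
    return left + signal + right

def fuse(branch_a, branch_b, kernel):
    """Channel-wise addition of two branch outputs, reflection
    padding, a fixed 'convolution', then Tanh squashing to [-1, 1]."""
    summed = [a + b for a, b in zip(branch_a, branch_b)]
    pad = len(kernel) // 2
    padded = reflect_pad(summed, pad)
    conv = [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(summed))]
    return [math.tanh(v) for v in conv]
```

Reflection padding avoids the dark border artifacts that zero padding would introduce at image edges, and Tanh maps the fused output into the normalized image range.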
S48, constructing the loss function of the remaining parts of the network according to the defogged clear image y and the original clear image x, and updating the corresponding parameters with minimization of the loss function as the optimization target;
At this stage it must be verified that the image finally generated by the whole network has correctly completed the defogging task. Note that, because the losses are calculated separately, the parameters of the encoder and decoder are not updated together, so the gradient at this stage is not back-propagated to the encoder;
The losses of the remaining parts of the network are calculated with the following combination of loss terms;
Smooth L1 loss:
The smooth L1 loss combines the advantages of the L1 loss (also called mean absolute error) and the L2 loss: it measures the average magnitude of the error between the predicted value and the true value, and its derivative is continuous at 0, so it is solved more efficiently and converges faster;
Wherein x i and y i respectively represent the i-th pixel of the clear image and of the hazy image, N is the total number of pixels, f θ (·) represents the defogging network, and f θ (x i) represents the i-th pixel of the image reconstructed by the defogging network;
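A minimal pure-Python sketch of the smooth L1 loss described above; the quadratic-to-linear threshold beta=1.0 is an assumption matching the common default, not a value stated in the patent:

```python
def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 (Huber-style) loss: quadratic near zero so the
    derivative is continuous at 0, linear for large errors."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        # Quadratic branch below beta, linear branch above it;
        # the two branches meet smoothly at d == beta.
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total / len(pred)
```

The linear branch keeps large outlier errors from dominating the gradient, while the quadratic branch gives stable updates near convergence.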
MS-SSIM loss:
The MS-SSIM loss is based on the assumption that the human eye extracts structural information from an image, and therefore provides a perceptual reference for image quality. Let O and G denote two windows centered on the i-th pixel of the defogged image and of the ground-truth image respectively; applying a Gaussian filter to each window yields the corresponding means (μ O, μ G), standard deviations σ O, σ G and covariance σ OG, so that the MS-SSIM loss can be expressed as follows:
Wherein C 1 and C 2 are two small constants that stabilize the division in the formula;
Perceptual loss:
The perceptual loss is measured with a VGG16 network pre-trained on the ImageNet dataset, a computer-vision dataset created by Professor Fei-Fei Li of Stanford University and colleagues that contains more than 14 million images in more than 20,000 categories; a VGG16 pre-trained on this dataset has strong perceptual features and helps the model reconstruct finer details;
Let x and y represent the hazy image and the clean image respectively, f θ (x) the image reconstructed by the defogging network, and Φ j (·) the feature map produced by the j-th layer of VGG16; the L 2 loss measures the distance between the feature maps of the reconstructed image and of the clean image, and N represents the number of feature maps used to calculate the perceptual loss;
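The distance computation of the perceptual loss can be sketched with flattened feature maps represented as plain lists; extracting the maps with VGG16 is omitted, and the function name is illustrative:

```python
def perceptual_loss(feats_pred, feats_clean):
    """Mean squared L2 distance between corresponding feature maps,
    averaged over the N maps used (the maps would come from selected
    VGG16 layers applied to the reconstructed and clean images)."""
    total = 0.0
    for fp, fc in zip(feats_pred, feats_clean):
        # Per-map mean squared error between flattened features.
        total += sum((a - b) ** 2 for a, b in zip(fp, fc)) / len(fp)
    return total / len(feats_pred)
```

Comparing deep features rather than raw pixels penalizes semantic and textural differences that pixel losses miss, which is why this term helps recover fine detail.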
Adversarial loss:
Since pixel-based loss functions cannot provide adequate supervision on small datasets, an adversarial loss is added to remedy this deficiency;
Wherein D represents the discriminator (Discriminator) employed when training the codebook, and M represents the number of samples;
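Since the exact adversarial formulation appears only as an equation image in the original, the sketch below uses the common non-saturating generator loss, the average of -log D(y) over M samples, purely as an illustrative assumption:

```python
import math

def adversarial_loss(d_scores_fake):
    """Non-saturating generator adversarial loss: -log D(y) averaged
    over M generated samples, where D(y) in (0, 1] is the
    discriminator's probability that the defogged image is real."""
    m = len(d_scores_fake)
    return -sum(math.log(s) for s in d_scores_fake) / m
```

The loss is 0 when the discriminator is fully fooled (scores of 1) and grows as the discriminator confidently rejects the generated images.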
Total loss:
The weights of the smooth l 1 loss, the MS-SSIM loss, the perceptual loss and the adversarial loss are initially set to 1, 0.5, 0.01 and 0.0005 respectively, and then adjusted according to the experimental results, so the total image reconstruction loss is expressed as follows:
L = L l1 + 0.5 L MS-SSIM + 0.01 L perc + 0.0005 L adv
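The stated weighting can be sketched directly; the helper name is illustrative:

```python
def total_loss(l1, ms_ssim, perc, adv,
               w=(1.0, 0.5, 0.01, 0.0005)):
    """Weighted sum of the four reconstruction losses with the
    initial weights stated above (later tuned experimentally)."""
    return w[0] * l1 + w[1] * ms_ssim + w[2] * perc + w[3] * adv
```

The small weights on the perceptual and adversarial terms keep them as regularizers, letting the pixel-level and structural terms dominate early training.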
S5, inputting the hazy image into the trained dual-branch multi-scale image defogging network model to obtain the generated clear haze-free image;
Preferably, during training the input image is randomly cropped to 256×256 and the dataset is augmented by scaling, random rotation and flipping; the Adam optimizer is used with the default β 1 and β 2 of 0.9 and 0.99 respectively, an initial learning rate of 0.0001 and a batch size of 1, and the model is implemented in PyTorch on an NVIDIA V100 Tensor Core GPU.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (7)

1. A dual-branch multi-scale image defogging method based on a high-quality codebook, characterized by comprising:
S1: obtaining an original image super-resolution reconstruction dataset, the dataset comprising original clear images;
S2: training a VQGAN network model with the original image super-resolution reconstruction dataset to obtain the codebook, the network structure of the VQ decoder and its corresponding parameters; the VQGAN network model comprises a VQ encoder, a codebook and a VQ decoder;
S3: obtaining an original image defogging dataset comprising original hazy images and the corresponding clear haze-free images;
S4: training a dual-branch multi-scale image defogging network model with the original image defogging dataset; the model is divided into a prior matching branch and a channel attention branch; the prior matching branch comprises the fixed-parameter VQGAN network model, a pyramid dilated neighborhood attention encoder and an enhancement decoder, its structure consisting, in order, of the VQ encoder, the pyramid dilated neighborhood attention encoder, a fixed-parameter codebook matching module, the fixed-parameter VQ decoder and the enhancement decoder; the channel attention branch comprises one 3×3 convolution and four residual channel attention layers; the results of the two branches are finally fused through a feature fusion structure;
the pyramid dilated neighborhood attention encoder comprises a serializer, neighborhood attention Transformer blocks and a downsampler, and its processing comprises the following steps:
Step 1: the input data passes through the serializer and two downsampling operations to obtain feature maps at four different resolutions; a pyramid structure aggregates features of different levels by cascading, feeding the feature information of each preceding layer as input to the next layer;
Step 2: multiple residual connections allow features of different levels to be fused, thereby extracting multi-scale features of the haze distribution in the image and obtaining the output features;
S5: inputting a hazy image into the trained dual-branch multi-scale image defogging network model to obtain a generated clear haze-free image.
2. The dual-branch multi-scale image defogging method based on a high-quality codebook according to claim 1, characterized in that training the VQGAN network model in step S2 comprises:
S21: inputting the original clear image x into the VQ encoder E vq, which is based on a UNet architecture, for extraction and sampling, concatenating edge features through multiple residual structures during sampling, and obtaining a latent feature map z;
S22: matching the latent feature map z to the nearest element in the codebook to obtain the discrete feature map z q;
S23: constructing a codebook discretization loss function from the latent feature map z and the discrete feature map z q, and updating the parameters of the VQ encoder and of the codebook with minimization of the loss function as the optimization target;
S24: feeding the discrete feature map z q into the VQ decoder for decoding to obtain the reconstructed clear image y;
S25: constructing the loss function of the VQ decoder from the reconstructed clear image y and the original clear image x, and updating the parameters of the VQ decoder with minimization of the loss function as the optimization target.
3. The dual-branch multi-scale image defogging method based on a high-quality codebook according to claim 1 or 2, characterized in that, in the mathematical expression of the codebook, Z represents the codebook, z k represents a codebook code, K represents the number of codebook codes, and the set notation represents the codebook code set.
4. The dual-branch multi-scale image defogging method based on a high-quality codebook according to claim 2, characterized in that, in the codebook discretization loss function, L Z represents the codebook discretization loss function, sg[ ] represents the stop-gradient operation, z q represents the discrete feature map, z represents the latent feature map, and β represents a weight factor.
5. The dual-branch multi-scale image defogging method based on a high-quality codebook according to claim 1, characterized in that training the dual-branch multi-scale image defogging network model in S4 comprises:
S41: inputting the original hazy image x into the VQ encoder to coarsely extract features and obtain a preliminary feature F 1, then inputting F 1 into the pyramid dilated neighborhood attention encoder to obtain a high-level feature F 2;
S42: inputting the high-level feature F 2 into the fixed-parameter codebook matching module for matching to obtain the matched feature F 3;
S43: inputting the original clear haze-free image into the VQGAN network model pre-trained in S2 and passing it through the VQ encoder and the codebook in turn to obtain an intermediate feature F 4;
S44: constructing the encoder loss function of the dual-branch multi-scale image defogging network model from the preliminary feature F 1, the high-level feature F 2, the matched feature F 3 and the intermediate feature F 4, and updating the parameters of the VQ encoder and of the pyramid dilated neighborhood attention encoder with minimization of the loss function as the optimization target;
S45: feeding the feature F 3 sequentially into the fixed-parameter VQ decoder and the enhancement decoder for decoding to obtain an intermediate feature F 5;
S46: inputting the original hazy image into the channel attention branch to obtain an intermediate feature F 6;
S47: adding the intermediate features F 5 and F 6 channel-wise, then passing the result through the feature fusion module to obtain the generated defogged image y;
S48: constructing the loss function of the remaining parts of the dual-branch multi-scale image defogging network model (comprising the smooth l 1 loss, the MS-SSIM loss, the perceptual loss and the adversarial loss) from the defogged clear image y and the original clear image x, and updating the parameters of the remaining parts with minimization of the loss function as the optimization target; the parameters of the remaining parts of the model comprise the parameters of the VQ decoder, the enhancement decoder and the codebook matching module.
6. The dual-branch multi-scale image defogging method based on a high-quality codebook according to claim 1, characterized in that the matching process of the codebook matching module, with F(f k, α) = f k × e^α, satisfies: M(z) represents the matching process of the codebook matching module, F( ) represents the weight function generated from the frequency difference, f k represents the frequency difference between the hazy image and the clear image over the codebook activations, α represents a parameter used to adjust the degree of defogging, ||z - z k|| represents the distance between the discretized feature z of the hazy image and the codebook code z k, and argmin( ) represents the function taking the minimum of that distance.
7. The dual-branch multi-scale image defogging method based on a high-quality codebook according to claim 1, characterized in that, in the encoder loss function of the dual-branch multi-scale image defogging network model, L VQ represents the encoder loss function, z h represents the intermediate features of the hazy image, the hatted feature represents the intermediate features of the haze-free image, λ style and λ adv represent the first and second hyper-parameters used to adjust the weights of the different losses, Ψ( ) represents the matrix used to measure the style loss, E[ ] represents the encoder, D( ) represents the discriminator, and the indexed term represents the i-th intermediate feature of the hazy image.
CN202411445292.XA 2024-10-16 2024-10-16 Double-branch multi-scale image defogging method based on high-quality codebook Active CN119399068B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411445292.XA CN119399068B (en) 2024-10-16 2024-10-16 Double-branch multi-scale image defogging method based on high-quality codebook


Publications (2)

Publication Number Publication Date
CN119399068A CN119399068A (en) 2025-02-07
CN119399068B true CN119399068B (en) 2025-11-07

Family

ID=94419698

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411445292.XA Active CN119399068B (en) 2024-10-16 2024-10-16 Double-branch multi-scale image defogging method based on high-quality codebook

Country Status (1)

Country Link
CN (1) CN119399068B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0739140A2 (en) * 1995-04-18 1996-10-23 Sun Microsystems, Inc. Encoder for an end-to-end scalable video delivery system
CN115689932A (en) * 2022-11-09 2023-02-03 重庆邮电大学 Image defogging method based on deep neural network

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US12511723B2 (en) * 2023-02-28 2025-12-30 Nanjing University Of Posts And Telecommunications Single image dehazing method based on detail recovery
CN117788450A (en) * 2023-12-29 2024-03-29 山东省计算中心(国家超级计算济南中心) Remote sensing image change detection method and device based on Transformer and DCN
CN118411440A (en) * 2024-04-24 2024-07-30 湖州师范学院 Remote sensing image reconstruction method based on remote sensing image compression network


Also Published As

Publication number Publication date
CN119399068A (en) 2025-02-07

Similar Documents

Publication Publication Date Title
CN113822969B (en) Training neural radiation field model, face generation method, device and server
CN118314353B (en) Remote sensing image segmentation method based on double-branch multi-scale feature fusion
Aakerberg et al. Semantic segmentation guided real-world super-resolution
CN114913599B (en) Video abnormal behavior detection method and system based on automatic encoder
CN114170088A (en) Relational reinforcement learning system and method based on graph structure data
CN115731597B (en) Automatic segmentation and restoration management platform and method for mask image of face mask
CN113808031A (en) Image restoration method based on LSK-FNet model
CN116137043B (en) A colorization method for infrared images based on convolution and Transformer
CN117710671A (en) Medical image segmentation method based on segmentation large model fine adjustment
CN114943894A (en) ConvCRF-based high-resolution remote sensing image building extraction optimization method
CN119992550B (en) Image segmentation method, model, model training method and image segmentation system
CN117011701B (en) A remote sensing image feature extraction method based on hierarchical feature autonomous learning
CN117315543B (en) A semi-supervised video target segmentation method based on confidence-gated spatiotemporal memory networks
CN112633234A (en) Method, device, equipment and medium for training and applying face glasses-removing model
CN116630369A (en) UAV target tracking method based on spatio-temporal memory network
Wang et al. Self-prior guided pixel adversarial networks for blind image inpainting
CN117252757A (en) Hyperspectral image super-resolution method and system based on natural image prior
CN117876679A (en) A remote sensing image scene segmentation method based on convolutional neural network
CN116704585A (en) A Face Recognition Method Based on Quality Perception
CN119399068B (en) Double-branch multi-scale image defogging method based on high-quality codebook
CN119625308A (en) A semi-supervised video object segmentation method and system based on global and local feature fusion
CN119963957A (en) A multimodal image fusion method based on SwinTransformer
CN117557473B (en) A knowledge-guided multi-sensory attention network image dehazing method
CN114764880B (en) Remote sensing image scene classification method based on multi-component GAN reconstruction
CN117292299A (en) A video anomaly detection method based on optical flow decomposition and spatiotemporal feature learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant