
CN119399068B - Double-branch multi-scale image defogging method based on high-quality codebook - Google Patents

Double-branch multi-scale image defogging method based on high-quality codebook

Info

Publication number
CN119399068B
CN119399068B (application CN202411445292.XA)
Authority
CN
China
Prior art keywords
image
codebook
encoder
branch
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202411445292.XA
Other languages
Chinese (zh)
Other versions
CN119399068A (en)
Inventor
尹学辉
武沛鑫
李泽宏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202411445292.XA
Publication of CN119399068A
Application granted
Publication of CN119399068B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00 Image enhancement or restoration
    • G06T 5/73 Deblurring; Sharpening
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00 Geometric image transformations in the plane of the image
    • G06T 3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T 3/4053 Scaling of whole images or parts thereof based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V 10/75 Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 Fusion of extracted features
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10004 Still image; Photographic image

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Processing (AREA)

Abstract

The invention relates to the technical field of image processing, and in particular to a double-branch multi-scale image defogging method based on a high-quality codebook. The method comprises: obtaining an original-image super-resolution reconstruction data set; training a VQGAN network model on this data set to obtain the codebook, the network structure of the VQ decoder, and their corresponding parameters; obtaining an original-image defogging data set; training a double-branch multi-scale image defogging network model on the defogging data set; and inputting a foggy image into the trained defogging network model to obtain a clear, fog-free image.

Description

Double-branch multi-scale image defogging method based on high-quality codebook
Technical Field
The invention relates to the technical field of computer image restoration, in particular to a double-branch multi-scale image defogging method based on a high-quality codebook.
Background
Mist is an aerosol system composed of a large number of tiny water droplets suspended in near-ground air, and is one of the main causes of image blurring, color distortion, and contrast reduction.
With extreme weather becoming more frequent in recent years, many cities are affected by fog on a seasonal, recurring basis. Fog reduces the accuracy of the image information collected by outdoor vision systems such as autonomous driving, video surveillance, military reconnaissance, and remote sensing. As fog thickens or becomes uneven, image quality deteriorates rapidly: color distortion, feature blurring, contrast reduction, and other degradations of visual quality make objects and backgrounds in the images unrecognizable, seriously affecting the execution of subsequent vision tasks such as semantic segmentation and object detection in computer vision. Images therefore need preprocessing to reduce the influence of fog on imaging quality.
The current mainstream deep-learning-based image defogging technical schemes include the following:
One line of work combines physical models and prior knowledge with deep learning, assisted by image enhancement preprocessing. Many studies adopt convolutional neural networks to estimate the parameters of the atmospheric scattering model; to avoid accumulated errors in the parameter-estimation process, end-to-end networks have been proposed that directly estimate and generate a fog-free image from the foggy input. The foggy training images are usually obtained by degrading clear fog-free images with the atmospheric scattering model, but that physical model cannot perfectly describe the formation process of all fog, so artificially synthesized images cannot fully substitute for real foggy images. Models trained this way generalize poorly and often fail when processing real images; as a result, these methods achieve excellent performance on synthetic data sets, but their performance on real data sets still needs improvement.
Disclosure of Invention
In order to solve the technical problems, the invention provides a double-branch multi-scale image defogging method based on a high-quality codebook, which comprises the following steps:
S1, acquiring an original image super-resolution reconstruction data set, wherein the original image super-resolution reconstruction data set comprises an original clear image;
S2, training a VQGAN network model by using the original image super-resolution reconstruction data set to obtain the codebook, the network structure of the VQ decoder, and their corresponding parameters;
the VQGAN network model comprises a VQ encoder, a codebook and a VQ decoder;
S3, acquiring an original image defogging data set, wherein the original image defogging data set comprises original foggy images and the corresponding clear fog-free images;
S4, training a double-branch multi-scale image defogging network model by using an original image defogging data set;
The dual-branch multi-scale image defogging network model is divided into a prior matching branch and a channel attention branch. The prior matching branch comprises a fixed-parameter VQGAN network model, a pyramid hole neighborhood attention encoder, and an enhancement decoder; its structure consists, in order, of a VQ encoder, a pyramid hole neighborhood attention encoder, a fixed-parameter codebook matching module, a fixed-parameter VQ decoder, and an enhancement decoder. The channel attention branch comprises a 3×3 convolution and 4 residual channel attention layers;
S5, inputting the foggy image into the trained dual-branch multi-scale image defogging network model to obtain a generated clear, fog-free image.
Compared with the prior art, the invention trains a discrete codebook on clear, haze-free images, encapsulating high-quality prior knowledge of the original images' colors and structures. A dual-branch neural network, comprising a prior matching branch and a channel attention branch, is then constructed: Transformer-based neighborhood attention and convolution-based channel attention extract the global features of hazy images and learn the complex interaction features between haze regions and the underlying scene, and a feature fusion module fuses the features extracted by the two branches. During the matching of the high-quality-prior-constrained codebook with the hazy-image features, a controllable distance-recalculation operation replaces the haze-affected regions of the image, and the original haze-free image is reconstructed, realizing an end-to-end image defogging process that improves the clarity and recognizability of hazy images.
Drawings
FIG. 1 is a flow chart of a dual-branch multi-scale image defogging method based on a high-quality codebook;
FIG. 2 is a schematic diagram of a dual-branch multi-scale defogging network structure based on a high quality codebook according to the present invention;
FIG. 3 is a schematic diagram of the structure of a VQ encoder and a VQ decoder according to the present invention;
FIG. 4 is a diagram of a pyramid hole neighborhood attention encoder of the present invention;
Fig. 5 is a block diagram of an enhancement decoder according to the present invention.
Detailed Description
The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, but not all embodiments, and all other embodiments obtained by those skilled in the art without making creative efforts based on the embodiments of the present invention are included in the protection scope of the present invention.
The invention provides a double-branch multi-scale image defogging method based on a high-quality codebook, as shown in FIG. 1; the whole model is trained in two stages;
In the first stage, VQGAN is used to pre-train a discrete codebook that reflects the detail-texture characteristics common to clear images. At this stage the network consists of a VQ encoder, a codebook, and a VQ decoder; the training objective is to obtain a codebook storing high-quality clear-image features, together with the corresponding VQ decoder;
S1, acquiring an original image super-resolution reconstruction data set;
The original image super-resolution reconstruction data set comprises an original clear image;
In a specific implementation, the data set is Flickr2K, which contains 1000 clear pictures covering people, buildings, animals, objects, and other content. A low-resolution subset downsampled 4× by the bicubic method is adopted, with a resolution of roughly 512×349; 800 pictures are used for training, 100 for validation, and 100 for testing;
S2, training a VQGAN network model by using the original image super-resolution reconstruction data set to obtain the codebook, the network structure of the VQ decoder, and their corresponding parameters;
Wherein, as shown in fig. 2, VQGAN network model includes VQ encoder, codebook, VQ decoder, wherein VQ encoder and VQ decoder structure is shown in fig. 3;
Preferably, training VQGAN the network model in S2 includes:
S21, inputting an original clear image x into the VQ encoder to obtain a latent feature map z;
Specifically, the VQ encoder and decoder adopt a UNet-based network architecture: the first half performs feature extraction and the second half performs upsampling. UNet has been shown to perform well in fields such as image classification and segmentation. Since some edge features are lost while the encoder downsamples and refines features, several residual structures are adopted, and the edge features are retrieved through feature concatenation;
S22, matching the latent feature map z to the nearest elements in the codebook to obtain the discrete feature map z_q passed through the codebook;
The codebook is a discrete codebook. It compresses the detail-texture features of an image and plays a key role in image reconstruction. The latest research result in the field of image generation, VQGAN, is adopted with certain modifications; for the specific technical details see Esser P., Rombach R., Ommer B., "Taming transformers for high-resolution image synthesis," IEEE Conference on Computer Vision and Pattern Recognition, 2021: 12873-12883. By training on fog-free images, the semantics contained in fog-free images are compressed, and a semantically rich codebook is built through self-supervised learning of the model; the compression rate is improved while good perceptual quality is kept. The codebook thus serves as prior knowledge of fog-free images and as a constraint condition when the subsequent defogging model generates a fog-free image;
The codebook is mathematically represented as a set of codes Z = {z_k}, k = 1, …, K. Given a high-quality image x as input to the VQ encoder E_vq, a latent feature map z is output, and then each pixel z_ij of z is matched to the nearest element in the codebook, thereby obtaining the discrete feature map z_q passed through the codebook. The discretized features are then input to the VQ decoder D_vq, which obtains the processed image y. The whole process can be expressed as follows:
z_ij = E_vq(x_ij)
z_q_ij = M(z_ij) = argmin_{z_k ∈ Z} ||z_ij − z_k||
y = D_vq(z_q)
where z_ij denotes the pixel at position ij of the latent feature map z; E_vq(·) denotes the VQ encoder; x_ij denotes the pixel at position ij of the high-quality image x; z_q denotes the discrete feature map passed through the codebook; M(·) denotes the codebook matching operation; ||z − z_k|| denotes the distance between the discretized feature z and the codebook code z_k; argmin(·) takes the minimizer of the distance; y_ij denotes the pixel at position ij of the generated image; and D_vq(·) denotes the VQ decoder;
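The encode, match, decode chain above hinges on the nearest-element matching step, the argmin over codebook codes. The following is a minimal numpy sketch of that step, not the patent's actual network; the array shapes and the function name are our own illustration:

```python
import numpy as np

def quantize(z, codebook):
    """Match each latent vector to its nearest codebook entry (L2 distance).

    z        : (H*W, d) latent feature vectors from the VQ encoder
    codebook : (K, d)   discrete codebook Z = {z_k}
    Returns the quantized features z_q and the chosen code indices.
    """
    # Pairwise squared distances: ||z - z_k||^2 = ||z||^2 - 2 z.z_k + ||z_k||^2
    d2 = (np.sum(z ** 2, axis=1, keepdims=True)
          - 2.0 * z @ codebook.T
          + np.sum(codebook ** 2, axis=1))
    idx = np.argmin(d2, axis=1)   # argmin_k ||z_ij - z_k||
    return codebook[idx], idx
```

The quantized map `z_q` is what the VQ decoder receives in place of the continuous encoder output.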
Notably, the discrete operation between the encoder and decoder is not differentiable, so gradients cannot back-propagate through it during training of the network; instead, the gradient is simply copied from the decoder to the encoder during back-propagation, so that the model can still be trained end-to-end through the loss function;
Through the discretization operation, the features of clear images are compressed into short vectors and stored in the codebook, so the discrete codebook compresses the detailed texture features of images and plays a key role in image reconstruction. A discriminator judges the authenticity of the reconstructed images to further assist network learning; training is complete when the discriminator can no longer distinguish real images from reconstructed ones;
S23, constructing a codebook discretization loss function from the latent feature map z and the discrete feature map z_q, and updating the parameters of the model's VQ encoder and codebook with minimization of the loss function as the optimization target;
Wherein the codebook discretization loss function is defined as follows:
L_Z = ||sg[z] − z_q||² + β ||z − sg[z_q]||²
where L_Z denotes the codebook discretization loss function; z_q denotes the discrete feature map; z denotes the latent feature map; β denotes a weight factor; and sg[·] denotes the stop-gradient operation. The loss function is mainly used to measure the discretization loss generated between the encoder output z and the discrete vector z_q;
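The two terms of this loss mirror the two gradient paths: the sg[z] term updates the codebook entries toward the encoder output, while the sg[z_q] "commitment" term pulls the encoder output toward its chosen code. A numpy sketch of the loss value follows (the stop-gradient only matters under autodiff; β = 0.25 is a conventional choice, not stated in the patent text):

```python
import numpy as np

def codebook_loss(z, z_q, beta=0.25):
    """Discretization loss L_Z = ||sg[z] - z_q||^2 + beta * ||z - sg[z_q]||^2.

    With plain numpy both terms evaluate to the same number; in a real
    framework sg[.] would be a detach(), routing gradients either to the
    codebook (first term) or to the encoder (second, commitment term).
    """
    codebook_term = np.mean((z - z_q) ** 2)   # would update codebook entries
    commit_term = np.mean((z - z_q) ** 2)     # would update the encoder
    return codebook_term + beta * commit_term
```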
S24, sending the discrete feature map z_q to the VQ decoder for decoding to obtain a reconstructed clear image y;
S25, constructing a loss function of the VQ decoder from the reconstructed clear image y and the original clear image x, and updating the parameters of the model's VQ decoder with minimization of the loss function as the optimization target;
Details of this step are given in S46;
In the second stage, a dual-branch multi-scale neural network is constructed. Transformer-based neighborhood attention and convolution-based channel attention are used to extract the global features of the foggy image and to learn the complex interaction features between fog regions and the underlying scene. The reconstruction of the fog-free image is assisted by the discrete codebook obtained from clear images in the first stage: specifically, a controllable distance-recalculation operation matches the corresponding discrete codes and replaces the fog-affected regions of the image, thereby achieving the defogging effect;
S3, acquiring an original image defogging data set;
The original image defogging data set comprises original foggy images and the corresponding clear fog-free images;
In practice, the data sets used are O-HAZE, I-HAZE, DENSE-HAZE, NH-HAZE-20, NH-HAZE-21, and NH-HAZE-23. O-HAZE and I-HAZE comprise 45 pairs of outdoor scenes and 35 pairs of indoor scenes, respectively, each pair consisting of a hazy image and the corresponding clear image. DENSE-HAZE is characterized by dense, uniform haze scenes: it contains 55 pairs of real dense-fog images and corresponding fog-free images of various outdoor scenes, with the dense fog generated by a professional haze machine; the objects originally present are nearly indistinguishable in the generated dense-fog images, so the defogging difficulty is much greater than on conventional data sets. The NH-HAZE data sets are sets of real non-uniform hazy images and corresponding fog-free image pairs, in which the non-uniform haze is introduced by a fog generator simulating real hazy-day conditions; they are divided by year into NH-HAZE-20, NH-HAZE-21, and NH-HAZE-23. The specific details of the data sets are shown in Table 1:
Table 1:
S4, training a double-branch multi-scale image defogging network model by using an original image defogging data set;
The defogging network model of the double-branch multi-scale image is divided into a priori matching branch and a channel attention branch;
The prior matching branch comprises a VQ encoder, a pyramid hole neighborhood attention encoder, a fixed parameter codebook matching module, a fixed parameter VQ decoder and an enhancement decoder, wherein the pyramid hole neighborhood attention encoder is shown in figure 4, and the enhancement decoder is shown in figure 5;
The channel attention branch includes a 3×3 convolution and 4 residual channel attention layers;
Finally, the results of the two branches are fused through a feature fusion structure;
Preferably, training the dual-branch multi-scale image defogging network model in S4 includes:
S41, inputting an original foggy image x into the VQ encoder to roughly extract features and obtain a preliminary feature F_1, then inputting the preliminary feature F_1 into the pyramid hole neighborhood attention encoder to obtain an advanced feature F_2;
The VQ encoder performs well when encoding clear images, but encodes dense or non-uniform fog poorly. This is mainly because the defogging task requires the encoder both to extract the general structural and texture features of the image and to distinguish the fog regions within it; the VQ encoder's network architecture is relatively shallow and cannot accomplish this task well;
In order to fully extract global features such as the texture and structure of a foggy image, the invention designs an encoder based on hole neighborhood attention in the prior matching branch. Hole (dilated) neighborhood attention is a self-attention variant from Vision Transformer research: an effective and scalable sliding-window visual attention mechanism whose downstream visual performance exceeds both Vision Transformer and Swin Transformer;
The neighborhood attention Transformer consists of a multi-layer perceptron (MLP), a normalization layer (LayerNorm, LN), residual connections, and multi-head neighborhood attention (NA). When the neighborhood size is at its minimum, each pixel attends only to the 1-pixel neighborhood around itself; when the neighborhood size reaches its maximum, the neighborhood attention output equals self-attention;
Compared with self-attention, neighborhood attention not only reduces computational cost but also introduces a convolution-like local inductive bias. Specifically, NA is a pixel-wise operation that localizes self-attention (SA) to the nearest neighboring pixels; neighborhood attention therefore has linear time and space complexity, compared with the quadratic complexity of self-attention;
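The linear-complexity claim is easy to see in code: each pixel's softmax runs over at most 2r+1 neighbors rather than all n positions. Below is a 1-D, single-head numpy sketch with identity Q/K/V projections, a deliberate simplification of the real multi-head NA layer:

```python
import numpy as np

def neighborhood_attention_1d(x, r=1):
    """Single-head neighborhood attention over a 1-D token sequence.

    Each token attends only to tokens within radius r of itself, so the
    cost is O(n * (2r+1)) rather than the O(n^2) of full self-attention.
    Q, K, V are identity projections here to keep the sketch short.
    """
    n, d = x.shape
    out = np.zeros_like(x)
    for i in range(n):
        lo, hi = max(0, i - r), min(n, i + r + 1)
        scores = x[i] @ x[lo:hi].T / np.sqrt(d)   # q.k / sqrt(d)
        w = np.exp(scores - scores.max())
        w /= w.sum()                              # softmax over the window
        out[i] = w @ x[lo:hi]                     # weighted sum of neighbors
    return out
```

With r large enough to cover the whole sequence, the output coincides with ordinary self-attention, matching the "maximum neighborhood" case described above.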
In the hole neighborhood attention encoder designed by the invention, four feature maps of different resolutions are obtained through a tokenizer and two downsampling steps. Using a pyramid structure, the feature information of each preceding layer is fed into the next layer through cascade operations, aggregating features of different levels and achieving feature reuse across scales;
Therefore, the invention adopts a neighborhood attention mechanism, focuses on the global features of the image, and utilizes high-quality prior to perform feature matching, thereby improving the generalization of the network;
First, the VQ encoder roughly extracts image position and structure information. An overlapping tokenizer (Overlapping Tokenizer) then serializes the shallow features into the input of a neighborhood attention Transformer block. A downsampler is connected after the second neighborhood attention Transformer block; it halves the spatial size and doubles the number of channels, thereby generating feature maps of different scales;
In order to fuse feature maps of different scales, the invention designs a pyramid-shaped feature aggregation scheme with several dense connection operations. Direct addition loses some original feature information during fusion, whereas the cascade operation is, strictly speaking, lossless, so dense connections are adopted in the progressive processing of the feature maps: through addition and cascade operations, the feature information of each preceding layer serves as input to the next layer. Features of different levels are thereby aggregated, feature reuse across scales is realized, and the global information in the features is optimized; multiple residual connections allow the fusion of features of different levels, so that the multi-scale features of the image's fog distribution are extracted, facilitating the subsequent high-quality prior matching;
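The cascade-style aggregation can be sketched as upsample-then-concatenate from coarse to fine, which, unlike addition, discards no original feature information. A numpy sketch follows; the nearest-neighbour upsampling and the two-level shapes are illustrative, not the patent's exact configuration:

```python
import numpy as np

def pyramid_aggregate(feats):
    """Cascade pyramid features from coarse to fine by concatenation.

    feats: list of (C_i, H_i, W_i) maps, finest first; the coarsest map
    is repeatedly upsampled and concatenated channel-wise into the next
    finer level, so every level's information reaches the output.
    """
    agg = feats[-1]                       # start from the coarsest map
    for f in reversed(feats[:-1]):
        c, h, w = f.shape
        sh, sw = h // agg.shape[1], w // agg.shape[2]
        up = agg.repeat(sh, axis=1).repeat(sw, axis=2)  # nearest-neighbour upsample
        agg = np.concatenate([f, up], axis=0)           # lossless channel cascade
    return agg
```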
S42, inputting the advanced feature F_2 into the fixed-parameter codebook matching module for matching to obtain a matched feature F_3;
In the process of reconstructing an image with the high-quality codebook, the discrete codes output by the encoder are difficult to match to the corresponding high-quality codes, mainly because the image is severely degraded; a foggy image may also suffer a domain-gap problem that makes the data distributions inconsistent. The distance between the encoder output and the codebook therefore needs adjustment during matching: a matching operation based on a controllable distance-recalculation method reduces the problems caused by the domain gap, achieving a better reconstruction effect;
The distance between each discrete code of the foggy image and every code in the codebook is calculated to find the codebook code with the minimum distance, and the finally calculated distance is adjusted through a weight function F, thereby obtaining the matching formula:
M(z) = argmin_{z_k ∈ Z} ||z − z_k|| · F(f_k, α)
F(f_k, α) = f_k × e^α
where M(z) denotes the matching process of the codebook matching module; F(·) denotes the weight function generated from the frequency difference; f_k denotes the difference in codebook-activation frequency between the foggy and clear images; α denotes the parameter adjusting the degree of defogging; ||z − z_k|| denotes the distance between the foggy-image discretized feature z and the codebook code z_k; and argmin(·) takes the minimizer of the distance.
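Read literally, the recalculation multiplies the raw distance ||z − z_k|| by the weight F(f_k, α) = f_k · e^α before taking the argmin; exactly how the weight combines with the distance is our reading of the text, so treat this numpy sketch as an assumption-laden illustration:

```python
import numpy as np

def recalc_match(z, codebook, f, alpha):
    """Controllable distance recalculation (one plausible reading):
    M(z) = argmin_k ||z - z_k|| * F(f_k, alpha), F(f_k, alpha) = f_k * e^alpha.

    z        : (d,)   one discretized feature of the foggy image
    codebook : (K, d) codebook codes z_k
    f        : (K,)   activation-frequency differences (hazy vs. clear)
    alpha    : scalar controlling the degree of defogging
    """
    dist = np.linalg.norm(z[None, :] - codebook, axis=1)  # ||z - z_k||
    weight = f * np.exp(alpha)                            # F(f_k, alpha)
    return int(np.argmin(dist * weight))
```

With uniform weights the operation reduces to plain nearest-code matching; non-uniform f_k biases the match toward codes that clear images activate.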
The matching formula requires the codebook codes' activation-frequency difference f_k and the parameter α. At the start of defogging-network training, the activation-frequency difference of every code in the codebook is set to 0; if the current foggy image does not match a codebook code while the clear image does, the frequency difference on that code is updated, and after many training iterations the network learns an optimal value. For the value of α, the difference between the codes of the encoder and of the clear fog-free image is represented as the difference between two probability distributions. The Kullback-Leibler divergence (KL divergence) is an index measuring the similarity of two probability distributions: the greater the similarity, the smaller the KL divergence. Let P_c be the probability distribution of codebook activations for clear images and P_h the corresponding distribution for foggy images; P_h can be adjusted through α, and the optimal α for the two distributions from different domains is the one that minimizes the KL divergence between them;
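The α-selection criterion can be illustrated with a direct KL-divergence computation between the two activation distributions. In practice P_c and P_h would come from counting codebook activations over clear and hazy images; here they are plain arrays:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL divergence D_KL(P || Q) = sum_k P(k) * log(P(k) / Q(k)).

    P plays the role of the clear images' codebook-activation
    distribution P_c, Q the hazy images' distribution P_h; alpha would
    be tuned so this value is smallest (distributions most similar).
    """
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()       # renormalize after smoothing
    return float(np.sum(p * np.log(p / q)))
```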
S43, inputting an original clear fog-free image into the VQGAN network model pre-trained in S2, passing sequentially through the VQ encoder and the codebook to obtain an intermediate feature F_4;
details of this step are given in S2;
S44, constructing an encoder loss function of the dual-branch multi-scale image defogging network model from the preliminary feature F_1, the advanced feature F_2, the matched feature F_3, and the intermediate feature F_4, and updating the parameters of the model's VQ encoder and pyramid hole neighborhood attention encoder with minimization of the loss function as the optimization target;
To help the encoder output match the correct high-quality codebook prior in a later step, the encoder output features must be purposefully made to follow a standard normal distribution consistent with the one used when training the high-quality prior;
Assume the foggy image input is x_h and the fog-free image input is x_gt; let the defogging-network encoder be E and the encoder used for codebook training be E_vq. We obtain the intermediate feature z_h = E(x_h) of the foggy image processed by encoder E, and the intermediate feature ẑ_h = E_vq(x_gt) of the fog-free image processed by encoder E_vq;
In controlling image generation, we also need to control the style difference between the generated image and the fog-free image, so the style loss is measured with ψ, i.e. the Gram matrix, and the discriminator D used when training the codebook judges whether the generated features are real. The final encoder loss is therefore:
L_VQ = ||z_h − ẑ_h||² + λ_style Σ_i ||ψ(z_h^i) − ψ(ẑ_h^i)||² − λ_adv E[log D(z_h)]
where L_VQ denotes the encoder loss function of the dual-branch multi-scale image defogging network model; z_h denotes the intermediate feature of the foggy image; ẑ_h denotes the intermediate feature of the fog-free image; z_h^i denotes the i-th intermediate feature of the foggy image; λ_style and λ_adv denote the first and second hyper-parameters adjusting the weights of the different losses; ψ(·) denotes the Gram matrix used to measure style loss; E[·] denotes the expectation; and D(·) denotes the discriminator.
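The style term ψ is the Gram matrix of a feature map. Below is a numpy sketch of the Gram computation and the resulting style loss; the 1/(C·H·W) normalization is a common convention and is our assumption, not specified in the text:

```python
import numpy as np

def gram(feat):
    """Gram matrix psi(F) = F F^T / (C*H*W) over a (C, H, W) feature map."""
    c, h, w = feat.shape
    f = feat.reshape(c, h * w)
    return f @ f.T / (c * h * w)

def style_loss(feat_a, feat_b):
    """Squared Frobenius distance between Gram matrices, measuring the
    style difference between two feature maps (e.g. hazy vs. haze-free)."""
    return float(np.sum((gram(feat_a) - gram(feat_b)) ** 2))
```

Because the Gram matrix correlates channels while discarding spatial layout, this term penalizes differences in texture statistics rather than pixel positions.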
S45, sending the feature F_3 sequentially to the fixed-parameter VQ decoder and the enhancement decoder for decoding to obtain an intermediate feature F_5;
Specifically, the output obtained by the VQ decoder tends to lack detail in regions of deeper haze, where the image structure and texture are blurrier. To improve the decoding of the detail features of a foggy image, the invention designs a multi-attention enhancement decoder in the prior matching branch, combining channel attention and pixel attention; finally, an enhancement block based on pyramid pooling ensures that feature details of different scales are embedded into the final result;
S46, inputting an original foggy image into the channel attention branch to obtain an intermediate feature F_6;
The channel attention branch is added to attend to non-uniform haze and dense-fog regions with obvious brightness changes, avoiding the over-enhancement problem and improving the overall reconstruction performance of images. The attention mechanism lets the network flexibly focus on haze features so as to reconstruct high-quality haze-free images; it can significantly raise the brightness of regions occluded by non-uniform haze and dense fog, and it further attends to the restoration of regions with obvious brightness changes such as sky and snow;
S47, adding the intermediate features F 5 and F 6 channel-wise, then passing the result through a feature fusion module to obtain the generated defogged image y;
The final feature fusion stage adopts a feature fusion module consisting of a reflection padding layer, a convolution layer and a Tanh activation function, which fuses the outputs of the two branches;
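The fusion steps above (channel-wise addition, reflection padding, convolution, Tanh) can be sketched in one dimension. The function names, the fixed kernel and the 1-D setting are illustrative assumptions, since the real module operates on 2-D feature maps with learned convolution weights:

```python
import math

def reflect_pad(signal, pad):
    """Reflection padding: mirror the signal at each border without
    repeating the edge sample (PyTorch ReflectionPad semantics)."""
    left = signal[1:pad + 1][::-1]
    right = signal[-pad - 1:-1][::-1]
    return left + signal + right

def fuse(branch_a, branch_b, kernel):
    """Channel-wise addition of two branch outputs, reflection
    padding, a fixed 'convolution', then Tanh squashing to [-1, 1]."""
    summed = [a + b for a, b in zip(branch_a, branch_b)]
    pad = len(kernel) // 2
    padded = reflect_pad(summed, pad)
    conv = [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(summed))]
    return [math.tanh(v) for v in conv]
```

Reflection padding avoids the dark border artifacts that zero padding would introduce at image edges, and Tanh maps the fused output into the normalized image range.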
S48, constructing the loss function of the remaining parts of the network according to the defogged clear image y and the original clear image x, and updating the corresponding parameters with minimization of the loss function as the optimization target;
At this stage it must be verified that the image finally generated by the whole network has correctly completed the defogging task. Note that, because the losses are calculated separately, the parameters of the encoder and decoder are not updated together, so the gradient at this stage is not back-propagated to the encoder;
The losses of the remaining parts of the network are calculated with the following combination of loss terms;
Smooth L1 loss:
The smooth L1 loss combines the advantages of the L1 loss (also called mean absolute error) and the L2 loss: it measures the average magnitude of the error between the predicted value and the true value, and its derivative is continuous at 0, so it is solved more efficiently and converges faster;
Wherein x i and y i respectively represent the i-th pixel of the clear image and of the hazy image, N is the total number of pixels, f θ (·) represents the defogging network, and f θ (x i) represents the i-th pixel of the image reconstructed by the defogging network;
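A minimal pure-Python sketch of the smooth L1 loss described above; the quadratic-to-linear threshold beta=1.0 is an assumption matching the common default, not a value stated in the patent:

```python
def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 (Huber-style) loss: quadratic near zero so the
    derivative is continuous at 0, linear for large errors."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        # Quadratic branch below beta, linear branch above it;
        # the two branches meet smoothly at d == beta.
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total / len(pred)
```

The linear branch keeps large outlier errors from dominating the gradient, while the quadratic branch gives stable updates near convergence.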
MS-SSIM loss:
The MS-SSIM loss is based on the assumption that the human eye extracts structural information from an image, and therefore provides a perceptual reference for image quality. Let O and G denote two windows centered on the i-th pixel of the defogged image and of the ground-truth image respectively; applying a Gaussian filter to each window yields the corresponding means (μ O, μ G), standard deviations σ O, σ G and covariance σ OG, so that the MS-SSIM loss can be expressed as follows:
Wherein C 1 and C 2 are two small constants that stabilize the division in the formula;
Perceptual loss:
The perceptual loss is measured with a VGG16 network pre-trained on the ImageNet dataset, a computer-vision dataset created by Professor Fei-Fei Li of Stanford University and colleagues that contains more than 14 million images in more than 20,000 categories; a VGG16 pre-trained on this dataset has strong perceptual features and helps the model reconstruct finer details;
Let x and y represent the hazy image and the clean image respectively, f θ (x) the image reconstructed by the defogging network, and Φ j (·) the feature map produced by the j-th layer of VGG16; the L 2 loss measures the distance between the feature maps of the reconstructed image and of the clean image, and N represents the number of feature maps used to calculate the perceptual loss;
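The distance computation of the perceptual loss can be sketched with flattened feature maps represented as plain lists; extracting the maps with VGG16 is omitted, and the function name is illustrative:

```python
def perceptual_loss(feats_pred, feats_clean):
    """Mean squared L2 distance between corresponding feature maps,
    averaged over the N maps used (the maps would come from selected
    VGG16 layers applied to the reconstructed and clean images)."""
    total = 0.0
    for fp, fc in zip(feats_pred, feats_clean):
        # Per-map mean squared error between flattened features.
        total += sum((a - b) ** 2 for a, b in zip(fp, fc)) / len(fp)
    return total / len(feats_pred)
```

Comparing deep features rather than raw pixels penalizes semantic and textural differences that pixel losses miss, which is why this term helps recover fine detail.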
Adversarial loss:
Since pixel-based loss functions cannot provide adequate supervision on small datasets, an adversarial loss is added to remedy this deficiency;
Wherein D represents the discriminator (Discriminator) employed when training the codebook, and M represents the number of samples;
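Since the exact adversarial formulation appears only as an equation image in the original, the sketch below uses the common non-saturating generator loss, the average of -log D(y) over M samples, purely as an illustrative assumption:

```python
import math

def adversarial_loss(d_scores_fake):
    """Non-saturating generator adversarial loss: -log D(y) averaged
    over M generated samples, where D(y) in (0, 1] is the
    discriminator's probability that the defogged image is real."""
    m = len(d_scores_fake)
    return -sum(math.log(s) for s in d_scores_fake) / m
```

The loss is 0 when the discriminator is fully fooled (scores of 1) and grows as the discriminator confidently rejects the generated images.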
Total loss:
The weights of the smooth l 1 loss, the MS-SSIM loss, the perceptual loss and the adversarial loss are initially set to 1, 0.5, 0.01 and 0.0005 respectively, and then adjusted according to the experimental results, so the total image reconstruction loss is expressed as follows:
L = L l1 + 0.5 L MS-SSIM + 0.01 L perc + 0.0005 L adv
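The stated weighting can be sketched directly; the helper name is illustrative:

```python
def total_loss(l1, ms_ssim, perc, adv,
               w=(1.0, 0.5, 0.01, 0.0005)):
    """Weighted sum of the four reconstruction losses with the
    initial weights stated above (later tuned experimentally)."""
    return w[0] * l1 + w[1] * ms_ssim + w[2] * perc + w[3] * adv
```

The small weights on the perceptual and adversarial terms keep them as regularizers, letting the pixel-level and structural terms dominate early training.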
S5, inputting the hazy image into the trained dual-branch multi-scale image defogging network model to obtain the generated clear haze-free image;
Preferably, during training the input image is randomly cropped to 256×256 and the dataset is augmented by scaling, random rotation and flipping; the Adam optimizer is used with the default β 1 and β 2 of 0.9 and 0.99 respectively, an initial learning rate of 0.0001 and a batch size of 1, and the model is implemented in PyTorch on an NVIDIA V100 Tensor Core GPU.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (7)

1. A dual-branch multi-scale image defogging method based on a high-quality codebook, characterized by comprising:
S1: obtaining an original image super-resolution reconstruction dataset, the dataset comprising original clear images;
S2: training a VQGAN network model with the original image super-resolution reconstruction dataset to obtain the codebook, the network structure of the VQ decoder and its corresponding parameters; the VQGAN network model comprises a VQ encoder, a codebook and a VQ decoder;
S3: obtaining an original image defogging dataset comprising original hazy images and the corresponding clear haze-free images;
S4: training a dual-branch multi-scale image defogging network model with the original image defogging dataset; the model is divided into a prior matching branch and a channel attention branch; the prior matching branch comprises the fixed-parameter VQGAN network model, a pyramid dilated neighborhood attention encoder and an enhancement decoder, its structure consisting, in order, of the VQ encoder, the pyramid dilated neighborhood attention encoder, a fixed-parameter codebook matching module, the fixed-parameter VQ decoder and the enhancement decoder; the channel attention branch comprises one 3×3 convolution and four residual channel attention layers; the results of the two branches are finally fused through a feature fusion structure;
the pyramid dilated neighborhood attention encoder comprises a serializer, neighborhood attention Transformer blocks and a downsampler, and its processing comprises the following steps:
Step 1: the input data passes through the serializer and two downsampling operations to obtain feature maps at four different resolutions; a pyramid structure aggregates features of different levels by cascading, feeding the feature information of each preceding layer as input to the next layer;
Step 2: multiple residual connections allow features of different levels to be fused, thereby extracting multi-scale features of the haze distribution in the image and obtaining the output features;
S5: inputting a hazy image into the trained dual-branch multi-scale image defogging network model to obtain a generated clear haze-free image.
2. The dual-branch multi-scale image defogging method based on a high-quality codebook according to claim 1, characterized in that training the VQGAN network model in step S2 comprises:
S21: inputting the original clear image x into the VQ encoder E vq, which is based on a UNet architecture, for extraction and sampling, concatenating edge features through multiple residual structures during sampling, and obtaining a latent feature map z;
S22: matching the latent feature map z to the nearest element in the codebook to obtain the discrete feature map z q;
S23: constructing a codebook discretization loss function from the latent feature map z and the discrete feature map z q, and updating the parameters of the VQ encoder and of the codebook with minimization of the loss function as the optimization target;
S24: feeding the discrete feature map z q into the VQ decoder for decoding to obtain the reconstructed clear image y;
S25: constructing the loss function of the VQ decoder from the reconstructed clear image y and the original clear image x, and updating the parameters of the VQ decoder with minimization of the loss function as the optimization target.
3. The dual-branch multi-scale image defogging method based on a high-quality codebook according to claim 1 or 2, characterized in that, in the mathematical expression of the codebook, Z represents the codebook, z k represents a codebook code, K represents the number of codebook codes, and the set notation represents the codebook code set.
4. The dual-branch multi-scale image defogging method based on a high-quality codebook according to claim 2, characterized in that, in the codebook discretization loss function, L Z represents the codebook discretization loss function, sg[ ] represents the stop-gradient operation, z q represents the discrete feature map, z represents the latent feature map, and β represents a weight factor.
5. The dual-branch multi-scale image defogging method based on a high-quality codebook according to claim 1, characterized in that training the dual-branch multi-scale image defogging network model in S4 comprises:
S41: inputting the original hazy image x into the VQ encoder to coarsely extract features and obtain a preliminary feature F 1, then inputting F 1 into the pyramid dilated neighborhood attention encoder to obtain a high-level feature F 2;
S42: inputting the high-level feature F 2 into the fixed-parameter codebook matching module for matching to obtain the matched feature F 3;
S43: inputting the original clear haze-free image into the VQGAN network model pre-trained in S2 and passing it through the VQ encoder and the codebook in turn to obtain an intermediate feature F 4;
S44: constructing the encoder loss function of the dual-branch multi-scale image defogging network model from the preliminary feature F 1, the high-level feature F 2, the matched feature F 3 and the intermediate feature F 4, and updating the parameters of the VQ encoder and of the pyramid dilated neighborhood attention encoder with minimization of the loss function as the optimization target;
S45: feeding the feature F 3 sequentially into the fixed-parameter VQ decoder and the enhancement decoder for decoding to obtain an intermediate feature F 5;
S46: inputting the original hazy image into the channel attention branch to obtain an intermediate feature F 6;
S47: adding the intermediate features F 5 and F 6 channel-wise, then passing the result through the feature fusion module to obtain the generated defogged image y;
S48: constructing the loss function of the remaining parts of the dual-branch multi-scale image defogging network model (comprising the smooth l 1 loss, the MS-SSIM loss, the perceptual loss and the adversarial loss) from the defogged clear image y and the original clear image x, and updating the parameters of the remaining parts with minimization of the loss function as the optimization target; the parameters of the remaining parts of the model comprise the parameters of the VQ decoder, the enhancement decoder and the codebook matching module.
6. The dual-branch multi-scale image defogging method based on a high-quality codebook according to claim 1, characterized in that the matching process of the codebook matching module, with F(f k, α) = f k × e^α, satisfies: M(z) represents the matching process of the codebook matching module, F( ) represents the weight function generated from the frequency difference, f k represents the frequency difference between the hazy image and the clear image over the codebook activations, α represents a parameter used to adjust the degree of defogging, ||z - z k|| represents the distance between the discretized feature z of the hazy image and the codebook code z k, and argmin( ) represents the function taking the minimum of that distance.
7. The dual-branch multi-scale image defogging method based on a high-quality codebook according to claim 1, characterized in that, in the encoder loss function of the dual-branch multi-scale image defogging network model, L VQ represents the encoder loss function, z h represents the intermediate features of the hazy image, the hatted feature represents the intermediate features of the haze-free image, λ style and λ adv represent the first and second hyper-parameters used to adjust the weights of the different losses, Ψ( ) represents the matrix used to measure the style loss, E[ ] represents the encoder, D( ) represents the discriminator, and the indexed term represents the i-th intermediate feature of the hazy image.
CN202411445292.XA 2024-10-16 2024-10-16 Double-branch multi-scale image defogging method based on high-quality codebook Active CN119399068B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411445292.XA CN119399068B (en) 2024-10-16 2024-10-16 Double-branch multi-scale image defogging method based on high-quality codebook


Publications (2)

Publication Number Publication Date
CN119399068A CN119399068A (en) 2025-02-07
CN119399068B true CN119399068B (en) 2025-11-07

Family

ID=94419698

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411445292.XA Active CN119399068B (en) 2024-10-16 2024-10-16 Double-branch multi-scale image defogging method based on high-quality codebook

Country Status (1)

Country Link
CN (1) CN119399068B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0739140A2 (en) * 1995-04-18 1996-10-23 Sun Microsystems, Inc. Encoder for an end-to-end scalable video delivery system
CN115689932A (en) * 2022-11-09 2023-02-03 重庆邮电大学 Image defogging method based on deep neural network

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US12511723B2 (en) * 2023-02-28 2025-12-30 Nanjing University Of Posts And Telecommunications Single image dehazing method based on detail recovery
CN117788450A (en) * 2023-12-29 2024-03-29 山东省计算中心(国家超级计算济南中心) Remote sensing image change detection method and device based on Transformer and DCN
CN118411440A (en) * 2024-04-24 2024-07-30 湖州师范学院 Remote sensing image reconstruction method based on remote sensing image compression network


Also Published As

Publication number Publication date
CN119399068A (en) 2025-02-07

Similar Documents

Publication Publication Date Title
CN113822969B (en) Training neural radiation field model, face generation method, device and server
CN118314353B (en) Remote sensing image segmentation method based on double-branch multi-scale feature fusion
Aakerberg et al. Semantic segmentation guided real-world super-resolution
CN114913599B (en) Video abnormal behavior detection method and system based on automatic encoder
CN114170088A (en) Relational reinforcement learning system and method based on graph structure data
CN115731597B (en) Automatic segmentation and restoration management platform and method for mask image of face mask
CN113808031A (en) Image restoration method based on LSK-FNet model
CN116137043B (en) A colorization method for infrared images based on convolution and Transformer
CN117710671A (en) Medical image segmentation method based on segmentation large model fine adjustment
CN114943894A (en) ConvCRF-based high-resolution remote sensing image building extraction optimization method
CN119992550B (en) Image segmentation method, model, model training method and image segmentation system
CN117011701B (en) A remote sensing image feature extraction method based on hierarchical feature autonomous learning
CN117315543B (en) A semi-supervised video target segmentation method based on confidence-gated spatiotemporal memory networks
CN112633234A (en) Method, device, equipment and medium for training and applying face glasses-removing model
CN116630369A (en) UAV target tracking method based on spatio-temporal memory network
Wang et al. Self-prior guided pixel adversarial networks for blind image inpainting
CN117252757A (en) Hyperspectral image super-resolution method and system based on natural image prior
CN117876679A (en) A remote sensing image scene segmentation method based on convolutional neural network
CN116704585A (en) A Face Recognition Method Based on Quality Perception
CN119399068B (en) Double-branch multi-scale image defogging method based on high-quality codebook
CN119625308A (en) A semi-supervised video object segmentation method and system based on global and local feature fusion
CN119963957A (en) A multimodal image fusion method based on SwinTransformer
CN117557473B (en) A knowledge-guided multi-sensory attention network image dehazing method
CN114764880B (en) Remote sensing image scene classification method based on multi-component GAN reconstruction
CN117292299A (en) A video anomaly detection method based on optical flow decomposition and spatiotemporal feature learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant