CN119399068B - Double-branch multi-scale image defogging method based on high-quality codebook - Google Patents
- Publication number
- CN119399068B CN119399068B CN202411445292.XA CN202411445292A CN119399068B CN 119399068 B CN119399068 B CN 119399068B CN 202411445292 A CN202411445292 A CN 202411445292A CN 119399068 B CN119399068 B CN 119399068B
- Authority
- CN
- China
- Prior art keywords
- image
- codebook
- encoder
- branch
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/73—Deblurring; Sharpening
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T3/00—Geometric image transformations in the plane of the image
- G06T3/40—Scaling of whole images or parts thereof, e.g. expanding or contracting
- G06T3/4053—Scaling of whole images or parts thereof, e.g. expanding or contracting based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/74—Image or video pattern matching; Proximity measures in feature spaces
- G06V10/75—Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10004—Still image; Photographic image
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Software Systems (AREA)
- Medical Informatics (AREA)
- Databases & Information Systems (AREA)
- Multimedia (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- General Engineering & Computer Science (AREA)
- Molecular Biology (AREA)
- Data Mining & Analysis (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Image Processing (AREA)
Abstract
The invention relates to the technical field of image processing, and in particular to a double-branch multi-scale image defogging method based on a high-quality codebook. The method comprises: obtaining an original image super-resolution reconstruction data set; training a VQGAN network model with this data set to obtain the codebook, the network structure of the VQ decoder, and their corresponding parameters; obtaining an original image defogging data set; training a double-branch multi-scale image defogging network model with the defogging data set; and inputting a foggy image into the trained double-branch multi-scale image defogging network model to obtain a clear, fog-free image.
Description
Technical Field
The invention relates to the technical field of computer image restoration, in particular to a double-branch multi-scale image defogging method based on a high-quality codebook.
Background
Mist is an aerosol system composed of a large number of tiny water droplets suspended in the air near the ground, and it is one of the main causes of image blurring, color distortion, and contrast reduction.
With the frequent extreme weather of recent years, many cities are affected by fog on a seasonal, recurring basis. Fog reduces the accuracy of the image information captured by outdoor vision systems such as autonomous driving, video surveillance, military reconnaissance, and remote sensing. As fog thickens or becomes non-uniform, image quality degrades rapidly: color distortion, feature blurring, contrast reduction, and other losses of visual quality make objects and backgrounds in the image unrecognizable, and seriously affect the execution of subsequent vision tasks such as semantic segmentation and object detection. Images must therefore be preprocessed to reduce the influence of fog on imaging quality.
The current mainstream deep-learning-based image defogging schemes are as follows:
Methods based on a physical model and prior knowledge use deep learning assisted by image enhancement preprocessing; many works employ a convolutional neural network to estimate the parameters of the atmospheric scattering model. To avoid accumulated error in the parameter estimation process, end-to-end networks have been proposed that directly estimate a fog-free image from the foggy image. The foggy training images are usually synthesized by degrading clear, fog-free images with the atmospheric scattering model, but that model cannot perfectly describe the formation process of all fog, so artificially synthesized images cannot fully replace real foggy images. Models trained on them have poor generalization ability and often fail when processing real images; such methods achieve excellent performance on synthetic data sets, but their performance on real data sets still needs improvement.
Disclosure of Invention
In order to solve the technical problems, the invention provides a double-branch multi-scale image defogging method based on a high-quality codebook, which comprises the following steps:
S1, acquiring an original image super-resolution reconstruction data set, wherein the original image super-resolution reconstruction data set comprises an original clear image;
S2, training a VQGAN network model with the original image super-resolution reconstruction data set to obtain the codebook, the network structure of the VQ decoder, and their corresponding parameters;
the VQGAN network model comprises a VQ encoder, a codebook and a VQ decoder;
S3, acquiring an original image defogging data set, wherein the original image defogging data set comprises an original foggy image and a corresponding clear fog-free image;
S4, training a double-branch multi-scale image defogging network model by using an original image defogging data set;
The double-branch multi-scale image defogging network model is divided into a priori matching branch and a channel attention branch. The priori matching branch comprises the fixed-parameter VQGAN network model, a pyramid cavity neighborhood attention encoder, and an enhancement decoder; its structure is, in sequence, the VQ encoder, the pyramid cavity neighborhood attention encoder, a fixed-parameter codebook matching module, the fixed-parameter VQ decoder, and the enhancement decoder. The channel attention branch comprises a 3×3 convolution and 4 residual channel attention layers;
S5, inputting the foggy image into the trained double-branch multi-scale image defogging network model to obtain a generated clear fog-free image.
Compared with the prior art, the invention trains a discrete codebook on clear, haze-free images, encapsulating high-quality prior knowledge of the original images' colors and structures. It then constructs a double-branch neural network, namely a priori matching branch and a channel attention branch, which uses Transformer-based neighborhood attention and convolution-based channel attention to extract the global features of hazy images and to learn the complex interactions between haze regions and the underlying scene. The features extracted by the two branches are fused through a feature fusion module. During the matching between the high-quality prior codebook and the hazy-image features, a controllable distance recalculation operation replaces the haze-affected regions of the image, and the original haze-free image is reconstructed, realizing an end-to-end image defogging process that improves the clarity and recognizability of hazy images.
Drawings
FIG. 1 is a flow chart of a dual-branch multi-scale image defogging method based on a high-quality codebook;
FIG. 2 is a schematic diagram of a dual-branch multi-scale defogging network structure based on a high quality codebook according to the present invention;
FIG. 3 is a schematic diagram of the structure of a VQ encoder and a VQ decoder according to the present invention;
FIG. 4 is a diagram of a pyramid hole neighborhood attention encoder of the present invention;
Fig. 5 is a block diagram of an enhancement decoder according to the present invention.
Detailed Description
The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, but not all embodiments, and all other embodiments obtained by those skilled in the art without making creative efforts based on the embodiments of the present invention are included in the protection scope of the present invention.
The invention provides a double-branch multi-scale image defogging method based on a high-quality codebook, as shown in Figure 1; the whole model is trained in two stages.
In the first stage, VQGAN is trained in advance to obtain a discrete codebook reflecting the detail-texture features common to clear images. In this stage the network consists of a VQ encoder, a codebook, and a VQ decoder, and the training goal is to obtain a codebook storing high-quality clear-image features together with the corresponding VQ decoder;
S1, acquiring an original image super-resolution reconstruction data set;
The original image super-resolution reconstruction data set comprises an original clear image;
In a specific implementation, the data set is Flickr2K, which contains 1000 clear pictures of people, buildings, animals, objects, and other content. A low-resolution subset downsampled 4× by bicubic interpolation is adopted, with a resolution of roughly 512×349; 800 pictures are used for training, 100 for validation, and 100 for testing;
S2, training VQGAN a network model by utilizing an original image super-resolution reconstruction data set to obtain a codebook, a network structure of a VQ decoder and corresponding parameters thereof;
Wherein, as shown in fig. 2, VQGAN network model includes VQ encoder, codebook, VQ decoder, wherein VQ encoder and VQ decoder structure is shown in fig. 3;
Preferably, training the VQGAN network model in S2 includes:
S21, inputting an original clear image x into a VQ encoder to obtain a potential feature map z;
Specifically, the VQ encoder and decoder adopt a UNet-based architecture: the first half extracts features and the second half upsamples. UNet has been shown to perform well in image classification, segmentation, and similar fields. Because some edge features are lost while the encoder downsamples and refines features, several residual structures are adopted, and the edge features are recovered through feature concatenation;
S22, matching the potential feature map z to the nearest element in the codebook so as to obtain a codebook-passing discrete feature map z q;
The codebook is a discrete codebook: it compresses the detail-texture features of an image and plays a key role in image reconstruction. The latest result in the field of image generation, VQGAN, is adopted with some modifications; for the specific technical details see Esser P., Rombach R., Ommer B., "Taming Transformers for High-Resolution Image Synthesis", IEEE Conference on Computer Vision and Pattern Recognition, Virtual Event, 2021: 12873-12883. By training on fog-free images, the semantics they contain are compressed and a semantically rich codebook is built through self-supervised learning, improving the compression rate while keeping good perceptual quality; the codebook therefore serves as prior knowledge of fog-free images and as a constraint when the subsequent defogging model generates the fog-free image;
The codebook is represented mathematically as Z = {z_k}_{k=1}^N ⊂ R^d. Given a high-quality image x as input to the VQ encoder E_vq, a latent feature map z is output, and each pixel z_ij of z is matched to the nearest element in the codebook to obtain the discrete feature map z_q. The discretized features are then input to the VQ decoder D_vq, which produces the processed image y. The whole process can be expressed as follows:
z_ij = E_vq(x_ij)
z_q,ij = M(z_ij) = argmin_{z_k ∈ Z} ||z_ij - z_k||
y = D_vq(z_q)
where z_ij denotes the pixel at position ij of the latent feature map z, E_vq() denotes the VQ encoder, x_ij denotes the pixel at position ij of the high-quality image x, z_q denotes the discrete feature map after codebook matching, M() denotes the codebook matching operation, ||z_ij - z_k|| denotes the distance between the discretized feature and the codebook code z_k, argmin() selects the code minimizing that distance, and y denotes the generated image with pixels y_ij;
Notably, the discretization operation between the encoder and decoder is not differentiable, so gradients cannot back-propagate through it during training; instead, the gradient is simply copied from the decoder to the encoder during back-propagation, so that the model can still be trained end-to-end through the loss function;
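The nearest-code matching and the gradient-copy (straight-through) trick described above can be sketched in PyTorch as follows; this is a minimal illustration, and the tensor shapes and the helper name `quantize` are assumptions rather than the patent's code:

```python
import torch

def quantize(z, codebook):
    """Match each spatial feature vector of z to its nearest codebook entry.

    z:        (B, C, H, W) latent feature map from the VQ encoder
    codebook: (K, C) code vectors Z = {z_k}
    Returns the quantized map z_q with the straight-through gradient copy.
    """
    B, C, H, W = z.shape
    flat = z.permute(0, 2, 3, 1).reshape(-1, C)   # (B*H*W, C)
    dist = torch.cdist(flat, codebook)            # ||z - z_k|| to every code
    idx = dist.argmin(dim=1)                      # nearest code per pixel
    z_q = codebook[idx].reshape(B, H, W, C).permute(0, 3, 1, 2)
    # Straight-through estimator: forward uses z_q, backward copies the
    # decoder gradient straight to the encoder (argmin is non-differentiable)
    z_q = z + (z_q - z).detach()
    return z_q, idx
```

The `z + (z_q - z).detach()` line is the standard way to implement the gradient copy: the forward value equals z_q, while the gradient with respect to z is the identity.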
Through the discretization operation, features of clear images are compressed into short vectors and stored in the codebook, so the discrete codebook compresses the detailed texture features of the images and plays a key role in image reconstruction. A discriminator judges whether the reconstructed images are authentic, assisting the network to learn further; training is complete once the discriminator can no longer distinguish real images from generated ones;
S23, constructing a codebook discretization loss function according to the potential feature map z and the discrete feature map z q, and updating parameters of a VQ encoder and a codebook of the model by taking the minimum loss function as an optimization target;
Wherein the codebook discretization loss function is defined as follows:
L_Z = ||sg[z] - z_q||² + β ||z - sg[z_q]||²
wherein L_Z denotes the codebook discretization loss function, z_q the discrete feature map, z the latent feature map, β a weight factor, and sg[·] the stop-gradient operator; this loss function mainly measures the discretization error between the encoder output z and the discrete vector z_q;
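A minimal PyTorch sketch of this discretization loss, assuming sg[·] maps to `detach()` and taking the common VQ-VAE default β = 0.25 (the patent does not state its value):

```python
import torch
import torch.nn.functional as F

def codebook_loss(z, z_q, beta=0.25):
    """L_Z = ||sg[z] - z_q||^2 + beta * ||z - sg[z_q]||^2.

    The first term pulls the codebook entries toward the encoder output;
    the second (commitment) term, weighted by beta, keeps the encoder
    output close to its chosen code.  beta = 0.25 is an assumed default.
    """
    return F.mse_loss(z.detach(), z_q) + beta * F.mse_loss(z, z_q.detach())
```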
S24, sending the discrete feature map z_q to the VQ decoder for decoding to obtain a reconstructed clear image y;
s25, constructing a loss function of the VQ decoder according to the reconstructed clear image y and the original clear image x, and updating parameters of the VQ decoder of the model by taking the minimum loss function as an optimization target;
Details of this step are given in S46;
In the second stage, a double-branch multi-scale neural network is constructed. Transformer-based neighborhood attention and convolution-based channel attention extract the global features of the foggy image and learn the complex interactions between fog regions and the underlying scene, while the discrete codebook obtained from clear images in the first stage assists the reconstruction of the fog-free image. Specifically, a controllable distance recalculation operation matches the corresponding discrete codes and replaces the fog-affected regions of the image, thereby achieving the defogging effect;
S3, acquiring an original image defogging data set;
The original image defogging data set comprises original foggy images and the corresponding clear fog-free images;
In practice, the data sets used are O-HAZE, I-HAZE, DENSE-HAZE, NH-HAZE-20, NH-HAZE-21, and NH-HAZE-23. O-HAZE and I-HAZE contain 45 outdoor and 35 indoor pairs of hazy images and corresponding clear images, respectively. DENSE-HAZE features dense, uniform haze scenes and contains 55 pairs of real dense-fog images and corresponding fog-free images of various outdoor scenes; the dense fog was generated by a professional haze machine and makes the objects originally present in the images nearly indistinguishable, so defogging is far more difficult than on conventional data sets. The NH-HAZE data sets are sets of real non-uniform hazy images paired with fog-free images, where the non-uniform haze was introduced by a fog generator simulating real hazy conditions; they are further divided into NH-HAZE-20, NH-HAZE-21, and NH-HAZE-23. Details of each data set are given in Table 1:
Table 1:
S4, training a double-branch multi-scale image defogging network model by using an original image defogging data set;
The defogging network model of the double-branch multi-scale image is divided into a priori matching branch and a channel attention branch;
The prior matching branch comprises a VQ encoder, a pyramid hole neighborhood attention encoder, a fixed parameter codebook matching module, a fixed parameter VQ decoder and an enhancement decoder, wherein the pyramid hole neighborhood attention encoder is shown in figure 4, and the enhancement decoder is shown in figure 5;
The channel attention branch includes a 3×3 convolution and 4 residual channel attention layers;
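The channel attention branch described above (a 3×3 convolution followed by 4 residual channel attention layers) can be sketched as below; the squeeze-and-excitation layout and the reduction ratio are assumptions in the style of RCAN, since the patent gives only the layer list:

```python
import torch
import torch.nn as nn

class ResidualChannelAttention(nn.Module):
    """One residual channel-attention layer (RCAN-style sketch)."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        # Squeeze-and-excitation style channel attention over the body output
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        f = self.body(x)
        return x + f * self.attn(f)  # residual connection

class ChannelAttentionBranch(nn.Module):
    """3x3 convolution followed by 4 residual channel-attention layers."""

    def __init__(self, in_ch=3, channels=64):
        super().__init__()
        self.head = nn.Conv2d(in_ch, channels, 3, padding=1)
        self.layers = nn.Sequential(
            *[ResidualChannelAttention(channels) for _ in range(4)])

    def forward(self, x):
        return self.layers(self.head(x))
```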
and finally fusing the results of the two branches through a feature fusion structure;
Preferably, training the dual-branch multi-scale image defogging network model in S4 includes:
S41, inputting an original foggy image x into a VQ encoder to roughly extract features to obtain a preliminary feature F 1, and then inputting the preliminary feature F 1 into a pyramid cavity neighborhood attention encoder to obtain an advanced feature F 2;
The VQ encoder performs well when encoding a clear image, but poorly when encoding dense or non-uniform fog, mainly because the defogging task requires the encoder both to extract the general structural and texture features of the image and to distinguish its fog regions, and the VQ encoder's network architecture is too shallow to accomplish this well;
To fully extract global features such as the texture and structure of a foggy image, the invention designs an encoder based on hole neighborhood attention in the priori matching branch. Hole neighborhood attention is a self-attention variant for Vision Transformers, an effective and scalable sliding-window attention mechanism whose downstream visual performance exceeds that of the Vision Transformer and the Swin Transformer;
The neighborhood attention Transformer consists of a multi-layer perceptron (MLP), a normalization layer (LayerNorm, LN), residual connections, and multi-head neighborhood attention (NA). At the minimum neighborhood size, each pixel attends only to the 1-pixel neighborhood around itself; at the maximum neighborhood size, the neighborhood attention output equals self-attention;
Compared with self-attention, neighborhood attention not only reduces computational cost but also introduces a local inductive bias like that of convolution. Specifically, NA is a pixel-wise operation that localizes self-attention (SA) to the nearest neighboring pixels, so neighborhood attention has linear time and space complexity, versus the quadratic complexity of self-attention;
In the hole-neighborhood-attention encoder designed by the invention, four feature maps of different resolutions are obtained through a serializer and two downsampling steps. Using a pyramid structure, cascade operations feed the feature information of every earlier layer into the next layer, aggregating features of different levels and realizing feature reuse across scales;
Therefore, the invention adopts a neighborhood attention mechanism, focuses on the global features of the image, and utilizes high-quality prior to perform feature matching, thereby improving the generalization of the network;
First, the VQ encoder roughly extracts image position and structure information; then an overlapping serializer (Overlapping Tokenizer) serializes the shallow features into the input of a neighborhood attention Transformer block. A downsampler follows the second neighborhood attention Transformer block, halving the spatial size and doubling the number of channels, thereby generating feature maps of different scales;
To fuse feature maps of different scales, the invention designs a pyramid-shaped feature aggregation scheme with several dense connection operations. Direct addition loses some original feature information during fusion, whereas cascading (concatenation) is, strictly speaking, lossless, so dense connections are adopted in the progressive processing of the feature maps: through addition and cascade operations, the feature information of every earlier layer serves as input to the next layer, aggregating features of different levels, realizing feature reuse across scales, and optimizing the global information in the features. Multiple residual connections further allow features of different levels to be fused, so the multi-scale features of the image's fog distribution are extracted, which benefits the subsequent high-quality prior matching;
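The pyramid-shaped dense aggregation idea, with earlier stages cascaded (concatenated) into later ones rather than simply added, can be illustrated as follows; the channel counts and the 1×1 fusion convolutions are illustrative assumptions, not the patent's exact design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidAggregation(nn.Module):
    """Dense cascade across a feature pyramid: every earlier stage is
    resized to the current stage's resolution, concatenated with it
    (lossless, unlike addition), and fused by a 1x1 convolution."""

    def __init__(self, channels=(32, 64, 128)):
        super().__init__()
        # Each fusion conv takes the concatenation of all stages up to i
        self.fuse = nn.ModuleList([
            nn.Conv2d(sum(channels[:i + 1]), channels[i], 1)
            for i in range(1, len(channels))
        ])

    def forward(self, feats):
        """feats: list of maps from high to low resolution."""
        outs = [feats[0]]
        for i in range(1, len(feats)):
            target = feats[i].shape[-2:]
            # bring every earlier output down to this stage's resolution
            cascade = [F.adaptive_avg_pool2d(o, target) for o in outs]
            fused = self.fuse[i - 1](torch.cat(cascade + [feats[i]], dim=1))
            outs.append(fused)
        return outs
```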
S42, inputting the advanced feature F 2 into a codebook matching module with fixed parameters for matching to obtain a matched feature F 3;
When reconstructing an image with the high-quality codebook, the discrete codes output by the encoder are difficult to match to the correct high-quality codes, mainly because the image is severely degraded; a foggy image also suffers a domain gap that makes the data distributions inconsistent. The distance between the encoder output and the codebook therefore needs adjusting: a matching operation based on a controllable distance recalculation method reduces the problems caused by the domain gap and achieves a better reconstruction;
The distance between each discrete code of the foggy image and every code in the codebook is computed to find the code with the minimum distance, and the finally computed distance is adjusted through a weight function F, giving the matching formula:
M(z) = argmin_{z_k ∈ Z} ( ||z - z_k|| + F(f_k, α) )
F(f_k, α) = f_k × e^α
wherein M(z) denotes the matching process of the codebook matching module, F() the weight function generated from the activation-frequency difference, f_k the difference between the foggy and clear images' activation frequencies on codebook code z_k, α a parameter adjusting the degree of defogging, ||z - z_k|| the distance between the foggy image's discretized feature z and codebook code z_k, and argmin() the minimizer of the adjusted distance.
The matching formula requires the code activation-frequency difference f_k and the parameter α. At the start of defogging-network training, the frequency difference of every code in the codebook is set to 0; whenever the current foggy image fails to match a codebook code that the clear image does match, the frequency difference on that code is updated, and after repeated training the network learns an optimal value. For the value of α, the difference between the codes of the encoder and those of the clear fog-free image is represented as the difference between two probability distributions. The Kullback-Leibler divergence (KL divergence) is an index measuring the similarity of two probability distributions: the greater the similarity, the smaller the KL divergence. Denote the probability distribution of codebook activations for clear images by P_c and the corresponding distribution for foggy images by P_h; the distribution of the foggy image can be adjusted through α, and the optimal α between the two domains is the one that minimizes the KL divergence between P_h and P_c;
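The controllable distance recalculation can be sketched as below. The additive way F(f_k, α) enters the distance is reconstructed from the patent's variable definitions and should be treated as an assumption, as should the function name:

```python
import math
import torch

def match_with_recalculated_distance(z_flat, codebook, freq_diff, alpha=0.0):
    """Codebook matching with a recalculated distance:
    M(z) = argmin_k ( ||z - z_k|| + F(f_k, alpha) ),  F(f_k, alpha) = f_k * e^alpha.

    z_flat:    (N, C) discretized features of the hazy image
    codebook:  (K, C) high-quality codebook
    freq_diff: (K,) activation-frequency difference f_k per code
               (learned during training; passed in here)
    alpha:     scalar controlling the defogging strength
    """
    dist = torch.cdist(z_flat, codebook)                        # ||z - z_k||
    adjusted = dist + freq_diff.unsqueeze(0) * math.exp(alpha)  # + F(f_k, alpha)
    idx = adjusted.argmin(dim=1)
    return codebook[idx], idx
```

With all frequency differences at 0 this reduces to plain nearest-code matching; a large f_k on a code steers matches away from it.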
S43, inputting an original clear fog-free image into the pre-trained VQGAN network model in the S2, and sequentially passing through a VQ encoder and a codebook to obtain an intermediate feature F 4;
details of this step are given in S2;
S44, constructing an encoder loss function of a double-branch multi-scale image defogging network model according to the preliminary feature F 1, the advanced feature F 2, the matched feature F 3 and the intermediate feature F 4, and updating parameters of a VQ encoder and a pyramid cavity neighborhood attention encoder of the model by taking the minimum loss function as an optimization target;
To help the encoder output match the correct high-quality codebook prior in a later step, the encoder's output features must be deliberately made to follow a standard normal distribution consistent with that used when training the high-quality prior;
Assume the foggy image input is x_h, the fog-free image is x_gt, the defogging-network encoder is E, and the encoder used to train the codebook is E_vq; then the intermediate feature of the foggy image processed by encoder E is z_h = E(x_h), and the intermediate feature of the fog-free image processed by encoder E_vq is z_gt = E_vq(x_gt);
In controlling the image generation, we must also control the style difference between the generated image and the haze-free image, so the style loss is measured with ψ, i.e. the Gram matrix, and the discriminator D used when training the codebook judges whether the generated features are real. The final encoder loss is:
L_VQ = Σ_i ||z_h^i - z_gt^i||² + λ_style ||ψ(z_h) - ψ(z_gt)||² - λ_adv E[ D(z_h) ]
where L_VQ denotes the encoder loss function of the double-branch multi-scale image defogging network model, z_h the intermediate features of the foggy image, z_gt the intermediate features of the haze-free image, z_h^i the ith intermediate feature of the hazy image, λ_style and λ_adv the first and second hyperparameters weighting the different losses, ψ() the Gram matrix used to measure style loss, E[·] the expectation, and D() the discriminator.
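The Gram-matrix style term ψ used in the encoder loss can be sketched as follows; the normalization by the number of elements is an assumed convention:

```python
import torch

def gram(f):
    """Gram matrix psi(f): channel-wise feature correlations,
    normalized by the number of elements per channel map."""
    b, c, h, w = f.shape
    x = f.reshape(b, c, h * w)
    return x @ x.transpose(1, 2) / (c * h * w)

def style_loss(z_h, z_gt):
    """||psi(z_h) - psi(z_gt)||^2 between hazy-branch features and the
    fixed VQ-encoder features of the clear image."""
    return torch.mean((gram(z_h) - gram(z_gt)) ** 2)
```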
S45, sequentially sending the characteristic F 3 to a fixed-parameter VQ decoder and an enhancement decoder for decoding to obtain an intermediate characteristic F 5;
Specifically, the output of the VQ decoder tends to lack detail where the haze is deeper, leaving the image structure and texture blurred. To improve the decoding of the foggy image's detail features, the invention designs a multi-attention enhancement decoder in the priori matching branch: it combines channel attention and pixel attention, and a pyramid-pooling-based enhancement block finally ensures that feature details of different scales are embedded into the result;
S46, inputting an original foggy image into a channel attention branch to obtain an intermediate feature F 6;
The channel attention branch is added to attend to non-uniform haze and dense-fog regions with obvious brightness changes. The attention mechanism lets the network flexibly focus on haze features so as to reconstruct a high-quality haze-free image: it can markedly raise the brightness of regions occluded by non-uniform haze and dense fog, and it further attends to the restoration of regions with obvious brightness changes such as sky and snow, thereby avoiding the over-enhancement problem and improving the overall reconstruction quality of the image;
S47, adding the intermediate features F_5 and F_6 channel-wise, and then passing the result through a feature fusion module to obtain the generated defogged image y;
the final feature fusion part adopts a feature fusion module consisting of a reflection filling layer, a convolution layer and a Tanh activation function, and fuses the outputs of the two branches;
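A minimal sketch of the feature fusion module: reflection padding, a convolution, and Tanh applied to the channel-wise sum of the two branch outputs. The kernel size and channel counts are assumptions, since the patent names only the layer types:

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Reflection padding + convolution + Tanh over the summed branches."""

    def __init__(self, channels=64, out_ch=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.ReflectionPad2d(1),
            nn.Conv2d(channels, out_ch, kernel_size=3),  # padding via reflection
            nn.Tanh(),
        )

    def forward(self, f5, f6):
        return self.net(f5 + f6)  # output values lie in [-1, 1]
```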
S48, constructing a loss function of the VQ decoder according to the defogged clear image y and the original clear image x, and updating parameters of the VQ decoder of the model by taking the minimum loss function as an optimization target;
At this stage it must be determined whether the image finally generated by the whole network has correctly completed the defogging task. Note that because the losses are calculated separately, the parameters of the encoder and decoder are not updated in the same step, so the gradient at this stage is not back-propagated to the encoder;
The loss of the other parts of the network is calculated by adopting the following loss combination;
Smooth L1 loss:
L_l1 = (1/N) Σ_{i=1}^{N} smooth_L1(f_θ(x_i) − y_i), where smooth_L1(e) = 0.5e² if |e| < 1 and |e| − 0.5 otherwise;
The smooth L1 loss (built on the L1 loss, also called mean absolute error) combines the advantages of the L1 and L2 losses: it measures the average error magnitude between predicted and true values, and its derivative is continuous at 0, so solving is more efficient and convergence faster;
where x_i and y_i respectively denote the i-th pixel of the foggy image and the clear image, N is the total number of pixels, f_θ(·) denotes the defogging network, and f_θ(x_i) denotes the i-th pixel of the image reconstructed by the defogging network;
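The smooth L1 loss above can be written as a short PyTorch function; the transition point beta = 1 is the conventional choice and an assumption here:

```python
import torch

def smooth_l1_loss(pred, target, beta=1.0):
    # Quadratic near zero (continuous derivative at 0), linear for large
    # errors, averaged over all N pixels. beta=1.0 is an assumed default.
    diff = (pred - target).abs()
    loss = torch.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)
    return loss.mean()
```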
MS-SSIM loss:
The MS-SSIM loss is based on the assumption that the human eye acquires image structure information, and thus provides a perceptual reference for image quality. Let O and G denote two windows centred on the i-th pixel of the defogged image and the real image respectively; applying a Gaussian filter to the two windows yields the corresponding means (μ_O, μ_G), standard deviations σ_O, σ_G and covariance σ_OG, and the single-scale structural similarity can be expressed as:
SSIM(O, G) = ((2μ_Oμ_G + C_1)(2σ_OG + C_2)) / ((μ_O² + μ_G² + C_1)(σ_O² + σ_G² + C_2))
with the MS-SSIM loss taken as L_MS-SSIM = 1 − MS-SSIM, where MS-SSIM aggregates SSIM over multiple scales;
where C_1 and C_2 are two small constants that stabilize the division in the formula;
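A simplified, runnable illustration of the SSIM computation, using whole-image statistics in place of the per-pixel Gaussian windows and multi-scale product the patent describes; the constants C1, C2 follow the common 0.01² / 0.03² convention, an assumption:

```python
import torch

def ssim_global(o, g, c1=0.01 ** 2, c2=0.03 ** 2):
    """Single-window SSIM over whole-image statistics, a sketch of the
    Gaussian-window, multi-scale version described in the text."""
    mu_o, mu_g = o.mean(), g.mean()
    var_o = o.var(unbiased=False)          # sigma_O^2
    var_g = g.var(unbiased=False)          # sigma_G^2
    cov = ((o - mu_o) * (g - mu_g)).mean() # sigma_OG
    num = (2 * mu_o * mu_g + c1) * (2 * cov + c2)
    den = (mu_o ** 2 + mu_g ** 2 + c1) * (var_o + var_g + c2)
    return num / den

def ms_ssim_loss(o, g):
    # Loss = 1 - similarity, so identical images give zero loss.
    return 1.0 - ssim_global(o, g)
```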
Perceptual loss:
The perceptual loss is measured with a VGG16 network pre-trained on the ImageNet dataset. ImageNet, a computer vision dataset created by Professor Fei-Fei Li of Stanford University and colleagues, contains more than 14 million images in more than 20,000 classes; a VGG16 pre-trained on it provides good perceptual features and helps the model reconstruct finer details;
Let x and y denote the foggy image and the clean image respectively, f_θ(x) the image reconstructed by the defogging network, and Φ_j(·) the feature map produced by the j-th selected layer of VGG16; the L2 loss measures the distance between the feature maps of the reconstructed image and the clean image, with N the number of feature maps used:
L_perc = (1/N) Σ_{j=1}^{N} ||Φ_j(f_θ(x)) − Φ_j(y)||²_2;
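A sketch of this perceptual loss; the feature extractors are passed in as a list so the example runs without downloading VGG16 weights, whereas in the patent they would be layers of an ImageNet-pretrained VGG16:

```python
import torch
import torch.nn as nn

def perceptual_loss(fx, y, feature_layers):
    """L2 distance between the feature maps of the reconstruction f_theta(x)
    and the clean image y, averaged over the N layers used."""
    total = 0.0
    for phi in feature_layers:
        total = total + torch.mean((phi(fx) - phi(y)) ** 2)
    return total / len(feature_layers)
```

With torchvision, `feature_layers` could for instance be slices of `torchvision.models.vgg16(weights="IMAGENET1K_V1").features`.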
Countering losses:
Since pixel-based loss functions do not provide adequate supervision on small datasets, an adversarial loss is added on top of the above losses to compensate for this shortcoming;
Wherein D represents a discriminator (Discriminator) employed in training the codebook, M represents the number of sample data;
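The adversarial term is not written out in this excerpt; the sketch below assumes the standard non-saturating generator loss, −mean(log D(f_θ(x))) over the M samples, which is a common choice rather than a confirmed detail of the patent:

```python
import torch

def generator_adv_loss(d_fake):
    """Non-saturating generator objective: -mean(log D(f_theta(x))).
    The exact formulation is an assumption; d_fake holds discriminator
    outputs in (0, 1) for the M generated samples."""
    eps = 1e-8                       # numerical guard against log(0)
    return -torch.log(d_fake + eps).mean()
```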
Total loss:
The weights of the smooth L1 loss, MS-SSIM loss, perceptual loss and adversarial loss are initially set to 1, 0.5, 0.01 and 0.0005 respectively and then adjusted according to experimental results, so the total image reconstruction loss is expressed as follows:
L = L_l1 + 0.5·L_MS-SSIM + 0.01·L_perc + 0.0005·L_adv
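The weighted combination can be checked with a one-line helper using the stated initial weights:

```python
def total_loss(l_l1, l_msssim, l_perc, l_adv,
               weights=(1.0, 0.5, 0.01, 0.0005)):
    """Weighted sum from the patent; the weights are the stated initial
    values and are meant to be tuned experimentally."""
    w1, w2, w3, w4 = weights
    return w1 * l_l1 + w2 * l_msssim + w3 * l_perc + w4 * l_adv
```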
S5, inputting the foggy image into the trained dual-branch multi-scale image defogging network model to obtain the generated clear fog-free image;
Preferably, during training the input images are randomly cropped to 256×256 and the dataset is expanded by scaling, random rotation and flipping; the Adam optimizer is used with default β1 and β2 of 0.9 and 0.99 respectively, an initial learning rate of 0.0001 and a batch size of 1, and the model is implemented in PyTorch on an NVIDIA V100 Tensor Core GPU.
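The training configuration can be sketched as follows; the model is a stand-in module, and the scaling and rotation augmentations are omitted for brevity:

```python
import torch
import torch.nn as nn

# Optimizer as described: Adam with beta1=0.9, beta2=0.99, lr=1e-4, batch size 1.
model = nn.Conv2d(3, 3, 3, padding=1)   # placeholder for the defogging network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.99))

def random_crop_flip(foggy, clear, size=256):
    """Aligned random 256x256 crop plus random horizontal flip for a
    foggy/clear image pair, so the supervision stays pixel-aligned."""
    _, h, w = foggy.shape
    top = torch.randint(0, h - size + 1, (1,)).item()
    left = torch.randint(0, w - size + 1, (1,)).item()
    foggy = foggy[:, top:top + size, left:left + size]
    clear = clear[:, top:top + size, left:left + size]
    if torch.rand(1).item() < 0.5:       # random horizontal flip
        foggy, clear = foggy.flip(-1), clear.flip(-1)
    return foggy, clear
```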
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.
Claims (7)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202411445292.XA CN119399068B (en) | 2024-10-16 | 2024-10-16 | Double-branch multi-scale image defogging method based on high-quality codebook |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202411445292.XA CN119399068B (en) | 2024-10-16 | 2024-10-16 | Double-branch multi-scale image defogging method based on high-quality codebook |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN119399068A CN119399068A (en) | 2025-02-07 |
| CN119399068B true CN119399068B (en) | 2025-11-07 |
Family
ID=94419698
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202411445292.XA Active CN119399068B (en) | 2024-10-16 | 2024-10-16 | Double-branch multi-scale image defogging method based on high-quality codebook |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN119399068B (en) |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| EP0739140A2 (en) * | 1995-04-18 | 1996-10-23 | Sun Microsystems, Inc. | Encoder for an end-to-end scalable video delivery system |
| CN115689932A (en) * | 2022-11-09 | 2023-02-03 | 重庆邮电大学 | Image defogging method based on deep neural network |
Family Cites Families (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US12511723B2 (en) * | 2023-02-28 | 2025-12-30 | Nanjing University Of Posts And Telecommunications | Single image dehazing method based on detail recovery |
| CN117788450A (en) * | 2023-12-29 | 2024-03-29 | 山东省计算中心(国家超级计算济南中心) | Remote sensing image change detection method and device based on Transformer and DCN |
| CN118411440A (en) * | 2024-04-24 | 2024-07-30 | 湖州师范学院 | Remote sensing image reconstruction method based on remote sensing image compression network |
- 2024-10-16 CN CN202411445292.XA patent/CN119399068B/en active Active
Patent Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| EP0739140A2 (en) * | 1995-04-18 | 1996-10-23 | Sun Microsystems, Inc. | Encoder for an end-to-end scalable video delivery system |
| CN115689932A (en) * | 2022-11-09 | 2023-02-03 | 重庆邮电大学 | Image defogging method based on deep neural network |
Also Published As
| Publication number | Publication date |
|---|---|
| CN119399068A (en) | 2025-02-07 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN113822969B (en) | Training neural radiation field model, face generation method, device and server | |
| CN118314353B (en) | Remote sensing image segmentation method based on double-branch multi-scale feature fusion | |
| Aakerberg et al. | Semantic segmentation guided real-world super-resolution | |
| CN114913599B (en) | Video abnormal behavior detection method and system based on automatic encoder | |
| CN114170088A (en) | Relational reinforcement learning system and method based on graph structure data | |
| CN115731597B (en) | Automatic segmentation and restoration management platform and method for mask image of face mask | |
| CN113808031A (en) | Image restoration method based on LSK-FNet model | |
| CN116137043B (en) | A colorization method for infrared images based on convolution and Transformer | |
| CN117710671A (en) | Medical image segmentation method based on segmentation large model fine adjustment | |
| CN114943894A (en) | ConvCRF-based high-resolution remote sensing image building extraction optimization method | |
| CN119992550B (en) | Image segmentation method, model, model training method and image segmentation system | |
| CN117011701B (en) | A remote sensing image feature extraction method based on hierarchical feature autonomous learning | |
| CN117315543B (en) | A semi-supervised video target segmentation method based on confidence-gated spatiotemporal memory networks | |
| CN112633234A (en) | Method, device, equipment and medium for training and applying face glasses-removing model | |
| CN116630369A (en) | UAV target tracking method based on spatio-temporal memory network | |
| Wang et al. | Self-prior guided pixel adversarial networks for blind image inpainting | |
| CN117252757A (en) | Hyperspectral image super-resolution method and system based on natural image prior | |
| CN117876679A (en) | A remote sensing image scene segmentation method based on convolutional neural network | |
| CN116704585A (en) | A Face Recognition Method Based on Quality Perception | |
| CN119399068B (en) | Double-branch multi-scale image defogging method based on high-quality codebook | |
| CN119625308A (en) | A semi-supervised video object segmentation method and system based on global and local feature fusion | |
| CN119963957A (en) | A multimodal image fusion method based on SwinTransformer | |
| CN117557473B (en) | A knowledge-guided multi-sensory attention network image dehazing method | |
| CN114764880B (en) | Remote sensing image scene classification method based on multi-component GAN reconstruction | |
| CN117292299A (en) | A video anomaly detection method based on optical flow decomposition and spatiotemporal feature learning |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||