CN117808707B - Multi-scale image defogging method, system, device and storage medium - Google Patents
- Publication number
- CN117808707B (application CN202311861387.5A)
- Authority
- CN
- China
- Prior art keywords
- layer
- pixel
- image
- module
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/52—Scale-space analysis, e.g. wavelet analysis
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/0985—Hyperparameter optimisation; Meta-learning; Learning-to-learn
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
- G06V10/443—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
- G06V10/449—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
- G06V10/451—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
- G06V10/454—Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Software Systems (AREA)
- Computing Systems (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Biomedical Technology (AREA)
- Multimedia (AREA)
- Life Sciences & Earth Sciences (AREA)
- Molecular Biology (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Mathematical Physics (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Biodiversity & Conservation Biology (AREA)
- Image Analysis (AREA)
- Image Processing (AREA)
Abstract
The invention discloses a multi-scale image defogging method, system, device and storage medium. The method comprises: acquiring an image to be defogged; and inputting the image to be defogged into a trained multi-scale image defogging network, which outputs the defogged image. The trained multi-scale image defogging network performs multi-scale feature extraction on the image to be defogged to obtain a multi-scale feature map; performs feature aggregation on the multi-scale feature map to obtain an aggregated feature map; applies gated enhancement to the aggregated feature map to obtain an enhanced feature map; and adds the enhanced feature map to the image to be defogged pixel by pixel to obtain the defogged image.
Description
Technical Field
The present invention relates to the field of image defogging technology, and in particular, to a multi-scale image defogging method, system, device and storage medium.
Background
The statements in this section merely relate to the background of the present disclosure and may not necessarily constitute prior art.
Image defogging is a pre-processing step for other visual tasks; it aims to remove fog from a given foggy image and restore a clear, fog-free scene. The atmosphere contains large numbers of floating particles such as smoke, dust and fog droplets, which scatter and absorb light and reduce the albedo of the scene. As a result, images captured by an imaging sensor in a foggy scene inevitably suffer from limited visibility, low color saturation and loss of detail. How to recover high-quality haze-free images with balanced brightness, abundant detail and clear edges from images degraded by fog, and thereby provide high-quality input for downstream computer vision tasks and systems, has therefore become a research hotspot in the field of computer vision.
In 1977, McCartney first described the formation of foggy images in detail and proposed an atmospheric degradation model based on an attenuation model and an ambient-light model; the formation process is as follows:
I(x)=J(x)t(x)+A(1-t(x)) (1)
where I(x) is the acquired low-quality foggy-day image, J(x) is the clear fog-free image, t(x) denotes the transmittance, A denotes the global atmospheric light, and x denotes the pixel coordinates;
The transmittance t(x) is affected by the depth of field and can be expressed as:
t(x) = e^(−β·d(x)) (2)
where d(x) denotes the depth-of-field distance between the object and the camera, and β denotes the light attenuation coefficient. The haze-free image can therefore be recovered by accurately estimating the atmospheric light value A and the transmittance t(x).
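To make the model concrete, the following Python sketch synthesizes a hazy image with Eqs. (1)-(2) and inverts Eq. (1) given estimates of t(x) and A. The function names and default values (A = 0.9, β = 1.0, the t_min clamp) are illustrative assumptions, not values taken from this patent.

```python
import numpy as np

def synthesize_hazy(J, depth, A=0.9, beta=1.0):
    """Degrade a clear image J (H x W x 3, values in [0, 1]) with haze.

    t(x) = exp(-beta * d(x))           -- Eq. (2)
    I(x) = J(x) t(x) + A (1 - t(x))    -- Eq. (1)
    """
    t = np.exp(-beta * depth)[..., None]   # per-pixel transmittance, (H, W, 1)
    return J * t + A * (1.0 - t)

def recover_clear(I, t, A, t_min=0.1):
    """Invert Eq. (1): J = (I - A(1 - t)) / t, clamping t to avoid blow-up."""
    t = np.clip(t, t_min, 1.0)[..., None]
    return np.clip((I - A * (1.0 - t)) / t, 0.0, 1.0)
```

In practice the difficulty lies precisely in estimating t(x) and A from the hazy image alone, which is what the prior-based and learning-based methods discussed below attempt.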
Based on this idea, researchers at home and abroad have explored a number of novel image defogging methods (i.e., model-driven methods) built on different models and prior knowledge, and have obtained good defogging performance. Although these methods produce images with good visibility, they may introduce artifacts in regions that do not satisfy the prior.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a multi-scale image defogging method, system, device and storage medium.
In one aspect, a multi-scale image defogging method is provided, comprising:
Acquiring an image to be defogged;
Inputting the image to be defogged into a trained multi-scale image defogging network and outputting the defogged image; the trained multi-scale image defogging network performs multi-scale feature extraction on the image to be defogged to obtain a multi-scale feature map; performs feature aggregation on the multi-scale feature map to obtain an aggregated feature map; applies gated enhancement to the aggregated feature map to obtain an enhanced feature map; and adds the enhanced feature map to the image to be defogged pixel by pixel to obtain the defogged image.
In another aspect, a multi-scale image defogging system is provided, comprising:
An acquisition module configured to: acquiring an image to be defogged;
A defogging module configured to: input the image to be defogged into a trained multi-scale image defogging network and output the defogged image; the trained multi-scale image defogging network performs multi-scale feature extraction on the image to be defogged to obtain a multi-scale feature map; performs feature aggregation on the multi-scale feature map to obtain an aggregated feature map; applies gated enhancement to the aggregated feature map to obtain an enhanced feature map; and adds the enhanced feature map to the image to be defogged pixel by pixel to obtain the defogged image.
In still another aspect, there is provided an electronic device including:
a memory for non-transitory storage of computer readable instructions; and
a processor for executing the computer-readable instructions,
wherein the computer-readable instructions, when executed by the processor, perform the method of the first aspect.
In yet another aspect, there is also provided a non-transitory storage medium storing computer-readable instructions, wherein, when the computer-readable instructions are executed by a computer, the method of the first aspect is performed.
In a further aspect, there is also provided a computer program product comprising a computer program which, when run on one or more processors, implements the method of the first aspect.
The technical scheme has the following advantages or beneficial effects:
The invention provides a local-Transformer-based multi-scale image defogging network, MIDNet, which comprehensively extracts image features at different levels and exploits both the local information within each window and the long-range relations among pixels to process foggy images with uniform or non-uniform haze. The model effectively reduces the space (memory) consumption of the original ViT, achieves simple yet efficient single-image defogging, and ensures visual consistency between the reconstructed image and the ground-truth image.
For the decoding process, the invention designs a top-down feature aggregation module based on dense connections, which aggregates feature information of different scales during decoding while also fusing features of the same scale from the encoding process, thereby achieving a marked enhancement of the features.
The invention designs a gating enhancement module that assigns pixel-level weights to enhance feature information such as edges and textures, thereby preserving the details of the reconstructed image.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.
FIG. 1 is a flow chart of a method according to a first embodiment;
FIG. 2 is a diagram showing the internal structure of a multi-scale image defogging network according to the first embodiment;
FIG. 3 is an internal structure diagram of the feature extraction module and the feature aggregation module of the first embodiment;
FIG. 4 is an internal structure diagram of the local Transformer layer of the first embodiment;
FIG. 5 is an internal structure diagram of the gating enhancement module of the first embodiment.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
In recent years, with advances in computer software and hardware, learning-based image defogging methods (i.e., data-driven methods) have emerged. These methods learn network parameters through specifically designed network architectures and paired foggy-day datasets so as to obtain better defogging performance. A few learning models based on convolutional neural networks (CNNs) estimate the parameters of the atmospheric degradation model in an end-to-end manner to recover the haze-free image: DehazeNet estimates the transmittance map t(x) with a CNN while suppressing noise, and AODNet jointly estimates the atmospheric light value A and the transmittance t(x), achieving good defogging performance. Most CNN-based methods, however, obtain the fog-free image by learning the mapping between fog-free and foggy images under the constraint of a loss function; because CNN-based methods have difficulty exploiting global information effectively, the defogging results they produce are often unsatisfactory. In particular, for non-uniform haze images, the above methods cannot remove the effect of haze well, owing to the irregular haze distribution and the differences in haze concentration.
Recently, transformers have shown great potential in the field of artificial intelligence applications. Initially, researchers applied transformers to the field of natural language processing (natural language processing, NLP) and achieved excellent results in this field. Thus, inspired by the above, researchers extended it to computer vision tasks, presented vision transducers (vision transformer, viT), and made breakthroughs in vision fields such as object detection and image deblurring. However, the original visual transducer grows in square with the increase of the spatial resolution of the input image, and if the original visual transducer is used for visual tasks such as image defogging, the efficiency of the original visual transducer is greatly reduced, and the actual industrial requirement cannot be met.
The present embodiment proposes a novel multi-scale image defogging network, MIDNet, which processes foggy images with uniform and non-uniform haze by comprehensively extracting and aggregating multi-source, multi-level features through local Transformers and dense connections. This is an attempt to use the Transformer for processing foggy images with both uniform and non-uniform haze.
Example 1
The embodiment provides a multi-scale image defogging method;
as shown in fig. 1, the multi-scale image defogging method includes:
s101: acquiring an image to be defogged;
S102: inputting the image to be defogged into a trained multi-scale image defogging network and outputting the defogged image; the trained multi-scale image defogging network performs multi-scale feature extraction on the image to be defogged to obtain a multi-scale feature map; performs feature aggregation on the multi-scale feature map to obtain an aggregated feature map; applies gated enhancement to the aggregated feature map to obtain an enhanced feature map; and adds the enhanced feature map to the image to be defogged pixel by pixel to obtain the defogged image.
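The four stages of S102 compose as follows. This PyTorch sketch is only a schematic assumption of the wiring: the class names, the 256-channel head and the premise that the enhanced map is restored to the input resolution are illustrative, and the individual modules are detailed in the sections below.

```python
import torch
import torch.nn as nn

class MIDNetSketch(nn.Module):
    def __init__(self, extractor: nn.Module, aggregator: nn.Module, gate: nn.Module):
        super().__init__()
        self.extractor = extractor    # multi-scale feature extraction
        self.aggregator = aggregator  # feature aggregation across scales
        self.gate = gate              # gated enhancement of the aggregated map
        # maps features back to RGB; assumes the enhanced map is at input resolution
        self.head = nn.Conv2d(256, 3, kernel_size=3, padding=1)

    def forward(self, hazy: torch.Tensor) -> torch.Tensor:
        feats = self.extractor(hazy)       # list of multi-scale feature maps
        agg = self.aggregator(feats)       # aggregated feature map
        enhanced = self.gate(agg)          # enhanced feature map
        return self.head(enhanced) + hazy  # pixel-by-pixel addition with the input
```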
Further, as shown in fig. 2, the trained multiscale image defogging network comprises:
an input layer, a feature extraction module, a feature aggregation module, a gating enhancement module, a pixel-by-pixel addition module and an output layer connected in sequence; the input end of the pixel-by-pixel addition module is also connected to the output end of the input layer.
As can be seen from FIG. 2, the multi-scale image defogging network is composed of a feature extraction module, a feature aggregation module and a gating enhancement module.
The multi-scale image defogging network first projects the input image I_i ∈ R^(3×H×W) onto embedding vectors of dimension d through a patch-embedding operation (d = 256 in the multi-scale image defogging network); each embedded image has the shape d × (H/P) × (W/P), where P is the patch size of the patch embedding (P = 4 in the multi-scale image defogging network). Next, the feature extraction module extracts multi-scale features and the feature aggregation module aggregates features of different scales (i.e., multi-source, multi-level features); finally, the gating enhancement module restores the edge information.
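A minimal sketch of the patch-embedding step under the stated settings (d = 256, P = 4); realizing it as a strided convolution is a common choice assumed here, not an implementation prescribed by the patent.

```python
import torch
import torch.nn as nn

# P = 4, d = 256: a stride-P convolution maps I in R^(3 x H x W) to a d x (H/P) x (W/P) map
patch_embed = nn.Conv2d(in_channels=3, out_channels=256, kernel_size=4, stride=4)

x = torch.randn(1, 3, 256, 256)   # dummy 256x256 RGB image
tokens = patch_embed(x)           # shape (1, 256, 64, 64) = (B, d, H/P, W/P)
```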
Further, as shown in fig. 3, the feature extraction module includes:
a patch-embedding layer, a first local Transformer layer, a second local Transformer layer, a third local Transformer layer and a fourth local Transformer layer connected in sequence;
the input of the patch-embedding layer is the input of the feature extraction module.
Further, the patch-embedding layer is configured to divide the image to be defogged into a plurality of patches of a set size and to represent each patch as a vector.
Further, the feature extraction module is used for realizing multi-scale feature extraction of the image to be defogged.
It should be appreciated that, since the original ViT performs multi-head attention over all spatial positions, its computational complexity for processing an image of size H×W is O = 2(HW)²·d. This complexity grows quadratically with the image resolution, so the original ViT is inefficient and occupies relatively more resources when processing high-resolution images. At the same time, single-scale feature representations have certain limitations. Therefore, during the encoding stage, the invention replaces all convolutional layers in the FPN encoder with local Transformer layers.
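The gap can be checked with a back-of-the-envelope computation. The figure 2·(HW)²·d for global attention is taken from the text; the counterpart 2·HW·M²·d for attention restricted to M×M windows is the standard estimate and, like the example sizes below, is an assumption here.

```python
def global_attn_flops(h, w, d):
    return 2 * (h * w) ** 2 * d      # every token attends to every token

def window_attn_flops(h, w, d, m=8):
    return 2 * (h * w) * m * m * d   # every token attends within its M x M window

h = w = 256
d = 256
print(f"global: {global_attn_flops(h, w, d):.3e}")  # ~2.2e+12
print(f"window: {window_attn_flops(h, w, d):.3e}")  # ~2.1e+09
```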
Further, as shown in fig. 3, the feature aggregation module includes:
a first upsampling layer, a first pixel-by-pixel addition module, a second upsampling layer, a second pixel-by-pixel addition module, a first feature cascade layer, a first convolution layer, a third upsampling layer, a third pixel-by-pixel addition module, a second feature cascade layer and a second convolution layer connected in sequence;
The input end of the first upsampling layer is connected to the output end of the fourth local Transformer layer through a third convolution layer; the input end of the first pixel-by-pixel addition module is connected to the output end of the third local Transformer layer through a fourth convolution layer; the input end of the second pixel-by-pixel addition module is connected to the output end of the second local Transformer layer through a fifth convolution layer; the input end of the third pixel-by-pixel addition module is connected to the output end of the first local Transformer layer through a sixth convolution layer;
The input end of the first feature cascade layer is connected to the output end of the third convolution layer through a fourth upsampling layer; the input end of the second feature cascade layer is connected to the output end of the third convolution layer through a fifth upsampling layer; the input end of the second feature cascade layer is also connected to the output end of the first pixel-by-pixel addition module through a sixth upsampling layer;
the output of the second convolution layer is the output of the feature aggregation module.
Further, the first feature cascade layer and the second feature cascade layer have the same function: both concatenate their input features along the channel dimension.
Further, the first, second and third upsampling layers are each configured to perform double upsampling.
Further, the fourth upsampling layer is used for realizing four times upsampling, the fifth upsampling layer is used for realizing eight times upsampling, and the sixth upsampling layer is used for realizing four times upsampling.
Further, the feature aggregation module is used for realizing aggregation of features with different scales.
It should be understood that, for feature aggregation, the invention adopts a global strategy and adds dense-connection operations on top of the original FPN feature aggregation scheme. The addition operation loses some of the original feature information during fusion, whereas the concatenation operation is, strictly speaking, lossless. Dense connection passes the feature information of each layer to the following layers through concatenation, aggregating features of different levels and enabling feature reuse across scales. The feature aggregation module designed by the invention therefore fuses features of the same level from the encoding process and uses dense connections to aggregate features of different levels top-down during decoding. Through this global strategy, multi-source, multi-level feature information can be aggregated more comprehensively, and a clearer defogged image can thus be reconstructed.
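The following PyTorch sketch illustrates such top-down aggregation with dense connections on four encoder levels. The channel width, number of levels and fusion convolution are illustrative assumptions; the patented module shown in FIG. 3 is wired in more detail than this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAggregatorSketch(nn.Module):
    def __init__(self, ch: int = 256):
        super().__init__()
        # 1x1 convolutions projecting each encoder level before fusion
        self.lateral = nn.ModuleList([nn.Conv2d(ch, ch, 1) for _ in range(4)])
        self.fuse = nn.Conv2d(3 * ch, ch, 3, padding=1)  # applied after the dense concat

    def forward(self, feats):
        # feats: [c1, c2, c3, c4] from finest (largest) to coarsest (smallest)
        c1, c2, c3, c4 = [l(f) for l, f in zip(self.lateral, feats)]
        p4 = c4
        p3 = c3 + F.interpolate(p4, scale_factor=2)   # top-down add, 2x upsampling
        p2 = c2 + F.interpolate(p3, scale_factor=2)
        p1 = c1 + F.interpolate(p2, scale_factor=2)
        # dense connection: concatenate the upsampled coarser levels with p1
        dense = torch.cat([p1,
                           F.interpolate(p2, scale_factor=2),
                           F.interpolate(p3, scale_factor=4)], dim=1)
        return self.fuse(dense)
```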
Further, as shown in FIG. 4, the internal structures of the first local Transformer layer, the second local Transformer layer, the third local Transformer layer and the fourth local Transformer layer are the same, and the first local Transformer layer comprises:
The system comprises a first normalization operation layer, a window-based multi-head self-attention mechanism layer, a fourth pixel-by-pixel addition module, a second normalization operation layer, a first multi-layer perceptron, a fifth pixel-by-pixel addition module, a third normalization operation layer, a moving window-based multi-head self-attention mechanism layer, a sixth pixel-by-pixel addition module, a fourth normalization operation module, a second multi-layer perceptron and a seventh pixel-by-pixel addition module which are sequentially connected in series;
The input end of the fourth pixel-by-pixel adding module is also connected with the input end of the first normalization operation layer; the input end of the fifth pixel-by-pixel adding module is also connected with the input end of the second normalization operation layer; the input end of the sixth pixel-by-pixel adding module is also connected with the input end of the third normalization operation layer; the input end of the seventh pixel-by-pixel adding module is also connected with the input end of the fourth normalization operation layer.
Further, the window-based multi-head self-attention layer W-MSA (window-based multi-head self-attention) has a network structure comprising sequentially connected window partitioning, multi-head self-attention and concatenation modules.
Further, the working process of the window-based multi-head self-attention layer W-MSA is as follows: first, the window partitioning operation divides the input sequence into several fixed-size windows; second, the sub-sequence within each window is fed to multiple attention heads for attention computation, where each pixel can take inner products only with the other pixels in its current window to gather information; finally, the results computed by the attention heads are concatenated to obtain the final attention representation.
Further, the shifted-window-based multi-head self-attention layer SW-MSA (shifted-window multi-head self-attention) has a network structure comprising window partitioning, multi-head self-attention, cross-window self-attention and concatenation.
Further, the working process of the shifted-window-based multi-head self-attention layer SW-MSA is as follows: first, the window partitioning operation divides the input sequence into several fixed-size windows; second, the sub-sequence within each window is fed to multiple self-attention heads for attention computation; then, cross-window self-attention performs a weighted summation of the self-attention representation in each window with those in the other windows and computes self-attention again to obtain cross-window self-attention representations; finally, all cross-window self-attention representations are concatenated to obtain the final attention representation.
Further, the first to seventh pixel-by-pixel addition modules each add the pixel values at corresponding positions of two input feature maps, the sums serving as the pixel values of the corresponding positions of the output feature map.
Further, the first multi-layer perceptron and the second multi-layer perceptron each apply a nonlinear transformation to the input vector.
In the local Transformer layer, the feature map is divided into several disjoint window regions, and self-attention is performed within each window to capture the local information inside that window. Meanwhile, shifted windows are used to establish connections between local windows and to capture the long-range relations among pixels, so that the features of the input foggy image are extracted quickly and comprehensively, providing a strong feature basis for processing foggy images with uniform or non-uniform haze.
The local Transformer layer consists of a multi-layer perceptron (MLP), layer normalization (LayerNorm, LN), residual connections and local-window-based multi-head self-attention (MHSA).
The local Transformer layer performs self-attention within local windows to keep the computational cost linear, which is more efficient and consumes fewer resources.
The local Transformer layer works as follows: given an input feature map X, a linear layer projects X onto the self-attention Query (Q), Key (K) and Value (V) matrices, and the tokens are grouped by window partitioning; the local Transformer layer applies multi-head attention within each window, and the window partitions of adjacent blocks differ. Self-attention can thus be computed as:
Attention(Q, K, V) = SoftMax(QK^T/√d + B)·V (3)
where d denotes the number of channels, B is the relative position bias term, and SoftMax is the normalization function. A linear layer then projects the attention output.
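A minimal single-head sketch of window attention implementing SoftMax(QK^T/√d + B)V within disjoint windows follows. The single head, the simplified learnable bias table standing in for B and the window size are assumptions for illustration; the patent's layer is multi-head and is paired with the shifted-window variant described above.

```python
import torch
import torch.nn as nn

class WindowAttentionSketch(nn.Module):
    def __init__(self, dim: int = 256, window: int = 8):
        super().__init__()
        self.m = window
        self.qkv = nn.Linear(dim, dim * 3)   # project X to Q, K and V
        self.proj = nn.Linear(dim, dim)      # output projection
        # simplified learnable bias standing in for the relative position term B
        self.bias = nn.Parameter(torch.zeros(window * window, window * window))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape                 # assumes h and w divisible by the window size
        m = self.m
        # partition the map into disjoint m x m windows: (b * nWindows, m*m, c)
        x = x.view(b, c, h // m, m, w // m, m)
        x = x.permute(0, 2, 4, 3, 5, 1).reshape(-1, m * m, c)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = (q @ k.transpose(-2, -1)) / (c ** 0.5) + self.bias
        out = self.proj(attn.softmax(dim=-1) @ v)   # SoftMax(QK^T/sqrt(d) + B)V
        # merge the windows back into the (b, c, h, w) layout
        out = out.view(b, h // m, w // m, m, m, c)
        return out.permute(0, 5, 1, 3, 2, 4).reshape(b, c, h, w)
```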
Further, as shown in fig. 5, the gating enhancement module includes:
a fifth normalization operation layer, a seventh convolution layer, a first activation function layer, an eighth convolution layer, a sixth normalization operation layer, a second activation function layer, a gating unit and an eighth pixel-by-pixel addition module connected in sequence;
The input end of the gating unit is also connected to the input end of the fifth normalization operation layer; the input end of the eighth pixel-by-pixel addition module is also connected to the input end of the fifth normalization operation layer.
Further, the working process of the gating enhancement module is as follows: the fifth normalization operation layer normalizes the input features; the seventh convolution layer applies a linear transformation to the output features of the fifth normalization layer; the first activation function layer retains the positive values output by the seventh convolution layer; the eighth convolution layer applies a linear transformation to the output features of the first activation function layer; the sixth normalization operation layer normalizes the output of the eighth convolution layer; the second activation function layer maps the output of the sixth normalization layer to gating values between 0 and 1; the gating unit selectively retains or suppresses information in the feature map according to the gating values; and the eighth pixel-by-pixel addition module adds the output of the gating unit to the input of the fifth normalization operation layer pixel by pixel to obtain the final output.
Further, as shown in FIG. 5, the gating enhancement module is configured to enhance detail information such as edges in the image.
It should be understood that details such as contours and edges are important structural information of an image. Therefore, to obtain a defogged image with rich details and clear edges, the invention designs a gating enhancement module at the end of the decoding process, which enhances details such as edges by assigning them greater weights through pixel-by-pixel multiplication.
As shown in FIG. 5, the gating enhancement module performs a series of 1×1 convolution, ReLU nonlinear activation and Norm normalization operations on the input feature map, after which a weight map is generated by a Sigmoid activation function layer.
The weight map is then multiplied element-wise with the input feature map (i.e., the output of the feature aggregation module), and the input feature map is fused through the residual design and skip connection so as to focus on detail information such as edges, thereby recovering a defogged image with rich details and clear edges.
The formula of the enhancement unit is expressed as follows:
X̂ = X ⊗ Sigmoid(Norm(Conv(ReLU(Conv(Norm(X)))))) + X (4)
where X and X̂ denote the input and output feature maps respectively, ⊗ denotes pixel-by-pixel multiplication, Sigmoid is the Sigmoid activation function, Norm is batch normalization, ReLU denotes ReLU nonlinear activation, and Conv is a 1×1 convolution.
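A sketch of this enhancement unit following the working steps and Eq. (4) above; the use of BatchNorm2d for Norm and the channel width are assumptions consistent with the text.

```python
import torch
import torch.nn as nn

class GatedEnhanceSketch(nn.Module):
    def __init__(self, ch: int = 256):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(ch),                # Norm
            nn.Conv2d(ch, ch, kernel_size=1),  # 1x1 Conv
            nn.ReLU(inplace=True),             # keep positive responses
            nn.Conv2d(ch, ch, kernel_size=1),  # 1x1 Conv
            nn.BatchNorm2d(ch),                # Norm
            nn.Sigmoid(),                      # per-pixel gating values in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate = self.body(x)   # weight map
        return x * gate + x   # gate the input, then residual skip connection
```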
It should be appreciated that the invention proposes a local-Transformer-based multi-scale image defogging network called MIDNet (Multi-scale Image Dehazing Network based on local Transformer). This is an attempt to use the Transformer for processing foggy images with both uniform and non-uniform haze.
During the encoding process, unlike CNNs, which focus only on local features, the multi-scale image defogging network uses a local-Transformer-based multi-scale feature extractor that exploits the local information within local windows and the Transformer's long-range relations among pixels to extract features of different levels more comprehensively, so that it can process foggy images with both uniform and non-uniform haze. Compared with previous methods, the multi-scale image defogging network obtains global information while maintaining the efficiency of feature extraction.
During the decoding process, the multi-scale image defogging network combines a top-down pyramid structure with dense connections (DC); it not only fuses features of the same scale from the encoding process but also better fuses features of different scales during decoding, achieving comprehensive fusion and enhancement of multi-source (encoding and decoding) and multi-level (different scales) features.
At the end of the network, the multi-scale image defogging network assigns greater weights to details such as edges through the gating enhancement module so as to obtain defogged images rich in detail. The invention conducted extensive experiments on the RESIDE, I-HAZE, O-HAZE, NH-HAZE and NTIRE public datasets; compared with the state-of-the-art (SOTA) methods, the proposed MIDNet model achieves better defogging performance.
Further, the training process of the trained multi-scale image defogging network comprises the following steps:
constructing a training set, where the training set consists of original foggy images whose defogged (ground-truth) images are known;
inputting the training set into the multi-scale image defogging network and training the network; training stops when the total loss function value of the network no longer decreases, yielding the trained multi-scale image defogging network.
Further, the total loss function is expressed as:
L = ω1·L_smooth-L1 + ω2·L_vgg + ω3·L_ms-ssim (5)
where ω1, ω2 and ω3 are hyper-parameters, L denotes the total loss function, L_smooth-L1 denotes the smooth L1 loss function, L_vgg denotes the perceptual loss function, and L_ms-ssim denotes the multi-scale structural similarity loss function.
Further, the smooth L1 loss function is expressed as:
L_smooth-L1 = (1/N)·Σ_i f(Ŷ_i − Y_i), with f(e) = 0.5e² for |e| < 1 and |e| − 0.5 otherwise (6)
where i indexes the pixels, N is the total number of pixels, Ŷ is the defogged image, and Y is the ground-truth image.
It should be appreciated that many image restoration tasks trained with the L1 loss achieve better performance in terms of PSNR and SSIM than those trained with the L2 loss, and the smooth L1 loss converges quickly with relatively smooth gradient changes. The invention therefore adopts the smooth L1 loss to ensure that the predicted image is close to the real image.
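In PyTorch the smooth L1 term of Eq. (6) is available directly as a built-in loss that averages over all pixels:

```python
import torch
import torch.nn as nn

smooth_l1 = nn.SmoothL1Loss()        # mean reduction over all N pixels by default
y_hat = torch.rand(1, 3, 256, 256)   # dummy defogged prediction
y = torch.rand(1, 3, 256, 256)       # dummy ground-truth image
loss = smooth_l1(y_hat, y)
```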
Further, the perceptual loss function is expressed as:
L_vgg = Σ_i (1/(C_i·H_i·W_i))·‖φ_i(Ŷ) − φ_i(Y)‖₁ (7)
where Ŷ and Y denote the defogged image and the ground-truth image respectively, C_i denotes the number of channels, and H_i and W_i are the height and width of the i-th feature map; φ_i denotes the feature map of size C_i×H_i×W_i extracted by VGG-16 pre-trained on ImageNet.
It will be appreciated that in order to maintain perceptual and semantic fidelity, and better reconstruct and recover detailed information, the present invention exploits perceptual loss to provide additional supervision in the high-level feature space to measure high-level feature differences between blurred images and their corresponding defogged images.
Further, the multi-scale structural similarity loss function is expressed as:
L_ms-ssim = 1 − Π_{m=1..M} SSIM_m(Ŷ, Y) (8)
where Ŷ and Y denote the defogged image and the ground-truth image respectively, m = 1, …, M indexes the different scales, and SSIM(Ŷ, Y) is the structural similarity, which takes human visual perception (luminance, contrast, structure, etc.) into account;
The expression of the structural similarity is:
SSIM(Ŷ, Y) = ((2·μ_Ŷ·μ_Y + C1)·(2·σ_ŶY + C2)) / ((μ_Ŷ² + μ_Y² + C1)·(σ_Ŷ² + σ_Y² + C2)) (9)
where Gaussian filtering is applied to Ŷ and Y, the means of the filtered results are μ_Ŷ and μ_Y, the standard deviations are σ_Ŷ and σ_Y, and the covariance is σ_ŶY; C1 and C2 are constants for maintaining numerical stability.
It should be appreciated that the multi-scale structural similarity loss considers both human visual perception and resolution (multiple scales), and the value range of SSIM is [0, 1]. To preserve the structure of the defogged image, the invention uses the multi-scale structural similarity loss to measure the structural similarity between the defogged image and the ground-truth image.
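A compact sketch of a single-scale SSIM term and the weighted total of Eq. (5) follows. The uniform 11×11 window (in place of the Gaussian filter described above), the single-scale simplification of the MS-SSIM term, the stand-in `perceptual` callable for the VGG-16 loss, and the ω values are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def ssim(y_hat, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Single-scale SSIM of Eq. (9), with a uniform 11x11 window."""
    pool = lambda t: F.avg_pool2d(t, kernel_size=11, stride=1, padding=5)
    mu_p, mu_t = pool(y_hat), pool(y)
    var_p = pool(y_hat * y_hat) - mu_p ** 2   # local variances
    var_t = pool(y * y) - mu_t ** 2
    cov = pool(y_hat * y) - mu_p * mu_t       # local covariance
    num = (2 * mu_p * mu_t + c1) * (2 * cov + c2)
    den = (mu_p ** 2 + mu_t ** 2 + c1) * (var_p + var_t + c2)
    return (num / den).mean()

def total_loss(y_hat, y, perceptual, w1=1.0, w2=0.04, w3=0.5):
    """L = w1*L_smooth-L1 + w2*L_vgg + w3*L_ms-ssim, cf. Eq. (5); weights are placeholders."""
    return (w1 * F.smooth_l1_loss(y_hat, y)
            + w2 * perceptual(y_hat, y)
            + w3 * (1.0 - ssim(y_hat, y)))
```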
Example two
The embodiment provides a multi-scale image defogging system;
A multi-scale image defogging system, comprising:
An acquisition module configured to: acquiring an image to be defogged;
A defogging module configured to: input the image to be defogged into a trained multi-scale image defogging network and output the defogged image; the trained multi-scale image defogging network performs multi-scale feature extraction on the image to be defogged to obtain a multi-scale feature map; performs feature aggregation on the multi-scale feature map to obtain an aggregated feature map; applies gated enhancement to the aggregated feature map to obtain an enhanced feature map; and adds the enhanced feature map to the image to be defogged pixel by pixel to obtain the defogged image.
Here, it should be noted that the acquisition module and the defogging module described above correspond to steps S101 to S102 of the first embodiment; the examples and application scenarios implemented by these modules are the same as those of the corresponding steps, but are not limited to what is disclosed in the first embodiment. It should also be noted that the above modules may be implemented, as part of a system, in a computer system such as a set of computer-executable instructions.
The embodiments are described in a progressive manner; for details of one embodiment, reference may be made to the related description of another embodiment.
The proposed system may be implemented in other ways. For example, the system embodiments described above are merely illustrative: the division into the above modules is merely a logical functional division, and there may be other divisions in actual implementation; for instance, multiple modules may be combined or integrated into another system, or some features may be omitted or not performed.
Example III
The embodiment also provides an electronic device, including: one or more processors, one or more memories, and one or more computer programs; wherein the processor is coupled to the memory, the one or more computer programs being stored in the memory, the processor executing the one or more computer programs stored in the memory when the electronic device is running, to cause the electronic device to perform the method of the first embodiment.
It should be understood that in this embodiment the processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), an off-the-shelf field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or any conventional processor.
The memory may include read only memory and random access memory and provide instructions and data to the processor, and a portion of the memory may also include non-volatile random access memory. For example, the memory may also store information of the device type.
In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in a processor or by instructions in the form of software.
The method of the first embodiment may be executed directly by a hardware processor, or by a combination of hardware and software modules in the processor. The software modules may be located in a storage medium well known in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory or registers. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above method in combination with its hardware. To avoid repetition, a detailed description is not provided here.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
Example IV
The present embodiment also provides a computer-readable storage medium storing computer instructions that, when executed by a processor, perform the method of embodiment one.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (6)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202311861387.5A CN117808707B (en) | 2023-12-28 | 2023-12-28 | Multi-scale image defogging method, system, device and storage medium |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202311861387.5A CN117808707B (en) | 2023-12-28 | 2023-12-28 | Multi-scale image defogging method, system, device and storage medium |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN117808707A CN117808707A (en) | 2024-04-02 |
| CN117808707B true CN117808707B (en) | 2024-08-02 |
Family
ID=90419805
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202311861387.5A Active CN117808707B (en) | 2023-12-28 | 2023-12-28 | Multi-scale image defogging method, system, device and storage medium |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN117808707B (en) |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN121010595A (en) * | 2025-10-27 | 2025-11-25 | 山东浪潮智慧建筑科技有限公司 | A method, equipment, and medium for fog detection and defogging in a security system. |
Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN116342868A (en) * | 2023-03-22 | 2023-06-27 | 西安电子科技大学 | Small target detection method based on multi-scale feature compensation and gating enhancement |
Family Cites Families (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN113450273B (en) * | 2021-06-18 | 2022-10-14 | 暨南大学 | A method and system for image dehazing based on multi-scale and multi-stage neural network |
| CN114120253B (en) * | 2021-10-29 | 2023-11-14 | 北京百度网讯科技有限公司 | Image processing method, device, electronic equipment and storage medium |
| CN114049274B (en) * | 2021-11-13 | 2024-08-27 | 哈尔滨理工大学 | Defogging method for single image |
| CN114202696B (en) * | 2021-12-15 | 2023-01-24 | 安徽大学 | SAR target detection method and device based on context vision and storage medium |
| CN116051428B (en) * | 2023-03-31 | 2023-07-21 | 南京大学 | A low-light image enhancement method based on joint denoising and super-resolution of deep learning |
| CN117237608A (en) * | 2023-09-18 | 2023-12-15 | 江苏智能无人装备产业创新中心有限公司 | A multi-scale fog scene target detection method and system based on deep learning |
- 2023-12-28 CN CN202311861387.5A patent/CN117808707B/en active Active
Patent Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN116342868A (en) * | 2023-03-22 | 2023-06-27 | 西安电子科技大学 | Small target detection method based on multi-scale feature compensation and gating enhancement |
Non-Patent Citations (1)
| Title |
|---|
| Li, S.; Yuan, Q.; Zhang, Y.; Lv, B.; Wei, F. Image Dehazing Algorithm Based on Deep Learning Coupled Local and Global Features. Appl. Sci. 2022, pp. 1-14. * |
Also Published As
| Publication number | Publication date |
|---|---|
| CN117808707A (en) | 2024-04-02 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN116797488B (en) | A Low-Light Image Enhancement Method Based on Feature Fusion and Attention Embedding | |
| Tu et al. | SWCGAN: Generative adversarial network combining swin transformer and CNN for remote sensing image super-resolution | |
| CN111951195B (en) | Image enhancement method and device | |
| Cherabier et al. | Learning priors for semantic 3d reconstruction | |
| CN113378775B (en) | Video shadow detection and elimination method based on deep learning | |
| CN116977208B (en) | Dual-branch fusion low-light image enhancement method | |
| CN113065645B (en) | Twin attention network, image processing method and device | |
| CN108510451B (en) | Method for reconstructing license plate based on double-layer convolutional neural network | |
| CN113947538B (en) | A multi-scale efficient convolutional self-attention single image rain removal method | |
| CN111091503A (en) | Image defocus blur method based on deep learning | |
| CN110503613A (en) | Single Image-Oriented Rain Removal Method Based on Cascaded Atrous Convolutional Neural Network | |
| CN115272438A (en) | High-precision monocular depth estimation system and method for three-dimensional scene reconstruction | |
| CN116645569B (en) | A method and system for colorizing infrared images based on generative adversarial networks | |
| CN111079764A (en) | Low-illumination license plate image recognition method and device based on deep learning | |
| WO2024002211A1 (en) | Image processing method and related apparatus | |
| CN110349087A (en) | RGB-D image superior quality grid generation method based on adaptability convolution | |
| CN117808706B (en) | Video rain removing method, system, equipment and storage medium | |
| Wang et al. | Multi-focus image fusion framework based on transformer and feedback mechanism | |
| CN117726544A (en) | An image deblurring method and system for complex motion scenes | |
| Ali et al. | Boundary-constrained robust regularization for single image dehazing | |
| CN117808707B (en) | Multi-scale image defogging method, system, device and storage medium | |
| Bai et al. | CEPDNet: a fast CNN-based image denoising network using edge computing platform | |
| CN115953312A (en) | A joint defogging detection method, device and storage medium based on a single image | |
| Zhao et al. | End‐to‐End Retinex‐Based Illumination Attention Low‐Light Enhancement Network for Autonomous Driving at Night | |
| Zhou et al. | Restoration of laser interference image based on large scale deep learning |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |