
CN115018750B - Medium-wave infrared hyperspectral and multispectral image fusion method, system and medium - Google Patents


Info

Publication number
CN115018750B
CN115018750B (application CN202210941183.1A)
Authority
CN
China
Prior art keywords: image, mid-wave infrared, layer, infrared hyperspectral
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210941183.1A
Other languages
Chinese (zh)
Other versions
CN115018750A (en)
Inventor
李树涛
冯辰果
刘海波
佃仁伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University
Original Assignee
Hunan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University
Priority to CN202210941183.1A
Publication of CN115018750A
Application granted
Publication of CN115018750B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/803 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of input or preprocessed data
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10032 Satellite or aerial image; Remote sensing
    • G06T2207/10036 Multispectral image; Hyperspectral image
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10048 Infrared image
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a mid-wave infrared hyperspectral and multispectral image fusion method, system and medium. The method comprises: spatially up-sampling an input mid-wave infrared hyperspectral image Y to obtain an up-sampled mid-wave infrared hyperspectral image YU; concatenating the input mid-wave infrared multispectral image Z with the up-sampled image YU along the spectral dimension to obtain an image block C; extracting the residual image Xres of the image block C; and adding the residual image Xres and the up-sampled image YU pixel-wise at corresponding positions to obtain the fused mid-wave infrared hyperspectral image X. The invention can effectively fuse a low-resolution mid-wave infrared hyperspectral image with a high-resolution mid-wave infrared multispectral image to obtain a high-resolution mid-wave infrared hyperspectral image, and offers high reconstruction accuracy, high computational efficiency, and strong generality and robustness.

Description

Mid-wave infrared hyperspectral and multispectral image fusion method, system and medium

Technical Field

The invention relates to mid-wave infrared hyperspectral and mid-wave infrared multispectral image fusion technology, and in particular to a mid-wave infrared hyperspectral and multispectral image fusion method, system and medium.

Background Art

Because traditional optical images such as panchromatic and RGB images have only low spectral resolution, research on the effectiveness of target recognition and classification has reached a bottleneck. To solve the "inaccurate observation" problem of traditional optical images, hyperspectral imaging technology emerged. Hyperspectral remote sensing images feature a continuous spectrum, high spectral resolution, rich spectral information, and integration of image and spectrum, which greatly improves the accuracy and reliability of related image application technologies.

Compared with the visible and short-wave infrared bands, hyperspectral remote sensing in the mid-wave infrared band has obvious advantages. Current thermal infrared imaging technology can effectively convert thermal radiation into spectral images visible to the human eye, helping to identify ground objects and distinguish targets more effectively. In addition, mid-wave infrared hyperspectral imaging offers day-and-night monitoring capability; it can detect chemical gases, identify ground objects, and detect vehicle exhaust, and it is widely applicable to forest-fire monitoring, drought monitoring, urban heat-island analysis, mineral prospecting, and geothermal exploration. However, due to the limitations of imaging hardware and optical principles, the spatial resolution and spectral resolution of mid-wave infrared hyperspectral images constrain each other: mid-wave infrared images with high spectral resolution often have low spatial resolution, which reduces the potential application value of mid-wave infrared hyperspectral images. Fusing a low-spatial-resolution mid-wave infrared hyperspectral image with a high-spatial-resolution mid-wave infrared multispectral image of the same scene is an effective way to obtain a high-resolution mid-wave infrared hyperspectral image, so efficient and accurate mid-wave infrared hyperspectral and multispectral image fusion methods are highly necessary.

To address the key problem of the low spatial resolution of mid-wave infrared hyperspectral images, scholars at home and abroad have proposed a large number of fusion methods for mid-wave infrared hyperspectral and multispectral images. In general, these methods fall into four categories: pansharpening-based, matrix-factorization-based, tensor-representation-based, and deep-learning-based methods. Pansharpening-based methods offer high computational efficiency and a small computational load, but when the spatial resolutions of the mid-wave infrared multispectral and hyperspectral images differ greatly, the resulting fused image often exhibits large distortions. Matrix-factorization-based fusion methods achieve high fusion accuracy, but solving the complex optimization problems involved entails a large computational load and low computational efficiency. Tensor-decomposition methods also achieve high fusion accuracy but, like matrix factorization, require solving complex optimization problems at high computational cost and cannot meet fusion requirements when faced with massive image data. When the training data set is sufficiently large, fusion methods based on deep convolutional neural networks generally achieve excellent fusion performance; however, because the receptive field of a convolution kernel is limited, CNN-based mid-wave infrared hyperspectral image fusion methods consider only the relationships among local neighborhood pixels and ignore the global relationships in the feature maps. As the network deepens, the spatial structure information of the original mid-wave infrared hyperspectral image is gradually lost, which leaves room for further improvement of CNN-based mid-wave infrared hyperspectral and multispectral image fusion methods.

Summary of the Invention

The technical problem to be solved by the invention: in view of the above problems of the prior art, the invention provides a mid-wave infrared hyperspectral and multispectral image fusion method, system and medium that can effectively fuse a low-resolution mid-wave infrared hyperspectral image with a high-resolution mid-wave infrared multispectral image to obtain a high-resolution mid-wave infrared hyperspectral image, with high reconstruction accuracy, high computational efficiency, and strong generality and robustness.

To solve the above technical problem, the invention adopts the following technical solution:

A mid-wave infrared hyperspectral and multispectral image fusion method, comprising:

S1: spatially up-sampling an input mid-wave infrared hyperspectral image Y to obtain an up-sampled mid-wave infrared hyperspectral image YU;

S2: concatenating the input mid-wave infrared multispectral image Z with the up-sampled mid-wave infrared hyperspectral image YU along the spectral dimension to obtain an image block C;

S3: extracting the residual image Xres of the image block C;

S4: adding the residual image Xres and the up-sampled mid-wave infrared hyperspectral image YU pixel-wise at corresponding positions to obtain the fused mid-wave infrared hyperspectral image X.

Optionally, the spatial up-sampling of the input mid-wave infrared hyperspectral image Y in step S1 means spatially up-sampling the input image Y by bicubic interpolation to obtain the up-sampled mid-wave infrared hyperspectral image YU.

Optionally, extracting the corresponding residual image Xres in step S3 is performed by a pre-trained fusion network based on a self-attention mechanism. The fusion network consists of an interconnected encoder and decoder. The encoder comprises N cascaded patch merging layers that perform down-sampling, and the decoder comprises N cascaded patch expanding layers that perform up-sampling; the patch merging layers in the encoder and the patch expanding layers in the decoder are equal in number and correspond one-to-one. Swin Transformer blocks for extracting global features are connected in series between any adjacent patch merging layers, between adjacent patch merging and patch expanding layers, and between adjacent patch expanding layers. Skip connections are provided between the first N-1 patch merging layers of the encoder and the corresponding patch expanding layers: the down-sampled feature map is concatenated with the corresponding up-sampled feature map along the channel direction, and a fully connected layer then adjusts the channel dimension of the concatenated feature map so that the channel dimension is unchanged.

Optionally, a convolutional layer is connected after each Swin Transformer block, and the convolutional layer is used to introduce the inductive bias of the convolutional structure into the Swin Transformer block.

Optionally, the kernel size of the convolutional layer is 3×3.

Optionally, each Swin Transformer block is further followed, after the convolutional layer, by a residual module. The residual module takes the difference between the input of the Swin Transformer block and the output of the corresponding convolutional layer and outputs the result to the next patch merging layer or patch expanding layer.

Optionally, the encoder comprises three cascaded patch merging layers performing down-sampling, and the decoder comprises three cascaded patch expanding layers performing up-sampling.

Optionally, the size of the mid-wave infrared hyperspectral image Y is W/16*H/16*31, the size of the up-sampled mid-wave infrared hyperspectral image YU is W*H*31, the size of the mid-wave infrared multispectral image Z is W*H*3, and the size of the image block C is W*H*34. The feature map output by the first patch merging layer is of size W/4*H/4*96, that of the second W/8*H/8*192, and that of the third W/16*H/16*384. After 2x up-sampling, the first patch expanding layer outputs a feature map of size W/8*H/8*192; after 2x up-sampling, the second outputs W/4*H/4*96; after 4x up-sampling in the third, a fully connected layer restores the feature dimension to 31 spectral bands, yielding a residual image Xres of size W*H*31, where W is the width of the residual image Xres and H is its height.
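The dimension bookkeeping above can be checked with a few lines of Python (W = H = 64 is an assumed example size; any W and H divisible by 16 work the same way):

```python
# Shape walkthrough for the sizes stated above, with an assumed W = H = 64.
W, H = 64, 64

Y  = (W // 16, H // 16, 31)   # input MWIR hyperspectral image
YU = (W, H, 31)               # after 16x spatial up-sampling
Z  = (W, H, 3)                # MWIR multispectral image
C  = (W, H, 31 + 3)           # spectral concatenation of Z and YU

enc1 = (W // 4,  H // 4,  96)    # patch merging layer 1 (4x down)
enc2 = (W // 8,  H // 8,  192)   # patch merging layer 2 (2x down)
enc3 = (W // 16, H // 16, 384)   # patch merging layer 3 (2x down)

dec1 = (enc3[0] * 2, enc3[1] * 2, enc3[2] // 2)   # 2x up -> W/8 x H/8 x 192
dec2 = (dec1[0] * 2, dec1[1] * 2, dec1[2] // 2)   # 2x up -> W/4 x H/4 x 96
Xres = (dec2[0] * 4, dec2[1] * 4, 31)             # 4x up + FC back to 31 bands

# decoder stages mirror the encoder, and the residual matches YU
assert dec1 == enc2 and dec2 == enc1
assert Xres == YU
print(Y, C, enc3, Xres)
```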

In addition, the invention provides a mid-wave infrared hyperspectral and multispectral image fusion system, comprising an interconnected microprocessor and memory, the microprocessor being programmed or configured to execute the steps of the mid-wave infrared hyperspectral and multispectral image fusion method.

In addition, the invention provides a computer-readable storage medium storing a computer program, the computer program being used to be programmed or configured by a microprocessor to execute the steps of the mid-wave infrared hyperspectral and multispectral image fusion method.

Compared with the prior art, the invention mainly has the following advantages:

1. The invention comprises spatially up-sampling an input mid-wave infrared hyperspectral image Y to obtain an up-sampled mid-wave infrared hyperspectral image YU; concatenating the input mid-wave infrared multispectral image Z with YU along the spectral dimension to obtain an image block C; extracting the residual image Xres of C; and adding Xres and YU pixel-wise at corresponding positions to obtain the fused mid-wave infrared hyperspectral image X. The invention can effectively fuse a low-resolution mid-wave infrared hyperspectral image with a high-resolution mid-wave infrared multispectral image to obtain a high-resolution mid-wave infrared hyperspectral image, with high reconstruction accuracy, high computational efficiency, and strong generality and robustness.

2. When fusing mid-wave infrared hyperspectral and multispectral images of different types (different scenes, image acquisition devices, or acquisition parameters, etc.), the invention does not require any change to the network structure; only the corresponding type of mid-wave infrared hyperspectral and multispectral training images need be prepared in advance to train the fusion network. Once the network model is trained, it can be put into use, giving the method strong generality and robustness.

3. The invention is applicable to the fusion of mid-wave infrared hyperspectral and multispectral data of various dimensions, can obtain high-quality high-resolution mid-wave infrared hyperspectral images, and is robust to noise interference.

Description of the Drawings

Fig. 1 is a schematic flowchart of the method of an embodiment of the invention.

Fig. 2 is a schematic structural diagram of the fusion network based on the self-attention mechanism in an embodiment of the invention.

Fig. 3 is a schematic diagram of the input and output sizes of the fusion network based on the self-attention mechanism in an embodiment of the invention.

Fig. 4 shows the fusion results and error images of five fusion methods on the CAVE hyperspectral data set in an embodiment of the invention.

Fig. 5 shows the fusion results and error images of five fusion methods on the Harvard hyperspectral data set in an embodiment of the invention.

Detailed Description of Embodiments

Embodiment 1:

As shown in Fig. 1, the mid-wave infrared hyperspectral and multispectral image fusion method of this embodiment comprises:

S1: spatially up-sampling an input mid-wave infrared hyperspectral image Y to obtain an up-sampled mid-wave infrared hyperspectral image YU;

S2: concatenating the input mid-wave infrared multispectral image Z with the up-sampled mid-wave infrared hyperspectral image YU along the spectral dimension to obtain an image block C;

S3: extracting the residual image Xres of the image block C;

S4: adding the residual image Xres and the up-sampled mid-wave infrared hyperspectral image YU pixel-wise at corresponding positions to obtain the fused mid-wave infrared hyperspectral image X.
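The four steps can be sketched end-to-end as follows. This is a minimal NumPy illustration, not the patented implementation: the residual extractor is a placeholder for the self-attention fusion network described below, and nearest-neighbour repetition stands in for the bicubic up-sampling the method prefers.

```python
import numpy as np

def upsample_nearest(y, factor):
    """Spatial up-sampling stand-in for step S1 (the method prefers
    bicubic interpolation; nearest-neighbour repetition is used here
    only to keep the sketch dependency-free). y: (h, w, bands)."""
    return y.repeat(factor, axis=0).repeat(factor, axis=1)

def fuse(y, z, extract_residual, factor=16):
    """Steps S1-S4 of the fusion method."""
    y_u = upsample_nearest(y, factor)        # S1: up-sampled HSI YU
    c = np.concatenate([z, y_u], axis=-1)    # S2: spectral concat -> C
    x_res = extract_residual(c)              # S3: residual image Xres
    return y_u + x_res                       # S4: pixel-wise addition -> X

# toy run: 4x4x31 HSI, 64x64x3 MSI; the residual network is replaced
# by an all-zeros placeholder so that X equals YU in this sketch
y = np.random.rand(4, 4, 31)
z = np.random.rand(64, 64, 3)
x = fuse(y, z, lambda c: np.zeros(c.shape[:2] + (31,)))
print(x.shape)  # (64, 64, 31)
```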

The spatial up-sampling of the input mid-wave infrared hyperspectral image Y in step S1 may use any suitable method. As a preferred implementation, in this embodiment it means spatially up-sampling the input mid-wave infrared hyperspectral image Y by bicubic interpolation to obtain the up-sampled mid-wave infrared hyperspectral image YU.

Step S3, extracting the residual image Xres of the image block C, may be implemented with any suitable deep-learning neural network. As a preferred implementation, in this embodiment the corresponding residual image Xres is extracted by a pre-trained fusion network based on a self-attention mechanism.

In this embodiment, the fusion network based on the self-attention mechanism is a U-net built on the self-attention mechanism. As shown in Fig. 2, the fusion network consists of an interconnected encoder and decoder. The encoder converts the input three-dimensional image block into deep feature maps in the form of two-dimensional vector sequences, and comprises N cascaded patch merging layers performing down-sampling. For example, the first patch merging layer divides the input mid-wave infrared hyperspectral image block into a series of non-overlapping 4*4 patches and applies a 4*4 convolution to each patch; the number of convolution kernels is 96, so the resulting feature maps have a feature dimension of 96. Finally, the image is flattened into a two-dimensional vector of size W/4*H/4*96, each row of which represents the feature information of one dimension. Note that the patch merging layer is an existing network structure; see Liu Z, Lin Y, Cao Y, et al. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows[J]. arXiv preprint arXiv:2103.14030, 2021.
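A stride-4 convolution over non-overlapping 4*4 patches is equivalent to one linear projection per patch, which can be sketched as follows (random weights stand in for the learned kernels; this is an illustration of the operation, not the trained layer):

```python
import numpy as np

def patch_embed(c, patch=4, dim=96, seed=0):
    """Sketch of the first patch merging layer: split the W x H x 34
    input into non-overlapping 4x4 patches and project each patch with
    one linear map (equivalent to a 4x4 convolution with stride 4)."""
    rng = np.random.default_rng(seed)
    w_sp, h_sp, bands = c.shape
    weight = rng.standard_normal((patch * patch * bands, dim))  # 4*4*34 -> 96
    # regroup pixels into (num_patches, patch*patch*bands), then project
    p = c.reshape(w_sp // patch, patch, h_sp // patch, patch, bands)
    p = p.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * bands)
    return p @ weight                                            # (W/4*H/4, 96)

c = np.random.rand(64, 64, 34)   # image block C from step S2
tokens = patch_embed(c)
print(tokens.shape)  # (256, 96): 16*16 patches, 96 feature dims
```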

The decoder up-samples the deep feature maps, restores the global features to the input resolution, and performs pixel-level recovery prediction. The decoder comprises N cascaded patch expanding layers performing up-sampling, equal in number and in one-to-one correspondence with the patch merging layers in the encoder. In this embodiment, the patch expanding layer adopts the Patch Expanding layer of Swin-Unet; see Cao H, Wang Y, Chen J, et al. Swin-Unet: Unet-like Pure Transformer For Medical Image Segmentation. arXiv preprint arXiv:2105.05537, 2021.

In this embodiment, Swin Transformer blocks for extracting global features are connected in series between any adjacent patch merging layers, between adjacent patch merging and patch expanding layers, and between adjacent patch expanding layers. In addition, to compensate for the loss of spatial information caused by spatial down-sampling, the proposed network follows the U-net structure: skip connections are provided between the first N-1 patch merging layers of the encoder and the corresponding patch expanding layers, which concatenate the down-sampled feature map with the corresponding up-sampled feature map along the channel direction and then use a fully connected layer to adjust the channel dimension of the concatenated feature map so that the channel dimension is unchanged. Through the skip connections, the shallow and deep features extracted by the network modules are fused at multiple scales, compensating for the loss of spatial image information caused by the down-sampling operations.
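The skip connection just described can be sketched as follows (a minimal NumPy illustration with random stand-in weights for the fully connected layer; in the trained network these weights are learned):

```python
import numpy as np

def skip_fuse(dec_feat, enc_feat, seed=0):
    """Sketch of one skip connection: concatenate the decoder
    (up-sampled) and encoder (down-sampled) feature maps along the
    channel direction, then a fully connected layer maps the doubled
    channel count back down so the channel dimension is unchanged."""
    rng = np.random.default_rng(seed)
    cat = np.concatenate([dec_feat, enc_feat], axis=-1)   # channels: 2C
    channels = dec_feat.shape[-1]
    weight = rng.standard_normal((2 * channels, channels))
    return cat @ weight                                   # channels back to C

dec = np.random.rand(8, 8, 192)   # up-sampled feature map in the decoder
enc = np.random.rand(8, 8, 192)   # matching encoder feature map
print(skip_fuse(dec, enc).shape)  # (8, 8, 192): channel dimension unchanged
```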

When the fusion network based on the self-attention mechanism operates, the first patch merging layer first produces a two-dimensional vector of size W/4*H/4*96. This vector then passes through a series of Swin Transformer blocks and patch merging layers to produce feature representations at different levels: the Swin Transformer blocks extract the global information of the image, while the patch merging layers continue to down-sample the feature map (the two-dimensional vector) and increase the feature dimension; the result is finally fed into the first patch expanding layer of the decoder. The first two patch expanding layers then up-sample in sequence, and finally a fully connected layer restores the up-sampled result to the spatial resolution of the input image and restores the feature dimension to the original spectral dimension of the input image.

The Swin Transformer block is used to extract the global information of the image. It replaces the standard multi-head self-attention module of the Transformer block (an existing neural network module) with a multi-head self-attention module based on a shifted-window mechanism, while the other layers remain unchanged, so that successive Swin Transformer blocks contain, connected in sequence, a window-based multi-head self-attention (W-MSA) module and a shifted-window-based multi-head self-attention (SW-MSA) module; see Liu Z, Lin Y, Cao Y, et al. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows[J]. arXiv preprint arXiv:2103.14030, 2021. The Swin Transformer block consists of LayerNorm (LN) layers, multi-head self-attention (MSA) modules, residual connection structures, and a two-layer fully connected network (MLP) with GELU activation. Before each MSA module and each MLP, the block uses a LayerNorm layer to normalize the input data along the channel direction, and a residual connection is applied to the output of each module to increase the flexibility of the network structure. Based on the shifted-window mechanism in the block, two consecutive Swin Transformer blocks can be expressed as:

ẑ^l = W-MSA(LN(z^(l-1))) + z^(l-1),

z^l = MLP(LN(ẑ^l)) + ẑ^l,

ẑ^(l+1) = SW-MSA(LN(z^l)) + z^l,

z^(l+1) = MLP(LN(ẑ^(l+1))) + ẑ^(l+1),

In the above formulas, ẑ^l and z^l denote, respectively, the output features of the W-MSA or SW-MSA module and of the two-layer fully connected network in the l-th Swin Transformer block. SW refers to attention computed over shifted windows, and W to the conventional window attention mechanism; MSA denotes the multi-head self-attention module, LN the normalization layer, and MLP the two-layer fully connected network. Self-attention is computed following the method used in prior related work, i.e., each head includes a relative position bias when computing similarity:

Attention(Q, K, V) = SoftMax(QK^T/√d + B)V,

Q, K, V ∈ R^{M²×d},

In the above formula, R denotes the set of real numbers, Q, K and V denote the query, key and value matrices respectively, d is the dimension of the query and key matrices, M² is the number of image patches in a window with M the window size, and the values contained in B are taken from a smaller relative position bias matrix:

B̂ ∈ R^{(2M-1)×(2M-1)},

This relative position bias matrix B̂ is learned by the self-attention-based fusion network on the mid-wave infrared hyperspectral dataset.

The encoder part of the network proposed in this embodiment is based on the image encoding structure of the Swin Transformer. The image merging layer (Patch Merging layer) downsamples the feature map and increases the feature dimension, while the Swin Transformer block extracts global image features through the self-attention mechanism. The image merging layer first applies a PixelUnshuffle operation to the input image, realizing a two-fold spatial downsampling and a four-fold increase in the number of channels, then normalizes the feature map along the channel direction through a LayerNorm layer, and finally halves the number of channels of the feature map through a fully connected layer. The encoder contains three image merging layers and corresponding Swin Transformer blocks, where the downsampling factors of the three image merging layers are set to {4, 2, 2} and the numbers of Swin Transformer blocks per stage are set to {2, 2, 1}.
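The PixelUnshuffle step of the image merging layer can be illustrated with a small pure-Python sketch (an illustrative assumption, not the patent's code): an H×W×C feature map is rearranged into an (H/2)×(W/2)×4C map, after which the LayerNorm and fully connected layer described above would normalize and halve the channels.

```python
def pixel_unshuffle(x, r=2):
    """Rearrange an H x W x C feature map (nested lists) into
    (H/r) x (W/r) x (C*r*r): each r x r spatial block is folded into channels."""
    H, W = len(x), len(x[0])
    out = []
    for i in range(0, H, r):
        row = []
        for j in range(0, W, r):
            ch = []
            for di in range(r):      # row-major order within the r x r block
                for dj in range(r):
                    ch.extend(x[i + di][j + dj])
            row.append(ch)
        out.append(row)
    return out
```

For a 4×4 single-channel map the result is a 2×2 map with 4 channels per pixel, matching the "2× spatial downsampling, 4× channels" behavior described above.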

The decoder part of the network proposed in this embodiment mainly implements the upsampling of feature maps, restoring the global features to the input resolution for pixel-level restoration prediction. The decoder is mainly composed of image expansion layers and Swin Transformer blocks. The design of the image expansion layer follows the structure of the upsampling layer in the Swin-Unet network; the image expansion layer and the image merging layer have symmetric structures, so for the input feature map the image expansion layer realizes a PixelShuffle operation (the inverse of the image merging layer's PixelUnshuffle), i.e., spatial upsampling, while the Swin Transformer blocks in the decoder remain responsible for learning the global information of the feature maps. To obtain a symmetric encoder-decoder structure, the decoder likewise contains three image expansion layers and Swin Transformer blocks, where the upsampling factors of the three image expansion layers are set to {2, 2, 4} and the numbers of Swin Transformer blocks per stage are set to {1, 2, 2}.

As an optional implementation, to enhance the features extracted by the Swin Transformer blocks, in this embodiment a convolutional layer is correspondingly connected after each Swin Transformer block; the convolutional layer introduces the inductive bias of the convolutional structure into the Swin Transformer block, thereby enhancing the global features it extracts.

Referring to FIG. 2, the convolution kernel size of the convolutional layer in this embodiment is 3×3.

In addition, as an optional implementation, to speed up network training and improve the fusion effect, in this embodiment each Swin Transformer block is further followed, after its convolutional layer, by a residual module; the residual module takes the difference between the input of the Swin Transformer block and the output of its corresponding convolutional layer and outputs the result to the next image merging layer or image expansion layer.

As shown in FIG. 2 and FIG. 3, in this embodiment the encoder includes three cascaded image merging layers performing downsampling and the decoder includes three cascaded image expansion layers performing upsampling. The mid-wave infrared hyperspectral image Y has size W/16×H/16×31, the upsampled mid-wave infrared hyperspectral image Y_U has size W×H×31, the mid-wave infrared multispectral image Z has size W×H×3, and the image block C has size W×H×34. The feature maps output by the first, second and third image merging layers have sizes W/4×H/4×96, W/8×H/8×192 and W/16×H/16×384, respectively. After 2× upsampling, the first image expansion layer outputs a feature map of size W/8×H/8×192; after 2× upsampling, the second outputs W/4×H/4×96; after 4× upsampling, the third passes its output through a fully connected layer that restores the feature dimension to 31 spectral bands, yielding a residual image X_res of size W×H×31, where W is the width and H the height of the residual image X_res.
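The stage-by-stage sizes above can be traced with a short script. The downsampling/upsampling factors and channel counts are taken from this description; the function itself is an illustrative shape check, not part of the embodiment.

```python
def encoder_decoder_shapes(W, H):
    """Trace (width, height, channels) through the 3 image merging and 3 image
    expansion stages described above: downsampling factors {4, 2, 2}, channel
    counts {96, 192, 384}; upsampling factors {2, 2, 4}, final FC to 31 bands."""
    shapes = []
    w, h = W, H
    for factor, c in zip((4, 2, 2), (96, 192, 384)):   # encoder stages
        w, h = w // factor, h // factor
        shapes.append((w, h, c))
    for factor, c in zip((2, 2), (192, 96)):           # first two decoder stages
        w, h = w * factor, h * factor
        shapes.append((w, h, c))
    w, h = w * 4, h * 4                                # last stage + FC to 31 bands
    shapes.append((w, h, 31))
    return shapes
```

For a 256×256 input, the trace reproduces exactly the sequence of sizes listed in the paragraph above.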

The self-attention-based fusion network of this embodiment has the following advantages. (1) It exploits the spatial self-attention mechanism's excellent ability to extract long-range dependencies and global information from feature maps, alleviating the spatial information loss that convolutional neural networks suffer when extracting mid-wave infrared hyperspectral image features due to the limited receptive field of convolution kernels; this improves the reconstruction accuracy and computational efficiency of the mid-wave infrared hyperspectral image, so that a low-resolution mid-wave infrared hyperspectral image and a high-resolution mid-wave infrared multispectral image can be effectively fused into a high-resolution mid-wave infrared hyperspectral image. (2) The fusion network learns the residual domain of the mid-wave infrared hyperspectral image rather than directly learning the image domain itself, which shrinks the mapping space the network must learn, improving computational efficiency and making the network easier to train. (3) Combining the convolutional structure with the self-attention mechanism improves the inductive bias of the self-attention layers when extracting image features and eases the large-data training requirements of attention-based learning networks, so the proposed network uses data more efficiently on smaller mid-wave infrared hyperspectral datasets and achieves better fusion performance. (4) When fusing different types of mid-wave infrared hyperspectral and multispectral images, the network structure does not need to change; only the corresponding types of mid-wave infrared hyperspectral and multispectral training images need to be prepared in advance, and once trained the network model can be put into use, giving strong universality and robustness. (5) It is suitable for fusing mid-wave infrared hyperspectral and multispectral data of various dimensions, yields high-quality high-resolution mid-wave infrared hyperspectral images, and is robust to noise interference.

To verify the mid-wave infrared hyperspectral and multispectral image fusion method of this embodiment, simulation experiments were conducted on the CAVE and Harvard datasets. The CAVE dataset contains 32 hyperspectral images with 31 bands and a spatial resolution of 512×512; the Harvard dataset contains 50 hyperspectral images with 31 bands and a spatial resolution of 1392×1040. In the simulation experiments, the reference images of the CAVE or Harvard dataset are taken as ground-truth high-resolution mid-wave infrared hyperspectral images and subjected to Gaussian blurring, spatial downsampling and spectral downsampling, yielding the low-spatial-resolution mid-wave infrared hyperspectral and high-spatial-resolution mid-wave infrared multispectral datasets needed to train the network. Each reference image is first blurred with a 7×7 Gaussian kernel of mean 0 and standard deviation 3, then spatially downsampled by a factor of 16 to obtain the low-resolution mid-wave infrared hyperspectral image: the resulting size is 32×32×31 for the CAVE dataset and 87×65×31 for the Harvard dataset. To create a 3-band high-resolution mid-wave infrared multispectral image, a known spectral downsampling matrix is applied to the reference images of the hyperspectral dataset. Four typical hyperspectral and multispectral image fusion methods are compared, using five evaluation metrics for the fused images: peak signal-to-noise ratio (PSNR), spectral angle (SAM), universal image quality index (UIQI), relative dimensionless global error (ERGAS) and root mean square error (RMSE). Larger PSNR and UIQI values indicate better high-resolution image quality, while larger SAM, ERGAS and RMSE values indicate worse quality. Table 1 lists the objective evaluation metrics of four typical fusion methods (CSU, Hysure, CSTF, CNN_Fus) and the method proposed in this embodiment (mine) on the CAVE dataset, with the best results in bold. Table 2 lists the corresponding metrics on the Harvard dataset, with the best results in bold.
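The simulated degradation can be sketched in a few lines of pure Python. The kernel parameters (7×7, σ=3) and the 16× factor come from the description; the specific decimation choice (keeping every 16th pixel) and the helper names are illustrative assumptions, since the patent does not publish code for this step.

```python
import math

def gaussian_kernel(size=7, sigma=3.0):
    """Zero-mean 2-D Gaussian blur kernel, normalized to sum to 1."""
    half = size // 2
    k = [[math.exp(-(x * x + y * y) / (2.0 * sigma * sigma))
          for x in range(-half, half + 1)]
         for y in range(-half, half + 1)]
    s = sum(sum(row) for row in k)
    return [[v / s for v in row] for row in k]

def decimate(img, factor=16):
    """16x spatial downsampling by keeping every factor-th pixel
    (one simple choice of downsampling operator)."""
    return [row[::factor] for row in img[::factor]]
```

Applied band by band, blurring followed by decimation turns a 512×512 reference band into the 32×32 low-resolution band used for training on CAVE.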

Table 1: Objective performance metrics of the method of this embodiment and four typical fusion methods on the CAVE data.

[Table 1: numerical results available only as an image in the original document]

Table 2: Objective performance metrics of the method of this embodiment and four typical fusion methods on the Harvard data.

[Table 2: numerical results available only as an image in the original document]

As can be seen from Tables 1 and 2, all objective evaluation metrics of the proposed method (mine) are superior to those of the other methods. This is because the proposed deep fusion network extracts features with a self-attention mechanism: unlike traditional convolutional neural networks that focus only on extracting local image features, the self-attention layers capture the global features and long-range dependencies of the image, enabling the network to learn the spatial detail of the image more fully and further improve the resolution of the original mid-wave infrared hyperspectral image.
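For reference, RMSE and PSNR (two of the five metrics reported above) can be computed as follows. This is the standard formulation assuming images with peak value `max_val`, not code from the embodiment.

```python
import math

def rmse(ref, est):
    """Root mean square error between two equally sized images (flat lists)."""
    n = len(ref)
    return math.sqrt(sum((r - e) ** 2 for r, e in zip(ref, est)) / n)

def psnr(ref, est, max_val=255.0):
    """Peak signal-to-noise ratio in dB; higher means a better reconstruction."""
    err = rmse(ref, est)
    if err == 0.0:
        return float("inf")
    return 20.0 * math.log10(max_val / err)
```

Consistent with the text, a smaller RMSE yields a larger PSNR, and a perfect reconstruction gives an infinite PSNR.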

FIG. 4 compares the fusion results and error images of the five fusion methods on the CAVE test dataset, where: (a-1) is the original 19th band of the mid-wave infrared hyperspectral image and (a-2) the ideal error image; (b-1) and (b-2) are the fusion result and error image of the 19th band of the high-resolution mid-wave infrared hyperspectral image obtained by the CSU method; (c-1) and (c-2) those of the Hysure method; (d-1) and (d-2) those of the CSTF method; (e-1) and (e-2) those of the CNN_Fus method; and (f-1) and (f-2) those of the method proposed in this embodiment. FIG. 5 shows the corresponding comparison on the Harvard test dataset for the 28th band: (a-1) is the original 28th band and (a-2) the ideal error image; (b-1)/(b-2), (c-1)/(c-2), (d-1)/(d-2), (e-1)/(e-2) and (f-1)/(f-2) are the fusion results and error images of the 28th band obtained by the CSU, Hysure, CSTF and CNN_Fus methods and the method proposed in this embodiment, respectively. In FIG. 4 and FIG. 5, the error images reflect the difference between each fusion result and the ground-truth hyperspectral image. As the error images show, the high-resolution mid-wave infrared hyperspectral images fused by the other methods contain obvious flaws, whereas the method proposed in this embodiment better restores the spatial detail and structural information of the image while improving the hyperspectral spatial resolution, and its fused high-resolution mid-wave infrared hyperspectral image has the best spatial quality.

In summary, the mid-wave infrared hyperspectral and multispectral image fusion method of this embodiment draws on the U-net structure to build an encoder-decoder network for fusing mid-wave infrared hyperspectral and multispectral images. Through self-attention layers, it extracts and fuses the global information and long-range dependencies of feature maps at multiple scales, alleviating the spatial information loss that grows with network depth in traditional convolutional neural networks. In addition, to improve the network's utilization of hyperspectral training data and its fusion performance, a convolutional layer is added after each self-attention layer of the fusion network to introduce the inductive bias of the convolutional structure into the proposed network. The proposed network does not directly model the mapping from the mid-wave infrared hyperspectral and multispectral images to the fused mid-wave infrared hyperspectral image; instead it learns the residual of the fused mid-wave infrared hyperspectral image, which both speeds up network training and improves fusion accuracy and quality. Comparative experiments against other high-performance hyperspectral and multispectral image fusion methods on hyperspectral test datasets show that the mid-wave infrared hyperspectral images fused by the method of this embodiment have better quality and strong robustness to noise interference; moreover, when fusing different types of mid-wave infrared hyperspectral and multispectral images, the network structure need not change, only the corresponding types of mid-wave infrared hyperspectral and multispectral training images need to be prepared in advance, and once trained the network model can be used directly, giving strong universality and robustness.
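The overall fusion flow summarized above (upsample the hyperspectral image, splice it with the multispectral image along the spectral dimension, predict a residual, add the residual back) can be sketched at the top level as follows. The `upsample` and `network` callables are stand-ins: in the embodiment they would be bicubic interpolation and the trained self-attention fusion network, respectively.

```python
def fuse(Y, Z, upsample, network):
    """Top-level fusion flow (illustrative sketch with stub callables).

    Y: low-resolution hyperspectral image, nested lists H x W x C_h
    Z: high-resolution multispectral image, nested lists rH x rW x C_m
    upsample: callable realizing the spatial upsampling of Y
    network: callable mapping the spliced block to a residual image
    """
    YU = upsample(Y)                                   # spatial upsampling
    C = [[zp + yp for zp, yp in zip(zrow, yrow)]       # splice along spectra
         for zrow, yrow in zip(Z, YU)]
    Xres = network(C)                                  # residual prediction
    X = [[[yv + rv for yv, rv in zip(yp, rp)]          # position-wise addition
          for yp, rp in zip(yrow, rrow)]
         for yrow, rrow in zip(YU, Xres)]
    return X
```

With a stub network that predicts a constant residual, the output is simply the upsampled hyperspectral image shifted by that constant, which makes the wiring easy to verify.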

In addition, this embodiment also provides a mid-wave infrared hyperspectral and multispectral image fusion system, including a microprocessor and a memory connected to each other, wherein the microprocessor is programmed or configured to perform the steps of the aforementioned mid-wave infrared hyperspectral and multispectral image fusion method.

In addition, this embodiment also provides a computer-readable storage medium storing a computer program, wherein the computer program is used to be programmed or configured by a microprocessor to perform the steps of the aforementioned mid-wave infrared hyperspectral and multispectral image fusion method.

Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product implemented on one or more computer-readable storage media (including but not limited to disk storage, CD-ROM and optical storage) containing computer-usable program code. The present application is described with reference to flowcharts and/or block diagrams of methods, devices (systems) and computer program products according to embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, special-purpose computer, embedded processor or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram. These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram. These computer program instructions may also be loaded onto a computer or other programmable data processing device, causing a series of operational steps to be performed on the computer or other programmable device to produce a computer-implemented process, so that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.

The above are only preferred embodiments of the present invention; the scope of protection of the present invention is not limited to the above embodiments, and all technical solutions within the concept of the present invention fall within its scope of protection. It should be pointed out that, for those of ordinary skill in the art, several improvements and refinements made without departing from the principles of the present invention should also be regarded as within the scope of protection of the present invention.

Claims (7)

1. A method for fusing mid-wave infrared hyperspectral and multispectral images, characterized by comprising the following steps:
S1, performing spatial up-sampling on an input mid-wave infrared hyperspectral image Y to obtain an up-sampled mid-wave infrared hyperspectral image Y_U;
S2, splicing the input mid-wave infrared multispectral image Z and the up-sampled mid-wave infrared hyperspectral image Y_U according to the spectral dimension to obtain an image block C;
S3, extracting a residual image X_res of the image block C; the extraction of the residual image X_res is realized through a pre-trained fusion network based on a self-attention mechanism, which consists of an encoder and a decoder connected to each other, wherein the encoder comprises N image merging layers sequentially cascaded to perform down-sampling, the decoder comprises N image expansion layers sequentially cascaded to perform up-sampling, and the number of image merging layers in the encoder and the number of image expansion layers in the decoder are the same and in one-to-one correspondence; Swin Transformer blocks for extracting global features are connected in series between any adjacent image merging layers, between adjacent image merging layers and image expansion layers, and between adjacent image expansion layers; skip connections are arranged between the first N-1 image merging layers of the encoder and the corresponding image expansion layers, and are used for adjusting, through a fully connected layer, the channel dimension of the feature map obtained by splicing the down-sampled feature map and the corresponding up-sampled feature map along the channel direction, so that the channel dimension is unchanged; the image merging layer performs a pixel recombination (PixelUnshuffle) operation on the input image to realize a two-fold spatial down-sampling of the input image and a four-fold increase of the number of channels, then normalizes the feature map along the channel direction through a normalization layer, and finally halves the number of channels of the feature map through a fully connected layer; the image expansion layer and the image merging layer have symmetric structures, and for the input feature map the image expansion layer realizes a pixel recombination (PixelShuffle) operation opposite to that of the image merging layer, completing the spatial up-sampling function; the Swin Transformer block is composed of a normalization layer, a multi-head self-attention module, a residual connection structure and a two-layer fully connected network with a GELU activation function, and uses a normalization layer to normalize the input data along the channel direction before each multi-head self-attention module and each two-layer fully connected network; a convolutional layer is correspondingly connected after each Swin Transformer block, and the convolutional layers are used for introducing the inductive bias of the convolutional structure into the Swin Transformer blocks; each Swin Transformer block is further followed, after its convolutional layer, by a residual module, and the residual module is used for taking the difference between the input of the Swin Transformer block and the output of the corresponding convolutional layer and outputting the result to the next image merging layer or image expansion layer;
S4, adding the residual image X_res and the up-sampled mid-wave infrared hyperspectral image Y_U pixel-wise by position to obtain the fused mid-wave infrared hyperspectral image X.
2. The mid-wave infrared hyperspectral and multispectral image fusion method according to claim 1, wherein the spatial up-sampling of the input mid-wave infrared hyperspectral image Y in step S1 is performed by bicubic interpolation to obtain the up-sampled mid-wave infrared hyperspectral image Y_U.
3. The method according to claim 1, wherein the convolution kernel size of the convolutional layer is 3×3.
4. The method according to claim 1, wherein the encoder comprises 3 image merging layers sequentially cascaded to perform down-sampling, and the decoder comprises 3 image expansion layers sequentially cascaded to perform up-sampling.
5. The method according to claim 4, wherein the mid-wave infrared hyperspectral image Y has a size of W/16×H/16×31, the up-sampled mid-wave infrared hyperspectral image Y_U has a size of W×H×31, the mid-wave infrared multispectral image Z has a size of W×H×3, and the image block C has a size of W×H×34; the feature maps output by the first, second and third image merging layers have sizes of W/4×H/4×96, W/8×H/8×192 and W/16×H/16×384, respectively; the feature map output by the first image expansion layer after 2× up-sampling has a size of W/8×H/8×192, the feature map output by the second image expansion layer after 2× up-sampling has a size of W/4×H/4×96, and after 4× up-sampling by the third image expansion layer a fully connected layer restores the feature dimension to 31 spectral dimensions, outputting a residual image X_res of size W×H×31, where W is the width of the residual image X_res and H is the height of the residual image X_res.
6. A mid-wave infrared hyperspectral and multispectral image fusion system comprising a microprocessor and a memory connected to each other, characterized in that the microprocessor is programmed or configured to perform the steps of the mid-wave infrared hyperspectral and multispectral image fusion method according to any one of claims 1 to 5.
7. A computer readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a microprocessor, performs the steps of the mid-wave infrared hyperspectral and multispectral image fusion method according to any one of claims 1 to 5.
CN202210941183.1A 2022-08-08 2022-08-08 Medium-wave infrared hyperspectral and multispectral image fusion method, system and medium Active CN115018750B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210941183.1A CN115018750B (en) 2022-08-08 2022-08-08 Medium-wave infrared hyperspectral and multispectral image fusion method, system and medium


Publications (2)

Publication Number Publication Date
CN115018750A CN115018750A (en) 2022-09-06
CN115018750B (en) 2022-11-08

Family

ID=83065760

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210941183.1A Active CN115018750B (en) 2022-08-08 2022-08-08 Medium-wave infrared hyperspectral and multispectral image fusion method, system and medium

Country Status (1)

Country Link
CN (1) CN115018750B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115564692B (en) * 2022-09-07 2023-12-05 宁波大学 Full color-multispectral-hyperspectral integrated fusion method considering breadth difference
CN116778402A (en) * 2022-12-29 2023-09-19 新大陆数字技术股份有限公司 Firework detection method
CN115880199B (en) * 2023-03-03 2023-05-16 湖南大学 A long-wave infrared hyperspectral and multispectral image fusion method, system and medium
CN116468645B (en) * 2023-06-20 2023-09-15 吉林大学 An adversarial hyperspectral multispectral remote sensing fusion method
CN117058009B (en) * 2023-06-21 2025-06-27 西北工业大学深圳研究院 Full-color sharpening method based on conditional diffusion model
CN117315481B (en) * 2023-10-23 2025-03-21 安徽大学 Hyperspectral image classification method based on spectral-spatial self-attention and Transformer network
CN119399043B (en) * 2025-01-03 2025-05-09 齐鲁工业大学(山东省科学院) Hyperspectral and multispectral image fusion method based on classification and multi-level residual enhancement

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11189034B1 (en) * 2020-07-22 2021-11-30 Zhejiang University Semantic segmentation method and system for high-resolution remote sensing image based on random blocks
CN113762264A (en) * 2021-08-26 2021-12-07 南京航空航天大学 A Multi-Encoder Fusion Method for Multispectral Image Semantic Segmentation
CN113793275A (en) * 2021-08-27 2021-12-14 西安理工大学 A Swin Unet Low Illumination Image Enhancement Method
CN113989228A (en) * 2021-10-27 2022-01-28 西安工程大学 A Self-Attention-Based Detection Method for Defect Areas of Color Textured Fabrics
CN114757831A (en) * 2022-06-13 2022-07-15 湖南大学 High-resolution video hyperspectral imaging method, device and medium based on intelligent spatial spectrum fusion
CN114820491A (en) * 2022-04-18 2022-07-29 汕头大学 Multi-modal stroke lesion segmentation method and system based on small sample learning


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Recent advances and new guidelines on hyperspectral and multispectral image fusion;Renwei Dian, et al.;《Information Fusion》;20210531;40-51 *
SwinFusion: Cross-domain Long-range Learning for General Image Fusion via Swin Transformer;Jiayi Ma,et al.;《IEEE/CAA Journal of Automatica Sinica》;20220630;1200-1217 *
Lumbar spine image segmentation method based on attention mechanism and Swin Transformer model; Tian Yingzhong et al.; 《计量与测试技术》 (Metrology & Measurement Technique); 20211231; 57-61 *


Similar Documents

Publication Publication Date Title
CN115018750B (en) Medium-wave infrared hyperspectral and multispectral image fusion method, system and medium
Qin et al. Multi-scale feature fusion residual network for single image super-resolution
CN116152120B (en) A low-light image enhancement method and device for fusing high and low frequency feature information
CN109859110B (en) Panchromatic Sharpening Method of Hyperspectral Image Based on Spectral Dimension Control Convolutional Neural Network
CN115222601A (en) Image super-resolution reconstruction model and method based on residual mixed attention network
CN111696035A (en) Multi-frame image super-resolution reconstruction method based on optical flow motion estimation algorithm
CN114549385B (en) Optical and SAR image fusion cloud removing method based on depth dense residual error network
CN110428387A (en) EO-1 hyperion and panchromatic image fusion method based on deep learning and matrix decomposition
CN115564692B (en) Full color-multispectral-hyperspectral integrated fusion method considering breadth difference
CN106910161A (en) A kind of single image super resolution ratio reconstruction method based on depth convolutional neural networks
CN114821261A (en) Image fusion algorithm
CN115272438A (en) High-precision monocular depth estimation system and method for three-dimensional scene reconstruction
CN104899835B (en) Image Super-resolution processing method based on blind blur estimation and anchoring space mapping
CN114937206A (en) Target detection method in hyperspectral images based on transfer learning and semantic segmentation
KR20190059157A (en) Method and Apparatus for Improving Image Quality
CN114387439A (en) Semantic segmentation network based on fusion of optical and PolSAR (polar synthetic Aperture Radar) features
Liu et al. Band-independent encoder–decoder network for pan-sharpening of remote sensing images
Nathan et al. Light weight residual dense attention net for spectral reconstruction from RGB images
CN112767243A (en) Hyperspectral image super-resolution implementation method and system
CN120259100A (en) A method and device for multi-scale fusion of hyperspectral image and multispectral image
CN110675320A (en) Method for sharpening target image under spatial parameter change and complex scene
Li et al. Dual-Branch Cross-Resolution Interaction Learning Network for Change Detection at Different Resolutions
Kim et al. Deep spectral blending network for color bleeding reduction in PAN-sharpening images
CN118898554B (en) Image deraining method based on multi-scale staggered Transformer
CN115994855A (en) Panchromatic sharpening method for super-resolution fusion images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant