CN116543227A - Remote sensing image scene classification method based on graph convolution network - Google Patents
Remote sensing image scene classification method based on graph convolution network
- Publication number
- CN116543227A (application number CN202310577746.8A)
- Authority
- CN
- China
- Prior art keywords
- feature
- image
- layer
- remote sensing
- graph
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/10—Terrestrial scenes
Abstract
The invention provides a remote sensing image scene classification method based on a graph convolution network, relating to the technical field of remote sensing. The method comprises the following steps: constructing a scene classification model and training the model: performing super-pixel segmentation on a training sample to obtain image blocks; extracting features from the image blocks with a deep convolutional network to obtain a first feature map, constructing a region adjacency graph based on the image blocks and the first feature map, and obtaining a first feature matrix and an adjacency matrix; establishing a graph convolution network to obtain a second feature; performing position coding and attention feature extraction on the image blocks with a Transformer module to obtain a third feature; fusing the second feature and the third feature, and inputting the fused feature into a classification layer to obtain a classification result; and acquiring a remote sensing image to be classified and inputting it into the scene classification model to obtain a scene classification result. The scene classification model has high classification accuracy and can describe the semantic information of remote sensing images.
Description
Technical Field
The invention relates to the technical field of remote sensing, in particular to a remote sensing image scene classification method based on a graph convolution network.
Background
The earth is the common home of humankind. As human civilization has progressed, discovering and understanding the unknown world by technical means has become a powerful driving force for that progress. Because the earth's surface is so vast, human beings, despite having evolved on the earth for tens of millions of years, long had only a limited, local rather than global understanding of their living environment. It was not until satellite remote sensing technology appeared in the middle of the 20th century that human beings could acquire image data of the earth's surface through this "eye in the sky", truly opening the curtain on a relatively continuous observation of the whole earth. In particular, since the beginning of the 21st century, with the rapid progress of science and technology and the vigorous development of remote sensing technology, the number of remote sensing satellites has kept increasing and the earth observation data acquired by human beings has accumulated rapidly, reaching the EB scale at present; subsequent big data technologies provide technical support for processing and mining information from such massive data.
As one of the research hotspots in the field of remote sensing earth observation, remote sensing image scene classification aims to automatically assign an image to a specific semantic label according to the content of the remote sensing scene image, providing an auxiliary reference for image understanding. Scene classification of remote sensing images must distinguish semantic categories by means of the visual features and spatial context information of the image scenes, under the assumption that scenes of the same type share more similar features, so the key to scene classification is the extraction of image features. Traditional hand-crafted feature extraction methods can only obtain low- and mid-level feature representations of an image; such methods lack generalization capability and have difficulty accurately describing image semantics. The hierarchical abstraction capability of deep learning can automatically learn high-level visual features of an image scene and effectively improve scene classification performance. In related research, a deep convolutional neural network is mostly used for feature learning, and the classification task is completed in combination with a classifier. However, most existing deep learning methods are based on convolutional neural networks, and the classification accuracy of the classification models used by these methods still needs to be improved.
Disclosure of Invention
Based on the above technical problems, the invention provides a remote sensing image scene classification method based on a graph convolution network. During training, a graph convolution network is established to obtain features embedded with spatial topology information, a Transformer extracts features embedded with position information, and the two kinds of features are fused and classified.
The invention provides a remote sensing image scene classification method based on a graph convolution network, which comprises the following steps:
S1, constructing a scene classification model and training the scene classification model, wherein the scene classification model comprises a deep convolutional network, a graph convolution network, a Transformer module and a classification layer, the graph convolution network being built while the scene classification model is trained, and the training steps comprise:
S11, performing super-pixel segmentation on a training sample to obtain a plurality of image blocks, the training sample being expressed as an image block sequence;
S12, inputting the training sample into the deep convolutional network to obtain a first feature map corresponding to each image block;
S13, constructing a region adjacency graph based on the image blocks and the first feature maps, and acquiring a first feature matrix and an adjacency matrix;
S14, using the first feature matrix and the adjacency matrix as graph data to establish a graph convolution network, and learning a second feature according to the message passing mechanism of the graph convolution network;
S15, inputting the training sample into the Transformer module, wherein the Transformer module comprises an input layer and a Transformer layer; performing position coding on the image blocks with the input layer to obtain a position vector sequence, embedding the position vector sequence into the representation of the corresponding image block sequence, and then inputting the result into the Transformer layer to obtain a third feature;
S16, fusing the second feature and the third feature, and inputting the fused feature into the classification layer to obtain a classification result;
S2, acquiring a remote sensing image to be classified, and inputting it into the scene classification model to obtain a scene classification result.
In an embodiment of the present invention, step S13 includes:
up-sampling the first feature maps, and processing the up-sampled first feature maps with max pooling to obtain a first feature matrix corresponding to each first feature map;
establishing a spatial 4-neighborhood relation for the image blocks and constructing the adjacency matrix, wherein the adjacency matrix describes the spatial topological structure among the image blocks.
In an embodiment of the present invention, the method for upsampling the first feature map includes:
interpolating each pixel of the first feature map by nearest-neighbor interpolation, and enlarging the first feature map to the same size as the corresponding image block.
In a specific embodiment of the invention, the second feature comprises a representation of the spatial topology.
In one embodiment of the present invention, the image blocks are image areas that do not overlap each other.
In an embodiment of the present invention, step S15 includes:
the Transformer module comprises an input layer and a Transformer layer, wherein the input layer is a position coding layer; the position coding layer performs position coding on the image blocks to obtain a position vector sequence, and the position vector sequence is then embedded into the image block sequence to obtain an image block sequence embedded with position vector information;
the Transformer layer comprises 4 encoders, each encoder comprising a multi-head attention layer and a multi-layer perceptron; the image block sequence embedded with the position vector information is input into the Transformer layer, the image block sequence is encoded, and the third feature is obtained as the output.
In one embodiment of the present invention, the multi-layer perceptron includes an activation function.
In a specific embodiment of the present invention, the third feature is the output feature of the last encoder.
In an embodiment of the present invention, the activation function is one of a sigmoid function, a tanh function, and a relu function.
In a specific embodiment of the present invention, the feature dimensions of the second feature and the third feature are the same.
The beneficial effects of the invention are as follows: the invention provides a remote sensing image scene classification method based on a graph convolution network. The image is first super-pixel segmented, and the image blocks are taken as the basic processing units; the image blocks then pass through two processing branches. In the first branch, a first feature map is extracted with a deep convolutional network and up-sampled to the size of the original image block; a region adjacency graph is then constructed for the image blocks, a graph convolution network is established with the first feature matrix and the adjacency matrix as graph data in combination with the spatial topological relation, and a second feature embedded with the spatial topological relation is obtained according to the message passing mechanism of the graph convolution network. The second feature effectively learns the spatial relations among features and targets in the remote sensing image scene, has stronger representation capability and helps to improve classification accuracy. In the second branch, position coding and attention feature extraction are performed on the image blocks by a Transformer module, and the image block sequence and position vector information are encoded based on a multi-head attention mechanism and a multi-layer perceptron to obtain a third feature embedded with position information; the multi-head attention mechanism enlarges the receptive field during computation and improves the performance of the model, and these steps can be computed in parallel, which effectively reduces the amount of computation. Because the second feature and the third feature have the same dimension, feature fusion is easier to carry out; the fused feature is more comprehensive and representative, gives higher classification accuracy for easily confused and complex remote sensing images, and effectively avoids the problem of target loss during classification.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the embodiments are briefly described below. Obviously, the drawings in the following description show only some embodiments of the present invention; a person skilled in the art may obtain other drawings from these drawings without inventive effort.
FIG. 1 is a flow chart of a method according to an embodiment of the present invention;
fig. 2 is a training flow chart of an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which are derived by a person skilled in the art based on the embodiments of the invention, fall within the scope of protection of the invention.
Referring to fig. 1 and 2, the present invention provides a remote sensing image scene classification method based on a graph convolution network, which comprises:
S1, constructing a scene classification model and training the scene classification model, wherein the scene classification model comprises a deep convolutional network, a graph convolution network, a Transformer module and a classification layer, the graph convolution network being built while the scene classification model is trained, and the training steps comprise:
S11, performing super-pixel segmentation on a training sample to obtain a plurality of image blocks, the training sample being expressed as an image block sequence;
S12, inputting the training sample into the deep convolutional network to obtain a first feature map corresponding to each image block;
S13, constructing a region adjacency graph based on the image blocks and the first feature maps, and acquiring a first feature matrix and an adjacency matrix;
S14, using the first feature matrix and the adjacency matrix as graph data to establish a graph convolution network, and learning a second feature according to the message passing mechanism of the graph convolution network;
S15, inputting the training sample into the Transformer module, wherein the Transformer module comprises an input layer and a Transformer layer; performing position coding on the image blocks with the input layer to obtain a position vector sequence, embedding the position vector sequence into the representation of the corresponding image block sequence, and then inputting the result into the Transformer layer to obtain a third feature;
S16, fusing the second feature and the third feature, and inputting the fused feature into the classification layer to obtain a classification result;
S2, acquiring a remote sensing image to be classified, and inputting it into the scene classification model to obtain a scene classification result.
A group of remote sensing images is acquired and used as training samples to train the scene classification model. The training set is recorded as {(x_i, y_i) | i = 1, 2, …, N}, where x_i ∈ R^(C×H×W) represents the i-th training sample, C, H and W represent the number of channels, the height and the width of the training sample respectively, y_i represents the scene category corresponding to the training sample, and N is the number of training samples.
Super-pixel segmentation is performed on the training sample with the SLIC algorithm to obtain a plurality of image blocks, where the image blocks are mutually non-overlapping image areas, so that each training sample can be expressed as a sequence of image blocks: x_i = {x_i^1, x_i^2, …, x_i^m} represents the i-th training sample as a sequence of m image blocks, each image block x_i^j ∈ R^(C×p×p), where p is the side length of each image block and m = HW/p². The specific process of super-pixel segmentation is as follows:
(1) Initializing seed points, i.e. cluster centers
Seed points are uniformly distributed in the training sample according to the set number of super pixels. If the image of a training sample has N pixel points in total and is pre-segmented into m super pixels of the same size, then the size of each super pixel is N/m pixels, and the distance (step size) between adjacent seed points is approximately S = sqrt(N/m).
(2) Reselection of seed points within a neighborhood
The seed point is re-selected within the 3×3 neighborhood of the seed point, specifically: the gradient values of all pixel points in the neighborhood are calculated, and the seed point is moved to the position with the smallest gradient in the neighborhood. The purpose of this is to avoid seed points falling on contour boundaries with large gradients, which would affect the subsequent clustering.
(3) Assigning class labels to each pixel point in a neighborhood around each seed point, i.e. to which cluster center it belongs
The search range of SLIC is limited to 2S × 2S, which can accelerate algorithm convergence.
(4) Distance measurement
The distance includes a color distance and a spatial distance. For each searched pixel point, its distance to the seed point is calculated as follows:

d_c = sqrt((l_j − l_i)² + (a_j − a_i)² + (b_j − b_i)²)  (1)

d_s = sqrt((x_j − x_i)² + (y_j − y_i)²)  (2)

D′ = sqrt((d_c / N_c)² + (d_s / N_s)²)  (3)

where d_c represents the color distance, d_s represents the spatial distance, x and y represent the coordinate values of a pixel point, l represents the brightness value of a pixel point, a represents its position on the magenta-green axis, b its position on the yellow-blue axis, and N_s is the maximum spatial distance within a class, N_s = S = sqrt(N/m), which applies to every cluster. N_c is the maximum color distance, which varies from picture to picture and from cluster to cluster; in this embodiment it is replaced by a fixed constant k whose value ranges from 1 to 40, so that formula (3) becomes:

D = sqrt((d_c / k)² + (d_s / S)²)  (4)
the result of equation (4) is the final distance measure. Since each pixel point is searched by a plurality of seed points, each pixel point has a distance from surrounding seed points, and the seed point corresponding to the minimum value is taken as the clustering center of the pixel point.
The color distance is defined in the Lab color space. The Lab color model consists of three components: the luminance L and the two color channels a and b. L represents brightness and ranges from 0 (black) to 100 (white); a represents the position on the magenta-green axis (negative values indicate green, positive values magenta), and b represents the position on the yellow-blue axis (negative values indicate blue, positive values yellow). The Lab color space has the following advantages: 1) Unlike the RGB and CMYK color spaces, Lab is designed to approximate human physiological vision. It aims at perceptual uniformity, and its L component closely matches human luminance perception. It can therefore be used for accurate color balance by modifying the output tone of the a and b components, or to adjust brightness contrast using the L component; such transformations are difficult or impossible in RGB or CMYK. 2) Lab is considered a device-independent color model, because it describes how colors appear rather than the amount of a particular colorant needed by a device (such as a display, printer or digital camera) to generate them. 3) Its color gamut is broad: it contains not only all of the RGB and CMYK gamuts but also colors they cannot represent, and any color perceptible to the naked human eye can be represented in the Lab model. In addition, the Lab color model compensates for the uneven color distribution of the RGB color model, in which there are too many transition colors between blue and green and too few yellows and other colors between green and red.
(5) Iterative optimization
Steps (1)-(4) above are iterated until the clustering center of each pixel point no longer changes.
(6) Enhancing connectivity
The above iterative optimization may leave the following flaws: multiple connected components, undersized super pixels, a single super pixel cut into several discrete pieces, and so on. These problems can be solved by enhancing connectivity. The main idea is: create a marking table whose elements are all initialized to -1, traverse the image in a Z-shaped order (from left to right and from top to bottom), reassign discontinuous super pixels and undersized super pixels to adjacent super pixels, and assign the traversed pixel points to the corresponding labels until all points have been traversed.
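In practice, steps (1)-(6) above can be delegated to an off-the-shelf SLIC implementation. The following sketch uses scikit-image; the file name, the number of segments and the compactness value are placeholders, and the library's compactness parameter plays the role of the constant k.

```python
from skimage.io import imread
from skimage.segmentation import slic

image = imread("scene.tif")              # H x W x C remote sensing scene (placeholder path)
labels = slic(
    image,
    n_segments=196,                      # desired number m of superpixels (placeholder)
    compactness=10,                      # trade-off between colour and spatial distance
    enforce_connectivity=True,           # step (6): reassign stray fragments
)
# labels[i, j] is the superpixel (image block) index of pixel (i, j)
```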
After the training sample has been super-pixel segmented into image blocks, the image blocks enter two processing flows for feature extraction. The specific steps are as follows:
the first processing flow is as follows: the training samples are input into a deep convolutional network to obtain a first feature map corresponding to the image block, the deep convolutional network in the embodiment is ResNet-50, the image block is input into the ResNet-50 to obtain a four-layer feature map, and the feature map of the last layer, namely the fourth layer, is used as the first feature map. ResNet is a depth residual neutral network, and the network is added with jump connection in each convolution layer to realize feature identity mapping, so that the problem of detail loss during feature compression and feature extraction by the convolution layers can be avoided, and the last layer of feature map contains rich semantic information of images, and the last layer of feature map is used as a first feature map, so that the subsequent classification accuracy can be improved more favorably.
Then each pixel of the first feature map is interpolated by nearest-neighbor interpolation, and the first feature map is enlarged to the same size as the corresponding image block. The up-sampled first feature maps are processed by max pooling to obtain the first feature matrix, recorded as B ∈ R^(m×t), where m is the number of image blocks and t is the first feature dimension corresponding to an image block. A spatial 4-neighborhood relation is established for the image blocks, and an adjacency matrix of size m × m is constructed; in this way the spatial topological relation inside the scene is constructed, so the adjacency matrix describes the spatial topological structure among the image blocks. Up-sampling the first feature map makes it high-resolution while preserving high-level features, retaining the image information to some extent.
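The upsampling, pooling and adjacency construction can be sketched as follows; the helper names are hypothetical, and the adjacency sketch assumes the m image blocks lie on a regular grid, which is a simplification of the superpixel case.

```python
import torch
import torch.nn.functional as F

def block_feature(first_feature_map, block_hw):
    """Nearest-neighbour upsample a block's feature map to the block size,
    then global max-pool it into a single t-dimensional feature vector."""
    up = F.interpolate(first_feature_map, size=block_hw, mode="nearest")
    return torch.amax(up, dim=(2, 3))                       # shape (batch, t)

def grid_adjacency(rows, cols):
    """Spatial 4-neighbourhood adjacency matrix for m = rows * cols blocks."""
    m = rows * cols
    A = torch.zeros(m, m)
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    A[i, rr * cols + cc] = 1.0               # block i touches block (rr, cc)
    return A
```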
The first feature matrix is used as the nodes of the graph structure, and a graph convolution network is established by combining the adjacency matrix and its spatial topological relation; the network is then trained, and the spatial topological relation and the first feature matrix are propagated as messages according to the layer-wise propagation rule of the graph convolution network, yielding the feature embedded with the spatial topological relation, namely the second feature B′ ∈ R^(m×t). Because the first feature map has been up-sampled to the original block scale before convolution, the second feature obtained after graph convolution is also of the same scale, which avoids the extra computation caused by mismatched feature scales and simplifies the subsequent computation and processing.
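A minimal sketch of one graph convolution (message passing) layer, following the standard layer-wise propagation rule H' = ReLU(D^(-1/2)(A + I)D^(-1/2) H W); the class name, the number of layers and the choice of equal input and output dimensions are assumptions, since the embodiment does not fix them.

```python
import torch

class GraphConvLayer(torch.nn.Module):
    """One graph convolution layer: normalised message passing, then a linear transform and ReLU."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = torch.nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, H, A):
        A_hat = A + torch.eye(A.size(0), device=A.device)    # add self-loops
        deg = A_hat.sum(dim=1)
        D_inv_sqrt = torch.diag(deg.pow(-0.5))
        A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt              # symmetric normalisation
        return torch.relu(A_norm @ self.weight(H))            # aggregate neighbours, then update

# e.g. the second feature B' (m x t) from the first feature matrix B and adjacency A:
# B_prime = GraphConvLayer(t, t)(B, A)
```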
The second processing flow: the training sample is input into the Transformer module. The training sample is represented by its image block sequence, i.e., each training sample is an image block sequence. The Transformer module comprises an input layer and a Transformer layer, where the input layer is a position coding layer. Each image block has a position that can be represented by position coding; the position coding layer performs position coding on the image blocks to obtain a position vector sequence, which is then embedded into the image block sequence to obtain the image block sequence embedded with position vector information, represented as follows:
Z = [x_i^1 E; x_i^2 E; …; x_i^m E] + E_pos  (5)

where E represents a learnable embedding matrix that projects the image blocks into an m × t dimensional embedded representation, and E_pos represents the spatial positions of the image blocks in the training sample image, encoded into the embedded representation. In this embodiment the position coding takes the following form:
in the above formula, p is the position of the image block in the corresponding image block sequence, d model The vector length of the position-coding information is the same as the vector length of the second feature, i.e. the dimension is the same, f represents the position of the image block in the training samples. The above formula adds sin variable at even position and cos variable at odd position of each image block to generate spatial position vector with the same dimension as the second feature, then embeds the spatial position vector into the image block sequence according to formula (5), and the result of formula (5) is input of the transducer layer.
The Transformer layer comprises 4 encoders, each of which comprises a multi-head attention layer and a multi-layer perceptron. The image block sequence embedded with the position vector information is input into the Transformer layer, the sequence is encoded, and the third feature is obtained as the output. When the Transformer layer is trained, three matrices Z_Q, Z_K and Z_V in the Transformer layer are trained; multiplying the input sequence by these three matrices yields the query matrix, the key matrix and the value matrix respectively. The self-attention mechanism is as follows:
in the middle ofQ, K, V are respectively a query matrix, a key matrix and a value matrix, d k Is the dimension of the input, softmax () represents the softmax function. The use of a multi-head attention mechanism in this embodiment improves the performance of the transducer layer, i.e., using multiple Z' s Q 、Z K 、Z V The matrix generates a plurality of query matrixes, key matrixes and value matrixes, a plurality of characteristic values are output according to a formula (7), then the characteristic values are spliced and multiplied by a matrix parameter to output a characteristic, the multi-layer perceptron is an artificial neural network with a forward structure and comprises two linear connection layers and an activation function, the activation function is one of a sigmoid function, a tanh function and a relu function, the activation function is adopted in the embodiment and can be regarded as a directed graph, the directed graph consists of a plurality of node layers, each layer is fully connected to the next layer, the characteristic output by a multi-head attention layer is input into the multi-layer perceptron to obtain the output characteristic of an encoder, the steps are repeated three times, the output characteristic of the first encoder is the input of a second encoder, and the output characteristic of the last encoder is the third characteristicIn the processing flow, during calculation, the multi-head attention mechanism can enlarge the receptive field area, improve the performance of the model, and the steps can be performed in parallel, so that the calculated amount is effectively reduced.
After the second feature and the third feature are obtained, the two features are fused to obtain a fusion feature, and a fusion formula is as follows:
W=concat(B′,X′) (8)
w is the fusion feature, B 'is the second feature, X' is the third feature, and concat () represents the merging of the two features. And inputting the fused features into a classification layer to obtain classification results. The fusion characteristics can be more comprehensive and representative, and the method has higher classification accuracy for the confusable images. The classification layer can be an SVM classifier or other classifiers, and when training, the classification result is evaluated according to the classification accuracy, recall rate and F1 score, and the parameters of the scene classification model are adjusted according to the evaluated result.
After training the scene classification model, the remote sensing images to be classified can be subjected to scene classification.
The beneficial effects of the invention are as follows: the invention provides a remote sensing image scene classification method based on a graph convolution network. The image is first super-pixel segmented, and the image blocks are taken as the basic processing units; the image blocks then pass through two processing branches. In the first branch, a first feature map is extracted with a deep convolutional network and up-sampled to the size of the original image block; a region adjacency graph is then constructed for the image blocks, a graph convolution network is established with the first feature matrix and the adjacency matrix as graph data in combination with the spatial topological relation, and a second feature embedded with the spatial topological relation is obtained according to the message passing mechanism of the graph convolution network. The second feature effectively learns the spatial relations among features and targets in the remote sensing image scene, has stronger representation capability and helps to improve classification accuracy. In the second branch, position coding and attention feature extraction are performed on the image blocks by a Transformer module, and the image block sequence and position vector information are encoded based on a multi-head attention mechanism and a multi-layer perceptron to obtain a third feature embedded with position information; the multi-head attention mechanism enlarges the receptive field during computation and improves the performance of the model, and these steps can be computed in parallel, which effectively reduces the amount of computation. Because the second feature and the third feature have the same dimension, feature fusion is easier to carry out; the fused feature is more comprehensive and representative, gives higher classification accuracy for easily confused and complex remote sensing images, and effectively avoids the problem of target loss during classification.
Claims (10)
1. A remote sensing image scene classification method based on a graph convolution network, characterized by comprising the following steps:
S1, constructing a scene classification model and training the scene classification model, wherein the scene classification model comprises a deep convolutional network, a graph convolution network, a Transformer module and a classification layer, the graph convolution network being built while the scene classification model is trained, and the training steps comprise:
S11, performing super-pixel segmentation on a training sample to obtain a plurality of image blocks, the training sample being expressed as an image block sequence;
S12, inputting the training sample into the deep convolutional network to obtain a first feature map corresponding to each image block;
S13, constructing a region adjacency graph based on the image blocks and the first feature maps, and acquiring a first feature matrix and an adjacency matrix;
S14, using the first feature matrix and the adjacency matrix as graph data to establish a graph convolution network, and learning a second feature according to the message passing mechanism of the graph convolution network;
S15, inputting the training sample into the Transformer module, wherein the Transformer module comprises an input layer and a Transformer layer; performing position coding on the image blocks with the input layer to obtain a position vector sequence, embedding the position vector sequence into the representation of the corresponding image block sequence, and then inputting the result into the Transformer layer to obtain a third feature;
S16, fusing the second feature and the third feature, and inputting the fused feature into the classification layer to obtain a classification result;
S2, acquiring a remote sensing image to be classified, and inputting it into the scene classification model to obtain a scene classification result.
2. The remote sensing image scene classification method based on a graph convolution network according to claim 1, wherein step S13 comprises:
up-sampling the first feature maps, and processing the up-sampled first feature maps with max pooling to obtain a first feature matrix corresponding to each first feature map;
establishing a spatial 4-neighborhood relation for the image blocks and constructing the adjacency matrix, wherein the adjacency matrix describes the spatial topological structure among the image blocks.
3. The remote sensing image scene classification method based on a graph convolution network according to claim 2, wherein the method for up-sampling the first feature map is:
interpolating each pixel of the first feature map by nearest-neighbor interpolation, and enlarging the first feature map to the same size as the corresponding image block.
4. The method of claim 2, wherein the second feature comprises a representation of the spatial topology.
5. The method of claim 1, wherein the image blocks are non-overlapping image regions.
6. The remote sensing image scene classification method based on a graph convolution network according to claim 1, wherein step S15 comprises:
the Transformer module comprises an input layer and a Transformer layer, wherein the input layer is a position coding layer; the position coding layer performs position coding on the image blocks to obtain a position vector sequence, and the position vector sequence is then embedded into the image block sequence to obtain an image block sequence embedded with position vector information;
the Transformer layer comprises 4 encoders, each encoder comprising a multi-head attention layer and a multi-layer perceptron; the image block sequence embedded with the position vector information is input into the Transformer layer, the image block sequence is encoded, and the third feature is obtained as the output.
7. The method of claim 6, wherein the multi-layer perceptron comprises an activation function.
8. The method of claim 7, wherein the third feature is the output feature of the last encoder.
9. The remote sensing image scene classification method based on a graph convolution network according to claim 7, wherein the activation function is one of a sigmoid function, a tanh function and a relu function.
10. The method of claim 1, wherein the second feature and the third feature have the same feature dimension.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310577746.8A CN116543227A (en) | 2023-05-22 | 2023-05-22 | Remote sensing image scene classification method based on graph convolution network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310577746.8A CN116543227A (en) | 2023-05-22 | 2023-05-22 | Remote sensing image scene classification method based on graph convolution network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116543227A true CN116543227A (en) | 2023-08-04 |
Family
ID=87445146
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310577746.8A Pending CN116543227A (en) | 2023-05-22 | 2023-05-22 | Remote sensing image scene classification method based on graph convolution network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116543227A (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116934754A (en) * | 2023-09-18 | 2023-10-24 | 四川大学华西第二医院 | Liver image recognition method and device based on graph neural network |
CN116934754B (en) * | 2023-09-18 | 2023-12-01 | 四川大学华西第二医院 | Liver image recognition method and device based on graph neural network |
CN116977750A (en) * | 2023-09-25 | 2023-10-31 | 中国地质大学(武汉) | Construction method and classification method of land covering scene classification model |
CN116977750B (en) * | 2023-09-25 | 2023-12-12 | 中国地质大学(武汉) | Land cover scene classification model construction method and classification method |
CN118708943A (en) * | 2024-08-27 | 2024-09-27 | 中国电建集团西北勘测设计研究院有限公司 | Feature extraction model training method, data classification method, device and equipment |
CN118708943B (en) * | 2024-08-27 | 2024-12-17 | 中国电建集团西北勘测设计研究院有限公司 | Training method of feature extraction model, data classification method, device and equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||