Disclosure of Invention
The purpose of the invention is as follows: the invention aims to solve the technical problem that traditional identification methods for the mesoscale convection system (MCS) in the meteorological field rely on varied, subjective judgment criteria, and therefore provides an identification and tracking method for the mesoscale convection system based on anchor-free image detection. Considering that the shape and scale of an MCS vary greatly, the current mainstream instance segmentation methods based on heuristically preset anchor frames perform poorly in this application; the invention therefore directly predicts, in a pixel-by-pixel fully convolutional manner, the category of each position in the feature map, the distances to the four edges of the corresponding frame, and the generated Mask, and on this basis the MCS is tracked continuously and effectively over multiple time steps.
The method specifically comprises the following steps:
step 1, preprocessing an original infrared brightness temperature data file according to the relevant description of satellite data: cutting original satellite data, marking image polygon examples, and randomly dividing a training set, a verification set and a test set;
step 2, constructing an anchor-free mesoscale convection system instance segmentation convolutional neural network: the network is divided into a backbone network for feature extraction, a feature pyramid network for fusing multi-scale features, a prediction frame network head for classification, distance frame regression and centerness (center-ness) prediction, a Mask network head for generating a Mask, and a network head for predicting the Mask IoU;
step 3, performing multi-scale image enhancement on the training set, and automatically learning network parameters by using the mesoscale convective system example segmentation convolutional neural network constructed in the step 2 of migration learning supervision training: training by adopting a small batch random gradient descent method, and setting loss functions of three network heads, namely a prediction frame, a generated Mask and a prediction Mask IoU;
step 4, carrying out mesoscale convection system example segmentation on the stationary satellite infrared cloud pictures at adjacent moments by using the trained network to obtain mesoscale convection system related records at continuous moments;
step 5, on the basis of the step 4, tracking of the mesoscale convection system is realized according to a related target matching principle, and meanwhile, the commonly occurring splitting and merging conditions are considered;
the method comprises the following steps of 1:
the data set used in the present invention is full resolution (4km) light and temperature data (60N-60S) for monitoring global precipitation obtained from 11 micron infrared channels on geostationary satellites GMS-5, GOES-8, Goes-10, Metasat-7, and Metasat-5, as provided by the National Oceanic and Atmospheric Administration, NOAA. In order to reduce discontinuities between adjacent geostationary satellites, the acquired data has been corner angle correlation corrected and finally stored in the form of a rectangular grid. Each infrared bright temperature data file name format is merg _ yyymddhh _4km-pixel, where yyyy represents year (e.g. 2020), mm represents month (range 01-12), dd represents date (range 01-31), and hh represents hour (range 00-23). Each file contains 2 records of light temperature data corresponding to hours 0 and 30, respectively, each record being 9896 × 3298 in size (9896 dimensions span 0.036378335 and 3298 dimensions span 0.036383683). Meanwhile, in order to store data with the size of one byte (8 bits with the range size of 0-255), the data obtained is to subtract 75 on the basis of the real monitored brightness temperature value, and 255 (corresponding to the real brightness temperature value 330) in the obtained data is an absent value.
Step 1-1, reading an infrared brightness temperature data file of each original geostationary satellite according to the related description of satellite data, wherein the file is an array in a format of 2 x 3298 x 9896, and obtaining two gray cloud pictures of 0 minute and 30 minutes at corresponding time;
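For illustration, the following is a minimal NumPy sketch of step 1-1. It assumes the merg file stores the two records as raw unsigned bytes in row-major order and that the stored value is the real brightness temperature minus 75, with 255 marking a missing value; the file name and helper function are hypothetical.

import numpy as np

def read_merg_file(path: str) -> np.ndarray:
    """Return an array of shape (2, 3298, 9896) of real brightness temperatures."""
    raw = np.fromfile(path, dtype=np.uint8)
    records = raw.reshape(2, 3298, 9896).astype(np.float32)
    # Stored value = real brightness temperature - 75; 255 marks a missing value.
    bt = records + 75.0
    bt[records == 255] = np.nan
    return bt

bt = read_merg_file("merg_2020070112_4km-pixel")  # hypothetical file name
print(bt.shape)  # (2, 3298, 9896): the gray cloud images at minute 0 and minute 30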
In step 1-2, since the remote sensing satellite image has a size of 3298 × 9896, which is too large to be fed directly into the subsequent instance segmentation network, the gray cloud image obtained in step 1-1 is cropped. According to the latitude range 27N to 40N and the longitude range 110E to 125E of the Jianghuai region of China, combined with the relevant description of the brightness temperature data, the size of the Jianghuai region of China in the image is obtained as follows:
the result is slightly enlarged, and each gray cloud image is cut into multiple (generally 240) gray infrared cloud sub-images with a size of 420 × 360;
Step 1-3, polygon instance-level labels of the mesoscale convection system are given for each gray infrared cloud sub-image, sub-images containing no mesoscale convection system are filtered out, and a json file corresponding to each gray infrared cloud sub-image is obtained, with only one category in all labels. The gray infrared cloud sub-images are then randomly divided at a ratio of 6:2:2 into 9407 training images, 3135 verification images and 3136 test images, thereby constituting the training set, verification set and test set.
The step 2 comprises the following steps:
Step 2-1, constructing a backbone network for feature extraction: the backbone network in the segmentation network of the embodiment of the invention adopts the convolutional neural network VoVNetV2-99 (reference: An Energy and GPU-Computation Efficient Backbone Network for Real-Time Object Detection) with a total of 99 layers. The convolutional layers in the convolutional neural network VoVNetV2-99 are all in the form of Conv-BN-ReLU, i.e. a convolutional layer followed in turn by a batch normalization layer BN and a linear rectification function ReLU. Meanwhile, cross-layer connections similar to those of the residual network ResNet are adopted in the convolutional neural network VoVNetV2-99 to realize identity mapping, and a residual block containing the identity mapping is defined as:
Y = F(X, {W_i}) + X
where X and Y respectively represent the input and output feature maps of each building block, and F(X, {W_i}) is the residual mapping function that needs to be learned. Meanwhile, in order to improve the quality of the feature representation, a channel attention eSE (Effective Squeeze-Excitation) mechanism (reference: CenterMask: Real-Time Anchor-Free Instance Segmentation) is also introduced into the network residual block, so that the network focuses more on important feature map channels and suppresses irrelevant channels; the specific implementation is as follows:
A_eSE(X_div) = σ(W_C(gap(X_div)))
where X_div is the diversified feature map obtained by dimensionality reduction after the concatenation of the network feature maps, gap(X) = (1/(W × H)) Σ_{i,j} X_{i,j} (W and H are respectively the width and height of feature map X) is the channel-level global average pooling, X_{i,j} is the feature value of feature map X at (i, j), W_C is the fully connected layer weight, σ denotes the Sigmoid activation function, and A_eSE(X_div) is the computed channel attention feature descriptor; element-level multiplication (denoted by the symbol ⊗) of this descriptor with X_div finally yields the refined feature map X_refine, i.e. X_refine = A_eSE(X_div) ⊗ X_div.
The specific construction steps of the convolutional neural network VoVNetV2-99 comprise:
Step 2-1-1, constructing the network Stem stage 1: this stage contains three convolutional layers: first, a convolutional layer with a convolution kernel size of 3 × 3, a step size of 2, padding of 1 and 64 output channels (unless otherwise stated, a convolutional layer with a 3 × 3 kernel defaults to a step size of 1 and padding of 1) is used to downsample the input image; it is followed by a convolutional layer with a 3 × 3 kernel and 64 output channels and a convolutional layer with a 3 × 3 kernel, a step size of 2 and 128 output channels. After the input image passes through this stage, the first scale feature map C1 is generated, and feature map C1 has an Output Stride of 4 with respect to the input image;
Step 2-1-2, constructing the network One-Shot Aggregation (OSA) module stage 2: this stage contains one residual block, which contains 5 convolutional layers with a convolution kernel size of 3 × 3 and 128 output channels. After the input passes through the 5 convolutional layers in turn, diversified feature maps with 128 channels each are obtained; at the last convolutional layer, the feature map of this layer, the feature maps of the first 4 layers and the input first scale feature map C1 are concatenated to obtain a feature map with 128 × 6 = 768 channels. Dimensionality reduction is then performed through a convolutional layer with a kernel size of 1 × 1, a step size of 1, padding of 0 and 256 output channels to obtain the diversified feature map X_div, and the channel attention eSE mechanism is applied to obtain the final X_refine; X_refine also needs to be added element-wise to the input feature map to realize the identity mapping. Finally, this stage obtains the second scale feature map C2 with 256 channels, and feature map C2 has an Output Stride of 4 with respect to the input image;
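As an illustration of the structure described in step 2-1-2 (concatenation of successive 3 × 3 convolutions, 1 × 1 dimensionality reduction, eSE channel attention and identity mapping), the following is a minimal PyTorch sketch; layer sizes follow the text, but the module and variable names are our own and details such as weight initialization are omitted.

import torch
import torch.nn as nn

class ESEAttention(nn.Module):
    """Effective Squeeze-Excitation: global average pooling -> 1x1 conv (fully connected) -> sigmoid -> scale."""
    def __init__(self, channels: int):
        super().__init__()
        self.fc = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        w = torch.sigmoid(self.fc(x.mean(dim=(2, 3), keepdim=True)))  # A_eSE(X_div)
        return x * w                                                   # X_refine

class OSABlock(nn.Module):
    """One-Shot Aggregation block of stage 2: 5 conv layers, concatenation, 1x1 reduce, eSE, residual."""
    def __init__(self, in_ch=128, stage_ch=128, out_ch=256, num_conv=5):
        super().__init__()
        self.convs = nn.ModuleList()
        ch = in_ch
        for _ in range(num_conv):
            self.convs.append(nn.Sequential(
                nn.Conv2d(ch, stage_ch, 3, padding=1), nn.BatchNorm2d(stage_ch), nn.ReLU(inplace=True)))
            ch = stage_ch
        concat_ch = in_ch + stage_ch * num_conv            # 128 + 128*5 = 768 for stage 2
        self.reduce = nn.Sequential(
            nn.Conv2d(concat_ch, out_ch, 1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.ese = ESEAttention(out_ch)
        self.identity = in_ch == out_ch                     # residual addition only when shapes match

    def forward(self, x):
        feats = [x]
        y = x
        for conv in self.convs:
            y = conv(y)
            feats.append(y)
        out = self.ese(self.reduce(torch.cat(feats, dim=1)))  # X_div -> X_refine
        return out + x if self.identity else out

c2 = OSABlock()(torch.randn(1, 128, 105, 90))  # e.g. a 420x360 input after the stride-4 stem
print(c2.shape)  # torch.Size([1, 256, 105, 90])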
Step 2-1-3, constructing the network one-shot aggregation module stage 3: this stage is similar to OSA module stage 2, except that it first performs 2-fold downsampling using a 3 × 3 max pooling layer with step size 2 and padding 0, and employs 3 × 3 modulated deformable convolution (reference: Deformable ConvNets v2: More Deformable, Better Results) in the residual block instead of regular convolution. The modulated deformable convolution is defined as:
y(p) = Σ_{k=1}^{K} w_k · x(p + p_k + Δp_k) · Δm_k
where K is the total number of convolution kernel sampling locations (e.g., K is 9 for a convolution kernel size of 3 × 3), w_k is the convolution kernel weight, p_k is the predefined offset of the k-th position with respect to the center of the receptive field (e.g., when K is 9, p_k ∈ {(-1,-1), (-1,0), (-1,1), (0,-1), (0,0), (0,1), (1,-1), (1,0), (1,1)}), x(p) and y(p) are respectively the feature value of the input feature map x at position p and the feature value of the output feature map y at position p, Δp_k is the learnable offset of the k-th position, and Δm_k ∈ [0,1] is the modulation scalar of the k-th position. The modulated deformable convolution is implemented by adding, alongside the conventional convolution, an additional convolutional layer with the same spatial resolution and dilation rate as the conventional convolution to learn the offset in the x and y directions of the two-dimensional plane and the modulation scalar for each position of the feature map.
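The modulated deformable convolution described above corresponds to the operator available in torchvision; a minimal sketch is shown below, assuming torchvision ≥ 0.9 (which accepts the modulation mask). The offset/mask layer and the sigmoid on the mask follow the Deformable ConvNets v2 paper, while the class and variable names are our own.

import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class ModulatedDeformConv3x3(nn.Module):
    """3x3 modulated deformable conv: a parallel conv predicts offsets (2K values) and modulation (K values)."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        k = 9  # K sampling locations for a 3x3 kernel
        # Extra conv layer with the same resolution/dilation learns offsets and modulation scalars.
        self.offset_mask = nn.Conv2d(in_ch, 3 * k, kernel_size=3, padding=1)
        self.deform = DeformConv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x):
        om = self.offset_mask(x)
        offset, mask = om[:, :18], torch.sigmoid(om[:, 18:])  # Δp_k (x, y pairs) and Δm_k in [0, 1]
        return self.deform(x, offset, mask)

y = ModulatedDeformConv3x3(160, 160)(torch.randn(1, 160, 52, 45))
print(y.shape)  # torch.Size([1, 160, 52, 45])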
The OSA module stage 3 contains 3 residual blocks, each of which contains 5 modulated deformable convolutional layers with a convolution kernel size of 3 × 3 and 160 output channels. As in the network one-shot aggregation module stage 2 of step 2-1-2, at the last convolutional layer the feature map of this layer, the feature maps of the first 4 layers and the input second scale feature map C2 are concatenated to obtain a diversified feature map with 256 + 160 × 5 = 1056 channels (this is the diversified feature map of the first residual block; similarly, the number of diversified feature map channels obtained after the concatenation operations of the second and third residual blocks is 512 + 160 × 5 = 1312). The number of output channels of the conventional 1 × 1 convolutional layer with a step size of 1 and padding of 0 used for dimensionality reduction after the feature map concatenation is 512, and the channel attention mechanism is also included. After the operation of OSA module stage 3, the third scale feature map C3 with 512 channels is finally obtained, and feature map C3 has an Output Stride of 8 with respect to the input image;
Step 2-1-4, constructing the network one-shot aggregation module stage 4: this stage is similar to the OSA module stage 3 described above, except that it contains 9 residual blocks, each containing 5 modulated deformable convolutional layers with a convolution kernel size of 3 × 3 and 192 output channels. As in the network one-shot aggregation module stages 2 and 3, at the last convolutional layer the feature map of this layer, the feature maps of the first 4 layers and the input third scale feature map C3 are concatenated to obtain a diversified feature map with 512 + 192 × 5 = 1472 channels (this is the diversified feature map of the first residual block; similarly, the number of diversified feature map channels obtained after the concatenation operations of the second to ninth residual blocks is 768 + 192 × 5 = 1728). The number of output channels of the conventional 1 × 1 convolutional layer with a step size of 1 and padding of 0 used for dimensionality reduction after the feature map concatenation is 768, and the channel attention mechanism is also included. The fourth scale feature map C4 is finally obtained after the network one-shot aggregation module stage 4, and feature map C4 has an Output Stride of 16 with respect to the input image;
Step 2-1-5, constructing the last structure of the backbone network, the one-shot aggregation module stage 5: the structure of this stage is similar to stage 4, but it contains only 3 residual blocks, each of which contains 5 modulated deformable convolutional layers with a convolution kernel size of 3 × 3 and 224 output channels. As in the network one-shot aggregation module stages 2 and 3, at the last convolutional layer the feature map of this layer, the feature maps of the first 4 layers and the input fourth scale feature map C4 are concatenated to obtain a diversified feature map with 768 + 224 × 5 = 1888 channels (this is the diversified feature map of the first residual block; similarly, the number of diversified feature map channels obtained after the concatenation of the second and third residual blocks is 1024 + 224 × 5 = 2144). The number of output channels of the conventional 1 × 1 convolutional layer with a step size of 1 and padding of 0 used for dimensionality reduction after the feature map concatenation is 1024, and the channel attention mechanism is also included. After this stage, the fifth scale feature map C5 is finally obtained, and feature map C5 has an Output Stride of 32 with respect to the input image;
Step 2-2, fusing the multi-scale features with a Feature Pyramid Network (FPN) (reference: Feature Pyramid Networks for Object Detection). The feature pyramid network FPN combines the features {C3, C4, C5} of different scales obtained in step 2-1 in a top-down manner with lateral connections to fuse the features and obtain {M3, M4, M5}. M5 is obtained by passing the feature map C5 through a 1 × 1 convolutional layer; M4 is obtained by element-level addition of the nearest-neighbor 2-fold upsampling of M5 and the feature map obtained by passing C4 through a 1 × 1 convolutional layer; M3 is obtained by element-level addition of the nearest-neighbor 2-fold upsampling of M4 and the feature map obtained by passing C3 through a 1 × 1 convolutional layer.
Finally, for each layer in the { M3, M4 and M5}, the aliasing influence caused by nearest neighbor interpolation is relieved through convolution with the convolution kernel size of 3 multiplied by 3 and the number of output channels of 256, and the feature layer { P3, P4 and P5} is obtained.
A P6 feature layer and a P7 feature layer are additionally added to the instance segmentation network; they are obtained by 2-fold downsampling of P5 and P6, respectively, through a 3 × 3 convolutional layer with a step size of 2, finally yielding the feature layers {P3, P4, P5, P6, P7};
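As an illustration of step 2-2 (1 × 1 lateral connections, top-down nearest-neighbor upsampling, 3 × 3 smoothing, and the extra P6/P7 layers), here is a minimal PyTorch sketch; the channel counts for C3-C5 follow the backbone description above, and the class name is our own.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FPN(nn.Module):
    """Builds P3-P7 from backbone features C3 (512 ch), C4 (768 ch), C5 (1024 ch)."""
    def __init__(self, in_channels=(512, 768, 1024), out_ch=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_ch, 1) for c in in_channels)   # 1x1 lateral convs
        self.smooth = nn.ModuleList(nn.Conv2d(out_ch, out_ch, 3, padding=1) for _ in in_channels)
        self.p6 = nn.Conv2d(out_ch, out_ch, 3, stride=2, padding=1)                  # downsample P5 -> P6
        self.p7 = nn.Conv2d(out_ch, out_ch, 3, stride=2, padding=1)                  # downsample P6 -> P7

    def forward(self, c3, c4, c5):
        m5 = self.lateral[2](c5)
        m4 = self.lateral[1](c4) + F.interpolate(m5, scale_factor=2, mode="nearest")
        m3 = self.lateral[0](c3) + F.interpolate(m4, scale_factor=2, mode="nearest")
        p3, p4, p5 = (s(m) for s, m in zip(self.smooth, (m3, m4, m5)))  # 3x3 convs reduce aliasing
        p6 = self.p6(p5)
        p7 = self.p7(p6)
        return p3, p4, p5, p6, p7

fpn = FPN()
outs = fpn(torch.randn(1, 512, 56, 48), torch.randn(1, 768, 28, 24), torch.randn(1, 1024, 14, 12))
print([o.shape[-2:] for o in outs])  # strides 8, 16, 32, 64, 128 relative to the input image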
Step 2-3, constructing the instance segmentation frame prediction head, which comprises 2 branches: a classification branch, and a multi-task branch in which centerness (reference: FCOS: Fully Convolutional One-Stage Object Detection) prediction and distance frame regression are parallel. The construction specifically comprises the following steps:
Step 2-3-1, constructing the classification branch: after each feature layer (i.e., P3-P7) of the feature pyramid FPN, three conventional convolutional layers with 256 input and output channels and a 3 × 3 convolution kernel and one modulated deformable convolutional layer are connected in sequence; these four convolutional layers do not use Batch Normalization (BN) but Group Normalization (GN), to avoid the influence of the batch size on the network model. Since only the single category of the mesoscale convection system is detected, a convolutional layer for prediction classification is added at the end of the classification branch, with 256 input channels, 1 output channel and a 3 × 3 convolution kernel;
Step 2-3-2, constructing the multi-task branch in which centerness and distance frame regression are parallel: after the feature pyramid network FPN feature layer of each scale, this part likewise connects in sequence three conventional convolutional layers with the same structure as in the classification branch and one modulated deformable convolutional layer. However, these four convolutional layers are followed by two parallel convolutional branches: a 3 × 3 convolutional layer with 4 output channels for frame regression, whose four output values respectively represent the distances from the current position to the four edges of the frame; and a 3 × 3 convolutional layer with 1 output channel for predicting centerness, whose output is a one-dimensional centerness value;
Assuming that the distance d from a certain sample point (x, y) on the feature map F to the four sides of the target frame to which the point belongs is d = (l, t, r, b), where l, t, r, b respectively represent the distances from the sample point (x, y) to the left, upper, right and lower sides of the rectangular frame, the centerness is defined as:
centerness = sqrt( (min(l, r) / max(l, r)) × (min(t, b) / max(t, b)) )
where min () and max () are minimum and maximum taking functions, respectively. When the example segmented convolutional neural network is used for detection, the predicted centrality value is multiplied by the classification score to obtain the final confidence. The centrality branch mainly suppresses low-quality frames farther from the center point of the target object, and it can be found from the centrality definition that if the sample point (x, y) is closer to the center of the target frame, the centrality value is closer to 1, otherwise, the centrality value is closer to 0, so that the classification of the farther points is multiplied by a smaller centrality value to obtain a lower confidence, so that the low-quality frames regressed by the farther points are more easily filtered in the Non-Maximum Suppression (NMS) stage.
Step 2-4: similarly to the region proposal network RPN in anchor-based detectors such as Mask R-CNN, the frame prediction head predicts a number of frame regions, and 100 regions of interest (RoI) are obtained after part of the regions are eliminated through non-maximum suppression with an IoU threshold of 0.6. Following the implementation principle of Mask R-CNN, the RoI head of the anchor-free instance segmentation network is constructed as follows:
Step 2-4-1, constructing the region-of-interest alignment RoI Align layer (reference: Mask R-CNN): the RoI Align layer first maps each region of interest RoI, according to its size, to the corresponding feature pyramid network FPN feature layer P_k, implemented as:
k = Ceil(k_max - log2(A_input / A_RoI))
where A_input and A_RoI respectively represent the area of the input image and the area of the RoI, k_max is the level number of the last layer of the backbone network and is set to 5, Ceil() is the ceiling (round-up) function, and k is the level number of the feature pyramid network FPN feature layer to which the RoI is mapped;
the RoI Align layer then divides the corresponding region on the mapped feature map into 14 × 14 = 196 small areas of equal size, samples each area adaptively by taking the center point of each small area and computing its value with bilinear interpolation, and finally obtains a feature map with a fixed size of 14 × 14;
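A minimal sketch of the RoI-to-FPN-level assignment and the RoIAlign pooling described above, using torchvision's roi_align operator; k_max = 5 and the 14 × 14 output size follow the text, while the function name and the example box are our own.

import math
import torch
from torchvision.ops import roi_align

def fpn_level(box, input_area, k_min=3, k_max=5):
    """Assign an RoI (x1, y1, x2, y2 in input-image coordinates) to an FPN level P_k."""
    x1, y1, x2, y2 = box
    roi_area = max((x2 - x1) * (y2 - y1), 1e-6)
    k = math.ceil(k_max - math.log2(input_area / roi_area))
    return min(max(k, k_min), k_max)            # clamp to the available levels P3-P5

# Example: pool a 14x14 feature from level P3 (stride 8) for one box.
p3 = torch.randn(1, 256, 56, 48)                     # hypothetical P3 feature map of a 448x384 input
boxes = torch.tensor([[0., 40., 32., 180., 140.]])   # (batch_idx, x1, y1, x2, y2) in image coordinates
roi_feat = roi_align(p3, boxes, output_size=(14, 14), spatial_scale=1.0 / 8, sampling_ratio=1)
print(fpn_level((40., 32., 180., 140.), input_area=448 * 384), roi_feat.shape)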
step 2-4-2, constructing a Mask head comprising a spatial attention mechanism: the Mask header contains four convolutional layers with a convolutional kernel size of 3 × 3 and 256 input/output channels. In order to make the branch focus more on the meaningful pixel points, a spatial attention module is introduced after the fourth convolution layer, and the implementation mechanism is as follows:
A_sag(X_i) = σ(F_3×3(P_max(X_i) ⊕ P_avg(X_i)))
X_sag = A_sag(X_i) ⊗ X_i
where X_i is the input feature map, A_sag(X_i) is the spatial attention feature descriptor, P_max and P_avg represent the feature maps obtained by max pooling and average pooling along the channel dimension, ⊕ represents the concatenation operation, F_3×3 is a 3 × 3 convolutional layer, σ is the Sigmoid function, ⊗ is element-level multiplication, and X_sag is the feature map finally combined with spatial attention;
The resulting X_sag is then upsampled by a deconvolution layer with a convolution kernel size of 2 × 2 and a step size of 2 to obtain a feature map of size 28 × 28 with the same number of channels. The last convolutional layer of the Mask head is a per-category Mask prediction layer; since the detection target is the single category of the mesoscale convection system, the Mask prediction layer has a convolution kernel size of 1 × 1, a step size of 1 and 1 output channel;
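The spatial attention module and the tail of the mask head described in step 2-4-2 can be sketched as follows in PyTorch (channel-wise max/average pooling, concatenation, 3 × 3 convolution, sigmoid gate, then a 2 × 2 stride-2 deconvolution and a 1 × 1 mask prediction layer); names are our own, the ReLU after the deconvolution is an assumption, and the preceding four 3 × 3 convolutions are omitted for brevity.

import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """A_sag(X) = sigmoid(Conv3x3([maxpool_c(X); avgpool_c(X)])); output X_sag = A_sag(X) * X."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=3, padding=1)

    def forward(self, x):
        pooled = torch.cat([x.max(dim=1, keepdim=True).values, x.mean(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.conv(pooled))

class MaskHeadTail(nn.Module):
    """Spatial attention, 2x2 stride-2 deconv to 28x28, then the 1x1 single-class mask predictor."""
    def __init__(self, channels=256):
        super().__init__()
        self.sam = SpatialAttention()
        self.deconv = nn.ConvTranspose2d(channels, channels, kernel_size=2, stride=2)
        self.predict = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, x):                    # x: RoI feature of shape (N, 256, 14, 14)
        return self.predict(torch.relu(self.deconv(self.sam(x))))

mask_logits = MaskHeadTail()(torch.randn(8, 256, 14, 14))
print(mask_logits.shape)  # torch.Size([8, 1, 28, 28])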
Step 2-4-3, constructing the maskIoU head, which uses Mask Scoring (reference: Mask Scoring R-CNN) to re-express the Mask quality: on the basis of step 2-4-2, the output feature map of the Mask prediction layer is first 2-fold downsampled with a 2 × 2 max pooling layer and then concatenated with the 14 × 14, 256-channel RoI feature map output by the RoI Align layer, obtaining a feature map of size 14 × 14 with 257 channels.
The maskIoU header contains four successive convolutional layers with convolutional kernel size of 3 × 3 and output channel number of 256, where the step size of the last convolutional layer is 2. Then, 2 full-connection layers with the output channel number of 1024 and 1 full-connection layer with the output channel number of 1 are also connected;
the step 3 of the invention comprises:
step 3-1, because the infrared cloud atlas of the data set is a single-channel gray image, the gray image in the training set is converted into three channel images of RGB (red, green and blue) for subsequent transfer learning, the values of the three channels of RGB are the same and are the gray values of the gray image, and data enhancement is performed on the converted images in the training set and corresponding example labels: the image is first scaled to multiple scales, with the long side at 1333 and the short side at a random one of {640,672,704,736,768,800}, and the original scale of the image is preserved while also data enhancement is performed using random horizontal flipping.
The pixel values are then centered: since training uses a model pre-trained on the ImageNet dataset, the images need to be normalized according to the ImageNet statistics. The normalization is carried out per channel; the means of the R, G, B channels are 123.675, 116.28 and 103.53 respectively, and the standard deviations are 58.395, 57.12 and 57.375 respectively. The three RGB channel means are combined into a vector μ and the three RGB channel standard deviations into a vector σ; with the input image denoted x, the normalized image data x' is
x' = (x - μ) / σ
The image size is then padded to a multiple of 32 to avoid feature loss in the subsequent convolution operations;
correspondingly, the instance labels of the input image undergo the same transformations as the scaling and horizontal flipping data enhancement, so that correct labels are obtained;
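A minimal sketch of the preprocessing in step 3-1 (gray-to-RGB conversion, ImageNet mean/std normalization, and padding to a multiple of 32); the function name is our own and the data augmentation (multi-scale resizing, random flipping) is omitted.

import numpy as np

IMAGENET_MEAN = np.array([123.675, 116.28, 103.53], dtype=np.float32)
IMAGENET_STD = np.array([58.395, 57.12, 57.375], dtype=np.float32)

def preprocess(gray: np.ndarray) -> np.ndarray:
    """gray: (H, W) uint8 cloud image -> normalized, padded (H', W', 3) float32 array."""
    rgb = np.repeat(gray[:, :, None], 3, axis=2).astype(np.float32)  # same value in R, G, B
    x = (rgb - IMAGENET_MEAN) / IMAGENET_STD                          # x' = (x - mu) / sigma
    pad_h = (32 - x.shape[0] % 32) % 32                               # pad to a multiple of 32
    pad_w = (32 - x.shape[1] % 32) % 32
    return np.pad(x, ((0, pad_h), (0, pad_w), (0, 0)))

out = preprocess(np.random.randint(0, 255, (360, 420), dtype=np.uint8))
print(out.shape)  # (384, 448, 3)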
Step 3-2, setting the classification loss function of the prediction frame head in the instance segmentation network: the classification task uses the Focal Loss function to alleviate the class imbalance problem in one-stage detectors. Since the data set contains only one class of labels, only a single binary classifier needs to be trained. Since the network treats each location as a training sample rather than an anchor box, let the predicted value obtained by the classification branch at position (x_i, y_i) of feature map F_i (i = 3, 4, ..., 7) be p(x_i, y_i), and let the label c*(x_i, y_i) ∈ {0, 1} indicate whether the position is a positive or negative sample: c*(x_i, y_i) = 1 indicates that position (x_i, y_i) is a positive sample, c*(x_i, y_i) = 0 indicates that position (x_i, y_i) is a negative sample, and p(x_i, y_i) is the predicted probability that position (x_i, y_i) is a positive sample;
Position (x_i, y_i) is mapped back to the corresponding position (x', y') on the input image (following the FCOS convention) as:
(x', y') = (floor(s_i / 2) + x_i · s_i, floor(s_i / 2) + y_i · s_i)
where s_i is the Output Stride of feature map F_i with respect to the input image scale. If (x', y') falls into any Ground Truth (GT) frame, c*(x_i, y_i) is 1 and the position is a positive sample; otherwise c*(x_i, y_i) is 0. The α-balanced focal loss function FL is expressed as:
FL(p, c*) = -α (1 - p)^γ log(p) when c* = 1, and -(1 - α) p^γ log(1 - p) when c* = 0
where α is the weighting factor, γ ≥ 0 is the adjustable focusing parameter, and typically α and γ are set to 0.25 and 2.0, respectively. The total classification loss function L_cls for feature map F_i is:
L_cls = Σ_{(x_i, y_i) ∈ F_i} FL(p(x_i, y_i), c*(x_i, y_i));
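A compact sketch of the α-balanced focal loss over one feature level, assuming sigmoid classification outputs; the target tensor encodes c* (1 for positive locations, 0 otherwise) and the function and variable names are our own.

import torch

def focal_loss(pred_prob: torch.Tensor, target: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """Sum of the alpha-balanced focal loss over all locations of one feature map."""
    eps = 1e-6
    p = pred_prob.clamp(eps, 1 - eps)
    pos = -alpha * (1 - p) ** gamma * torch.log(p)            # c* = 1 term
    neg = -(1 - alpha) * p ** gamma * torch.log(1 - p)        # c* = 0 term
    return torch.where(target == 1, pos, neg).sum()

probs = torch.sigmoid(torch.randn(1, 1, 56, 48))   # classification branch output for F3
labels = (torch.rand(1, 1, 56, 48) < 0.05).long()  # sparse positive locations
print(focal_loss(probs, labels))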
Step 3-3, setting the centerness loss function: given a positive sample feature point (x_i, y_i) in feature map F_i with the distance frame regression target d* = (l*, t*, r*, b*), where l*, t*, r*, b* respectively represent the distances from the feature point (x_i, y_i) to the left, upper, right and lower sides of the true frame, the centerness target centerness*(x_i, y_i) of the feature point (x_i, y_i) is defined as:
centerness*(x_i, y_i) = sqrt( (min(l*, r*) / max(l*, r*)) × (min(t*, b*) / max(t*, b*)) )
Obviously centerness*(x_i, y_i) ∈ [0, 1], so during training the centerness branch adopts a binary cross-entropy loss function. Let the predicted centerness obtained by the centerness branch at position (x_i, y_i) of feature map F_i be ô(x_i, y_i); then the total centerness loss function L_ctn of feature map F_i is:
L_ctn = Σ_{(x_i, y_i) ∈ F_i} 1{c*(x_i, y_i) = 1} · BCE(ô(x_i, y_i), centerness*(x_i, y_i))
where c*(x_i, y_i) indicates whether position (x_i, y_i) is a positive (c* = 1) or negative (c* = 0) sample, 1{·} is the indicator function, whose value is 1 if the condition in parentheses holds and 0 otherwise, and BCE(·, ·) denotes the binary cross-entropy;
Step 3-4, setting the distance regression loss function: the regression task adopts the GIoU loss function. Let the predicted distance be d̂ = (l̂, t̂, r̂, b̂), where l̂, t̂, r̂, b̂ respectively represent the distances from the feature point (x_i, y_i) to the left, upper, right and lower sides of the predicted frame, and let the distance regression target be d* = (l*, t*, r*, b*), where l*, t*, r*, b* respectively represent the distances from the feature point (x_i, y_i) to the left, upper, right and lower sides of the true frame. If position (x_i, y_i) falls into multiple GT frames, the frame with the smallest area is selected as the distance regression target. Regarding the frames corresponding to d̂ and d* as bounding boxes B̂ and B*, the GIoU loss function is expressed as:
L_GIoU(d̂, d*) = 1 - GIoU(B̂, B*) = 1 - ( IoU(B̂, B*) - |C \ (B̂ ∪ B*)| / |C| )
where C is the smallest enclosing box of B̂ and B*, |·| represents the area of a region, |C \ (B̂ ∪ B*)| represents the area of the part of region C that contains neither B̂ nor B*, IoU(B̂, B*) is the intersection-over-union of B̂ and B*, and GIoU(B̂, B*) is the generalized intersection-over-union of B̂ and B*.
The total distance regression loss function for feature map F_i is:
L_reg = Σ_{(x_i, y_i) ∈ F_i} 1{c*(x_i, y_i) = 1} · centerness*(x_i, y_i) · L_GIoU(d̂(x_i, y_i), d*(x_i, y_i))
where 1{·} and c*(x_i, y_i) have the same meaning as in step 3-3, and the element-level loss weighting coefficient is centerness*(x_i, y_i), the centerness regression target of position (x_i, y_i);
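A minimal sketch of the GIoU loss for a single location, reconstructing boxes from the (l, t, r, b) distances around a point; the function names are our own.

def giou_loss(point, d_pred, d_target):
    """point: (x, y); d_pred/d_target: (l, t, r, b) distances to the box edges. Returns 1 - GIoU."""
    (x, y) = point
    def to_box(d):
        l, t, r, b = d
        return (x - l, y - t, x + r, y + b)   # (x1, y1, x2, y2)
    def area(box):
        return max(box[2] - box[0], 0.0) * max(box[3] - box[1], 0.0)
    bp, bt = to_box(d_pred), to_box(d_target)
    inter = (max(min(bp[2], bt[2]) - max(bp[0], bt[0]), 0.0) *
             max(min(bp[3], bt[3]) - max(bp[1], bt[1]), 0.0))
    union = area(bp) + area(bt) - inter
    # Smallest enclosing box C of both boxes.
    c = (min(bp[0], bt[0]), min(bp[1], bt[1]), max(bp[2], bt[2]), max(bp[3], bt[3]))
    iou = inter / union
    giou = iou - (area(c) - union) / area(c)
    return 1.0 - giou

print(giou_loss((50, 50), (10, 10, 10, 10), (12, 8, 8, 12)))  # ~0.34 for these overlapping boxes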
Step 3-5, setting the Mask loss function: the prediction frame head in front of the Mask Head yields a number of proposed bounding boxes, and at most 100 RoIs per image are obtained according to a score threshold of 0.05 and a non-maximum suppression IoU threshold of 0.6. To speed up convergence and improve detection performance, the GT frames are also added to the RoIs for network training. Let there be N RoIs in total after adding the GT frames and K GT frames in total. The IoU between each RoI and each GT is calculated, and the RoIs are divided into positive and negative samples according to an IoU threshold of 0.5, with positive samples labeled 1 and negative samples labeled 0; a dictionary D is obtained that maps the i-th RoI (i ∈ [0, N)) to the j-th GT frame that best matches it together with the related GT frame information. Finally, sampling is performed so that positive samples account for 1/4 of all samples, obtaining the training samples X for the Mask Head.
Since the Mask loss is defined only on positive samples, foreground screening of the training samples X is also required. Suppose the prediction frame head yields M positive-sample RoIs; the i-th RoI_i passes through the region-of-interest alignment RoIAlign to yield a feature map F_i of size 14 × 14, and F_i passes through the Mask Head to obtain a predicted feature map pred_i of size 28 × 28. The Mask target gt_mask_i is obtained by first finding the GT information (including category, frame and polygon Mask) according to the RoI index in D, then cropping the frame region on the original Mask according to the frame information, and finally resizing the cropped part to 28 × 28 to obtain the final gt_mask_i. L_mask is computed using the average binary cross-entropy loss function (averaged over the 28 × 28 mask pixels):
L_mask = -Σ_{i=1}^{M} (1 / (28 × 28)) Σ_{(x, y)} [ gt_mask_i(x, y) · log(pred_i(x, y)) + (1 - gt_mask_i(x, y)) · log(1 - pred_i(x, y)) ]
where pred_i(x, y) and gt_mask_i(x, y) are the feature values of pred_i and gt_mask_i at position (x, y);
Step 3-6, setting the maskIoU loss function: let the predicted value obtained by passing the i-th concatenated feature map through the MaskIoU Head, which predicts the mask intersection-over-union, be pred_maskiou_i, and let the maskIoU target be gt_maskiou_i, obtained in step 3-5 as the IoU between the predicted Mask and the Mask information in the corresponding GT. L_maskiou is calculated with an ℓ2 (squared error) loss function, following Mask Scoring R-CNN:
L_maskiou = Σ_{i=1}^{M} (pred_maskiou_i - gt_maskiou_i)²
Step 3-7, setting the multi-task loss function: the classification loss function, centerness loss function, distance regression loss function, Mask loss function and maskIoU loss function are added to obtain the total multi-task loss function L of the mini-batch:
L = (λ_cls / N_cls) L_cls + (λ_ctn / N_ctn) L_ctn + (λ_reg / N_reg) L_reg + (λ_mask / N_mask) L_mask + (λ_maskIoU / N_maskIoU) L_maskIoU
where N_cls, N_ctn, N_reg, N_mask and N_maskIoU are the normalization coefficients of the corresponding loss functions, N_cls = N_ctn = N_reg is the number of positive samples in the prediction frame head, and N_mask = N_maskIoU is the number of positive samples obtained from the proposals predicted by the prediction frame head according to the IoU threshold and the sampling ratio of positive and negative samples. The balance coefficients λ_cls, λ_ctn, λ_reg, λ_mask and λ_maskIoU of each loss are all 1;
Step 3-8, setting the relevant parameters of network learning and training.
The steps 3-8 comprise: transfer learning is performed using the weights of VoVNetV2-99 pre-trained on ImageNet. On the prediction frame head results, the 1000 proposals with the highest confidence are selected, and 100 proposals are retained after non-maximum suppression (NMS) with an IoU threshold of 0.6. Proposals whose IoU with a ground-truth frame is greater than 0.5 are regarded as positive samples in the Mask Head, otherwise as negative samples, and the number of positive samples used for training accounts for 1/4 of all training samples. Meanwhile, the network training adopts mini-batch stochastic gradient descent to optimize the network model, with a batch size of 2 and a learning rate of 0.0025; a Warmup strategy is adopted in the initial 1000 iterations, and Momentum with a coefficient of 0.9 is used. For regularization, a weight decay coefficient of 10^-4 is adopted. The whole training lasts 24 epochs, with a step learning-rate decay strategy applied at the 16th and 22nd epochs with a decay factor of 0.1. After the learning parameters are set, the constructed network is trained with the training set processed in step 3-1;
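The optimizer and schedule described in steps 3-8 can be written, for instance, as the following PyTorch sketch (SGD with momentum 0.9, learning rate 0.0025, weight decay 10^-4, warmup over the first 1000 iterations, and step decay by 0.1 at epochs 16 and 22); the model variable is a placeholder for the constructed instance segmentation network and the warmup ratio is an assumption.

import torch

model = torch.nn.Conv2d(3, 1, 3)  # placeholder for the constructed instance segmentation network
optimizer = torch.optim.SGD(model.parameters(), lr=0.0025, momentum=0.9, weight_decay=1e-4)

# Step decay by 0.1 at epochs 16 and 22 (24 epochs in total).
epoch_scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[16, 22], gamma=0.1)

def warmup_factor(iteration: int, warmup_iters: int = 1000, warmup_ratio: float = 0.001) -> float:
    """Linear warmup of the learning rate over the first 1000 iterations (ratio is an assumption)."""
    if iteration >= warmup_iters:
        return 1.0
    return warmup_ratio + (1.0 - warmup_ratio) * iteration / warmup_iters

for it in (0, 500, 1000):
    print(it, 0.0025 * warmup_factor(it))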
the step 4 of the invention comprises:
step 4-1, performing the same processing as in step 3-1 on the image needing instance segmentation, except that only single-scale scaling is performed on the image during inference, namely the long side is 1333 and the short side is 800;
Step 4-2, the trained network performs forward computation on the test image obtained in step 1-3: in the prediction frame head, proposals with a confidence lower than 0.05 are removed, the 1000 proposals with the highest confidence are screened out, and at most 50 RoIs are obtained through non-maximum suppression (NMS) for Mask branch prediction. Finally, the k-th Mask is taken according to the category k predicted by the classification branch (the data used herein contain only one category), the 28 × 28 Mask is scaled to the size of the corresponding RoI, and binarization with a threshold of 0.5 generates the final Mask;
Step 4-3, the Top-50 proposals pass through the maskIoU head to predict the Mask IoU, which is multiplied by the classification confidence to obtain the Mask confidence;
Step 4-4, according to the Masks generated in step 4-2 and the corresponding Mask confidences obtained in step 4-3, non-maximum suppression (NMS) with an IoU threshold of 0.5 is applied to screen out the final predicted Masks;
and 4-5, carrying out mesoscale convection system example segmentation on the stationary satellite infrared cloud pictures of the Jianghuai region of China at adjacent moments to obtain mesoscale convection system examples at continuous moments.
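A small sketch of the mask post-processing in steps 4-2 and 4-3: rescaling the 28 × 28 mask to the RoI, binarizing at 0.5, and rescoring with the predicted mask IoU; box coordinates and function names are illustrative.

import torch
import torch.nn.functional as F

def paste_mask(mask_28, box, score_cls, score_maskiou, image_size, threshold=0.5):
    """mask_28: (28, 28) probabilities; box: (x1, y1, x2, y2). Returns binary mask and mask confidence."""
    x1, y1, x2, y2 = [int(round(v)) for v in box]
    h, w = max(y2 - y1, 1), max(x2 - x1, 1)
    resized = F.interpolate(mask_28[None, None], size=(h, w), mode="bilinear", align_corners=False)[0, 0]
    full = torch.zeros(image_size)
    full[y1:y2, x1:x2] = (resized > threshold).float()      # binarize at 0.5 inside the RoI
    return full, score_cls * score_maskiou                   # mask confidence = cls score x predicted mask IoU

mask, conf = paste_mask(torch.rand(28, 28), (100, 60, 220, 180), 0.9, 0.8, (360, 420))
print(mask.sum().item(), conf)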
The step 5 of the invention comprises:
Step 5-1, for each mesoscale convection system instance, obtain the centroid coordinate (x̄, ȳ), the characteristic area, and the intensity P:
x̄ = (1/N) Σ_{i=1}^{N} x_i,  ȳ = (1/N) Σ_{i=1}^{N} y_i,  area = N
where N is the total number of pixels covered by the mesoscale convection system instance, x_i and y_i are the abscissa and ordinate of the i-th pixel, f(i) is the brightness temperature value of the i-th pixel, and the intensity P is computed from f_z = Σ_{i=1}^{N} f(i), the accumulated brightness temperature of the pixels in the instance; the image resolution needs to be combined in the subsequent calculations;
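A minimal NumPy sketch of step 5-1, computing the centroid, pixel-count area and accumulated brightness temperature from a binary instance mask; taking the intensity as the mean brightness temperature of the instance is our assumption.

import numpy as np

def mcs_properties(mask: np.ndarray, brightness_temp: np.ndarray):
    """mask: (H, W) bool instance mask; brightness_temp: (H, W) brightness temperatures."""
    ys, xs = np.nonzero(mask)
    n = xs.size                                   # area in pixels (convert with the 4 km resolution later)
    centroid = (xs.mean(), ys.mean())             # (x_bar, y_bar)
    f_z = brightness_temp[mask].sum()             # accumulated brightness temperature
    intensity = f_z / n                           # assumed: mean brightness temperature of the instance
    return centroid, n, intensity

m = np.zeros((360, 420), dtype=bool)
m[100:140, 200:260] = True
print(mcs_properties(m, np.full((360, 420), 210.0)))  # ((229.5, 119.5), 2400, 210.0)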
Step 5-2, for mesoscale convection systems at two adjacent times, the duration, centroid position change, area change and intensity change are computed in time order; if, for two mesoscale convection systems at adjacent times, the centroid position change is no more than 50 m/s, the area change is no more than 5 km², and the intensity change is no more than 0.001 °C/s, they are initially judged to be the same target;
Step 5-3, the splitting and merging phenomena that commonly occur in mesoscale convection systems also need to be handled. If n mesoscale convection systems MCS at a certain time, recorded as X_1, X_2, ..., X_n, all satisfy the tracking matching principle of step 5-2 with one MCS at the previous time, recorded as P_j, the phenomenon belongs to the splitting case, and the X_m (1 ≤ m ≤ n) with the largest area after splitting is selected to continue P_j; the previous-time index value of the track of X_m is updated to j and its duration to the duration of P_j plus 1, while the other MCSs are regarded as newly generated, with an empty previous-time index value and a duration initialized to 1;
If more than two MCSs at the previous time, recorded as P_1, P_2, ..., P_z, all satisfy the MCS tracking matching principle with one MCS at the current time, denoted X_a, the phenomenon belongs to the merging case; the P_m (1 ≤ m ≤ z) with the largest area at the previous time is selected as the previous-time track of X_a, the previous-time index value of the track of X_a is updated to m and its duration to the duration of P_m plus 1, and the life cycles of the previous-time MCSs whose areas are not the largest are ended. When the life cycle of an MCS ends, it is judged whether its duration is no less than 1 h; if so, the corresponding MCS path is recovered in reverse order according to the previous-time index values, thereby realizing the tracking of the MCS.
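The matching and split handling of steps 5-2 and 5-3 can be sketched as follows; the MCS record fields, the 30-minute time step, and the threshold units are our assumptions for illustration.

from dataclasses import dataclass
from math import hypot
from typing import Optional

@dataclass
class MCS:
    centroid: tuple            # (x, y) in km
    area: float                # km^2
    intensity: float           # brightness-temperature based intensity (assumed: degrees C)
    duration: int = 1
    prev_index: Optional[int] = None   # index of the matched MCS at the previous time

def is_same_target(prev: MCS, cur: MCS, dt_s: float = 1800.0) -> bool:
    """Step 5-2 criteria: centroid speed <= 50 m/s, area change <= 5 km^2, intensity change <= 0.001 C/s."""
    speed = hypot(cur.centroid[0] - prev.centroid[0], cur.centroid[1] - prev.centroid[1]) * 1000.0 / dt_s
    return (speed <= 50.0 and abs(cur.area - prev.area) <= 5.0
            and abs(cur.intensity - prev.intensity) / dt_s <= 0.001)

def link(prev_list, cur_list):
    """For each previous MCS, let the largest matching current MCS continue its track (split handling);
    unmatched current MCSs start new tracks with duration 1."""
    for j, prev in enumerate(prev_list):
        children = [c for c in cur_list if is_same_target(prev, c)]
        if children:
            largest = max(children, key=lambda c: c.area)
            largest.prev_index, largest.duration = j, prev.duration + 1
    return cur_list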
Beneficial effects: MCS identification and tracking are core concerns of meteorological disaster forecasting. In the past, MCS identification methods were usually based on traditional image characteristics; such methods depend on the selection of judgment thresholds, and the whole process involves many image processing techniques and is rather complicated. The invention adopts a method based on a deep convolutional neural network to identify the MCS and, through an anchor-free fully convolutional formulation, avoids the sensitivity of detection methods based on heuristically preset anchor frames to parameters such as anchor frame size, aspect ratio and number. Meanwhile, the invention further improves the modeling capability for the geometric deformation of the MCS by incorporating deformable convolution, and focuses more on important channel and spatial information. Compared with other deep learning segmentation methods, the method not only achieves better segmentation performance on MCS identification, but also has fewer network parameters and faster training and inference.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
As shown in fig. 1, the workflow of identification and tracking of the mesoscale convection system constructed by the method of the present invention can be roughly divided into four stages: the first stage, the data of original satellite data is preprocessed and labeled; the second stage, constructing an instance segmentation network model; in the third stage, training and deducing a network model; and a fourth stage, tracking the detected continuous time mesoscale convection system example. The method for identifying and tracking the mesoscale convection system in the embodiment of the invention specifically comprises the following construction steps:
Step 1: since mesoscale convection systems mostly occur in summer, geostationary satellite infrared brightness temperature data from June to September of the years 2000 to 2017, provided by the National Oceanic and Atmospheric Administration, are partially and randomly screened to serve as the working data of the embodiment of the invention. The screened original satellite data then need to be preprocessed:
(1) reading an array of each original geostationary satellite infrared brightness temperature data file in a format of 2 x 3298 x 9896 according to the related description of the satellite data to obtain a gray cloud chart as shown in fig. 2 at corresponding time points of 0 and 30, wherein a blank part with a pixel value of 255 in the gray cloud chart is regarded as a missing value;
(2) Since the remote sensing satellite image has a size of 3298 × 9896, which is too large to be fed directly into the subsequent instance segmentation network, the gray cloud image obtained in (1) is cropped. According to the latitude range 27N to 40N and the longitude range 110E to 125E of the Jianghuai region of China, combined with the relevant description of the brightness temperature data, the size of the Jianghuai region of China in the image is obtained as follows:
the result is slightly enlarged, and the whole gray cloud image is cut into a number of 420 × 360 sub-images;
(3) and giving polygon example level labels of the mesoscale convection system for each gray level infrared cloud subgraph, filtering out subgraphs without the mesoscale convection system, and obtaining json files corresponding to each image, wherein the whole label only has one category. And randomly dividing the sub-image into 9407 training set images, 3135 verification set images and 3136 test set images at a ratio of 6:2: 2;
step 2, constructing an anchor-frame-free mesoscale convection system example segmentation convolutional neural network, wherein the example segmentation network structure is shown in fig. 3, and the example segmentation network structure in fig. 3 comprises: a backbone network for extracting deep abstract features of the image, wherein the backbone network comprises five stage feature maps of C1, C2, C3, C4 and C5; the feature pyramid network fusing the multi-scale features also comprises five feature layers P3, P4, P5, P6 and P7 with different scales; a network header for a predicted bounding box, comprising two large branches: classifying large Classication branches, large branches with the centrality Center-Ness branch and the distance frame Regression branch in parallel, wherein the head is shared by feature pyramid layers with different scales; a network head of a prediction Mask, which contains a space attention mechanism SAM and needs to perform region-of-interest alignment RoIAlign before entering the head; the input feature map is obtained by performing maximum pooling Maxpooling downsampling on a Mask head output feature map and performing cascade operation Concat on a feature map obtained by RoIAlign;
and 2-1, constructing a backbone network for feature extraction. The backbone network in the split network of the embodiment of the invention adopts a convolutional neural network VoVNetV2-99 with a total of 99 layers. The convolutional layers in the backbone network are all in the form of Conv-BN-ReLU, namely, the combination of the convolutional layers, the batch normalization layer BN and the linear rectification function ReLU is sequentially carried out. Meanwhile, cross-layer connection in a similar residual error network ResNet is adopted in the network structure block to realize identity mapping, and the residual error block containing the identity mapping is defined as follows:
Y = F(X, {W_i}) + X
where X and Y respectively represent the input and output feature maps of each building block, and F(X, {W_i}) is the residual mapping function that needs to be learned. Meanwhile, in order to improve the quality of the feature representation, a channel attention eSE (Effective Squeeze-Excitation) mechanism is also introduced into the network residual block, so that the network focuses more on important feature map channels and suppresses irrelevant channels; the specific implementation is as follows:
A_eSE(X_div) = σ(W_C(gap(X_div)))
where X_div is the diversified feature map obtained by dimensionality reduction after the concatenation of the network feature maps, gap(X) = (1/(W × H)) Σ_{i,j} X_{i,j} (W and H are the width and height of feature map X) is the channel-level global average pooling, X_{i,j} is the feature value of feature map X at (i, j), W_C is the fully connected layer weight, σ denotes the Sigmoid activation function, and A_eSE(X_div) is the computed channel attention feature descriptor; element-level multiplication (denoted ⊗) of this descriptor with X_div finally yields the refined feature map X_refine, i.e. X_refine = A_eSE(X_div) ⊗ X_div. The residual block structure combining the identity mapping and the channel attention mechanism is shown in fig. 4. The structure of the whole backbone network VoVNetV2-99 specifically includes the following:
a. Stem stage 1: this stage contains three convolutional layers: first, a convolutional layer with a convolution kernel size of 3 × 3, a step size of 2, padding of 1 and 64 output channels (unless otherwise stated, a convolutional layer with a 3 × 3 kernel defaults to a step size of 1 and padding of 1) is used to downsample the input image; it is followed by a convolutional layer with a 3 × 3 kernel and 64 output channels and a convolutional layer with a 3 × 3 kernel, a step size of 2 and 128 output channels. After the input image passes through this stage, the first scale feature map C1 is generated, and feature map C1 has an Output Stride of 4 with respect to the input image;
b. One-Shot Aggregation (OSA) module stage 2: this stage comprises one residual block, which contains 5 convolutional layers with a convolution kernel size of 3 × 3 and 128 output channels. After the input passes through the five convolutional layers in turn, diversified feature maps with 128 channels each are obtained, and at the last convolutional layer the feature map of this layer, the feature maps of the first 4 layers and the input first scale feature map C1 are concatenated to obtain a feature map with 128 × 6 = 768 channels. Dimensionality reduction is then carried out through a convolutional layer with a kernel size of 1 × 1, a step size of 1, padding of 0 and 256 output channels to obtain the diversified feature map X_div, and the channel attention mechanism eSE is applied to obtain the final X_refine; X_refine also needs to be added element-wise to the input feature map to realize the identity mapping. Finally, this stage yields the second scale feature map C2, and feature map C2 still has an Output Stride of 4 with respect to the input image;
c. OSA module stage 3: this stage is similar to OSA module stage 2, except that it first performs 2-fold downsampling using a 3 × 3 max pooling layer with step size 2 and padding 0, and employs 3 × 3 modulated deformable convolution in the residual block instead of conventional convolution; the modulated deformable convolution can be defined as:
y(p) = Σ_{k=1}^{K} w_k · x(p + p_k + Δp_k) · Δm_k
where K is the total number of convolution kernel sampling locations (e.g., K is 9 for a convolution kernel size of 3 × 3), w_k is the convolution kernel weight, p_k is the predefined offset of the k-th position with respect to the center of the receptive field (e.g., when K is 9, p_k ∈ {(-1,-1), (-1,0), (-1,1), (0,-1), (0,0), (0,1), (1,-1), (1,0), (1,1)}), x(p) and y(p) are respectively the feature values of the input feature map x and the output feature map y at position p, Δp_k is the learnable offset of the k-th position, and Δm_k ∈ [0,1] is the modulation scalar of the k-th position. The modulated deformable convolution is implemented by adding, alongside the conventional convolution, an additional convolutional layer with the same spatial resolution and dilation rate as the conventional convolution to learn the offset in the x and y directions of the two-dimensional plane and the modulation scalar for each position of the feature map.
The OSA module stage 3 contains 3 residual blocks, each of which contains 5 modulated deformable convolutional layers with a convolution kernel size of 3 × 3 and 160 output channels. As in the OSA module stage 2 of step 2-1-2, at the last convolutional layer the feature map of this layer, the feature maps of the first 4 layers and the input second scale feature map C2 are concatenated to obtain a diversified feature map with 256 + 160 × 5 = 1056 channels (this is the diversified feature map of the first residual block; similarly, the number of diversified feature map channels obtained after the concatenation operations of the second and third residual blocks is 512 + 160 × 5 = 1312). The number of output channels of the conventional 1 × 1 convolutional layer with a step size of 1 and padding of 0 used for dimensionality reduction after the feature map concatenation is 512, and the channel attention mechanism is also included. After the operation of this stage, the third scale feature map C3 is finally obtained, and feature map C3 has an Output Stride of 8 with respect to the input image;
d. OSA module stage 4: this stage is similar to the OSA module stage 3 described above, except that it contains 9 residual blocks, each of which contains 5 modulated deformable convolutional layers with a convolution kernel size of 3 × 3 and 192 output channels. As in OSA module stages 2 and 3, at the last convolutional layer the feature map of this layer, the feature maps of the first 4 layers and the input third scale feature map C3 are concatenated to obtain a diversified feature map with 512 + 192 × 5 = 1472 channels (this is the diversified feature map of the first residual block; similarly, the number of diversified feature map channels obtained after the concatenation operations of the second to ninth residual blocks is 768 + 192 × 5 = 1728). The number of output channels of the conventional 1 × 1 convolutional layer with a step size of 1 and padding of 0 used for dimensionality reduction after the feature map concatenation is 768, and the channel attention mechanism is also included. The fourth scale feature map C4 is finally obtained after these convolutional layers, and feature map C4 has an Output Stride of 16 with respect to the input image;
e. OSA module stage 5: the structure of this stage is similar to stage 4, but it contains only 3 residual blocks, each of which contains 5 modulated deformable convolutional layers with a convolution kernel size of 3 × 3 and 224 output channels. As in OSA module stages 2 and 3, at the last convolutional layer the feature map of this layer, the feature maps of the first 4 layers and the input fourth scale feature map C4 are concatenated to obtain a diversified feature map with 768 + 224 × 5 = 1888 channels (this is the diversified feature map of the first residual block; similarly, the number of diversified feature map channels obtained after the concatenation operations of the second and third residual blocks is 1024 + 224 × 5 = 2144). The number of output channels of the conventional 1 × 1 convolutional layer with a step size of 1 and padding of 0 used for dimensionality reduction after the feature map concatenation is 1024, and the channel attention mechanism is also included. After this stage, the fifth scale feature map C5 is finally obtained, and feature map C5 has an Output Stride of 32 with respect to the input image;
Step 2-2, fusing the multi-scale features with a Feature Pyramid Network (FPN). The FPN combines the different scale features {C3, C4, C5} obtained in step 2-1 in a top-down manner with lateral connections to fuse the features and obtain {M3, M4, M5}. M5 is obtained by passing the feature map C5 through a 1 × 1 convolutional layer; M4 is obtained by element-level addition of the nearest-neighbor 2-fold upsampling of M5 and the feature map obtained by passing C4 through a 1 × 1 convolutional layer; M3 is obtained by element-level addition of the nearest-neighbor 2-fold upsampling of M4 and the feature map obtained by passing C3 through a 1 × 1 convolutional layer. Finally, for each layer of {M3, M4, M5}, a convolution with a 3 × 3 kernel and 256 output channels alleviates the aliasing caused by nearest-neighbor interpolation, giving the feature layers {P3, P4, P5}. In the segmentation network of this example, P6 and P7 feature layers are additionally added; they are obtained by 2-fold downsampling of P5 and P6, respectively, through a 3 × 3 convolutional layer with a step size of 2, finally yielding the feature layers {P3, P4, P5, P6, P7};
Step 2-3, constructing the instance segmentation frame prediction head; this part comprises 2 branches in total: a classification branch, and a multi-task branch in which centerness (center-ness) prediction and distance frame regression are parallel:
Step 2-3-1, constructing the classification branch: after each feature layer (i.e., P3-P7) of the FPN, three conventional convolutional layers with 256 input and output channels and a 3 × 3 convolution kernel and one modulated deformable convolutional layer are connected in sequence; these four convolutional layers do not use Batch Normalization (BN) but Group Normalization (GN), to avoid the influence of the batch size on the network model. Since only the single category of the mesoscale convection system is detected, a convolutional layer for prediction classification is added at the end of the branch, with 256 input channels, 1 output channel and a 3 × 3 convolution kernel;
Step 2-3-2, constructing the multi-task branch in which centerness and distance frame regression are parallel: after the FPN feature layer of each scale, this part likewise connects in sequence three conventional convolutional layers with the same structure as in the classification branch and one modulated deformable convolutional layer. However, these four convolutional layers are followed by two parallel convolutional branches: a 3 × 3 convolutional layer with 4 output channels for frame regression, whose four output values respectively represent the distances from the current position to the four edges of the frame; and a 3 × 3 convolutional layer with 1 output channel for predicting centerness, whose output is a one-dimensional centerness value;
Assuming that the distance d from a certain sample point (x, y) on the feature map F to the four sides of the target frame to which the point belongs is d = (l, t, r, b), where l, t, r, b respectively represent the distances to the left, upper, right and lower sides of the rectangular frame, the centerness is defined as:
centerness = sqrt( (min(l, r) / max(l, r)) × (min(t, b) / max(t, b)) )
where min () and max () are minimum and maximum taking functions, respectively. When the example segmented convolutional neural network is used for detection, the predicted centrality value is multiplied by the classification score to obtain the final confidence. The centrality branch mainly suppresses low-quality frames farther from the center point of the target object, and it can be found from the centrality definition that if the sample point (x, y) is closer to the center of the target frame, the centrality value is closer to 1, otherwise, the centrality value is closer to 0, so that the classification of the farther points is multiplied by a smaller centrality value to obtain a lower confidence, so that the low-quality frames regressed by the farther points are more easily filtered in the Non-Maximum Suppression (NMS) stage.
Step 2-4: similarly to the region proposal network RPN in anchor-based detectors such as Mask R-CNN, the frame prediction head predicts a number of frame regions, and 100 RoIs (regions of interest) are obtained after part of the regions are eliminated through non-maximum suppression with an IoU threshold of 0.6. Following the implementation principle of Mask R-CNN, the RoI head of the anchor-free instance segmentation network is constructed as follows:
Step 2-4-1, constructing the region-of-interest alignment RoI Align layer: this layer first maps each region of interest RoI, according to its size, to the corresponding FPN feature layer P_k, implemented as:
k = Ceil(k_max - log2(A_input / A_RoI))
where A_input and A_RoI respectively represent the areas of the input image and the RoI, k_max is the level number of the last layer of the backbone network and is set to 5, Ceil() is the ceiling (round-up) function, and k is the level number of the FPN feature layer to which the RoI is mapped;
then the layer divides the corresponding region on the mapped feature map into 14 × 14 = 196 small areas of equal size, samples each area adaptively by taking its center point position, computes the value by bilinear interpolation, and finally obtains a feature map of fixed size 14 × 14;
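As a minimal illustration of the above RoI-to-FPN-layer assignment, the following Python sketch computes k for a given RoI (the clamping of k to the P3-P7 range is an assumption for illustration and is not stated explicitly in this step):

import math

def assign_fpn_level(roi_area: float, input_area: float,
                     k_max: int = 5, k_min: int = 3, k_top: int = 7) -> int:
    """Map an RoI to the FPN layer k = Ceil(k_max - log2(A_input / A_RoI))."""
    k = math.ceil(k_max - math.log2(input_area / roi_area))
    return max(k_min, min(k_top, k))  # clamp to the available FPN layers (assumed P3-P7)

# Usage example: a 128 x 128 RoI inside an 800 x 1333 input image
print(assign_fpn_level(roi_area=128 * 128, input_area=800 * 1333))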
step 2-4-2, constructing a Mask head containing a spatial attention mechanism: this part contains four consecutive convolutional layers with 3 × 3 kernels and 256 input/output channels. To make the branch focus more on meaningful pixels, a spatial attention module is introduced after the fourth convolutional layer, implemented as follows:
A_sag(X_i) = σ( F_3×3( [P_max(X_i); P_avg(X_i)] ) )
X_sag = A_sag(X_i) ⊗ X_i
where X_i is the input feature map, A_sag(X_i) is the spatial attention feature descriptor, P_max and P_avg denote the feature maps obtained by maximum pooling and average pooling over the channel dimension, [ ; ] denotes the concatenation (cascade) operation, F_3×3 is a 3 × 3 convolutional layer, σ is the Sigmoid function, ⊗ is element-wise multiplication, and X_sag is the final feature map incorporating spatial attention;
then the obtained X_sag is upsampled by a deconvolution layer with a 2 × 2 kernel and stride 2 to obtain a feature map of size 28 × 28 with the same number of channels. The last convolutional layer of the Mask head is a class-specific mask prediction layer; since the detection target is the single mesoscale convection system class, this prediction layer has a 1 × 1 kernel, stride 1 and 1 output channel;
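A minimal PyTorch-style sketch of the spatial attention module described above is given below (the class name and exact module arrangement are illustrative assumptions, not the literal network definition):

import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """A_sag(X) = sigmoid(Conv3x3([maxpool_c(X); avgpool_c(X)])); X_sag = A_sag(X) * X."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        p_max = x.max(dim=1, keepdim=True).values  # max pooling over the channel dimension
        p_avg = x.mean(dim=1, keepdim=True)        # average pooling over the channel dimension
        attn = torch.sigmoid(self.conv(torch.cat([p_max, p_avg], dim=1)))
        return x * attn                            # element-wise re-weighting of the input

# Usage: re-weight a 256-channel 14 x 14 RoI feature map
x = torch.randn(1, 256, 14, 14)
print(SpatialAttention()(x).shape)  # torch.Size([1, 256, 14, 14])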
step 2-4-3, constructing a MaskIoU head, which re-scores the mask quality using Mask Scoring: first, on the basis of step 2-4-2, the output feature map of the mask prediction layer is downsampled by a factor of 2 with a 2 × 2 max pooling layer and then concatenated with the 14 × 14, 256-channel RoI feature map output by the RoI Align layer, giving a feature map of size 14 × 14 with 257 channels. The MaskIoU head contains four consecutive convolutional layers with 3 × 3 kernels and 256 output channels, where the last convolutional layer has stride 2. These are followed by 2 fully connected layers with 1024 output channels and 1 fully connected layer with 1 output channel;
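A minimal PyTorch-style sketch of such a MaskIoU head is given below (the ReLU placement and the 14 × 14 input size, which becomes 7 × 7 after the final stride-2 convolution, are assumptions for illustration):

import torch
import torch.nn as nn

class MaskIoUHead(nn.Module):
    """Predict the IoU between the predicted mask and its ground-truth mask."""
    def __init__(self, in_channels: int = 257):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_channels, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.fcs = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256 * 7 * 7, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fcs(self.convs(x))

# Usage: concatenation of the 256-channel RoI feature and the pooled mask (257 channels)
print(MaskIoUHead()(torch.randn(2, 257, 14, 14)).shape)  # torch.Size([2, 1])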
step 3, training the instance segmentation network model, specifically comprising the following steps:
step 3-1, since the infrared cloud images in the dataset are single-channel grayscale images, they are converted into RGB three-channel images for subsequent transfer learning, where the three channels share the same value, namely the original gray value. Data enhancement is then performed on the converted training set images and the corresponding instance labels: the images are first scaled to multiple scales, with the long side fixed at 1333 and the short side chosen at random from {640, 672, 704, 736, 768, 800}; the original proportions of the image are preserved, and random horizontal flipping is also used for data enhancement.
The pixel values are then centered: since training uses a model pre-trained on the ImageNet dataset, the images are normalized according to the ImageNet statistics. Normalization is performed per channel; the means of the R, G, B channels are 123.675, 116.28 and 103.53, and the standard deviations are 58.395, 57.12 and 57.375. Writing the three channel means as a vector μ and the three channel standard deviations as a vector σ, for an input image x the normalized image data x' is:
x' = (x − μ) / σ
Then, the image is padded so that its sides are multiples of 32, to avoid feature loss caused by subsequent convolution operations;
correspondingly, the instance labels of the input image are transformed with the same scaling and horizontal flipping, so that the labels remain correct after data enhancement;
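A minimal Python sketch of this preprocessing is given below; for brevity it performs single-scale resizing only (multi-scale selection and random flipping are omitted), and the nearest-neighbour resize and function name are illustrative assumptions:

import numpy as np

IMAGENET_MEAN = np.array([123.675, 116.28, 103.53], dtype=np.float32)
IMAGENET_STD = np.array([58.395, 57.12, 57.375], dtype=np.float32)

def preprocess(gray: np.ndarray, short_side: int = 800, long_side: int = 1333) -> np.ndarray:
    """Convert a single-channel IR image to a normalized, padded three-channel array."""
    rgb = np.repeat(gray[..., None], 3, axis=2).astype(np.float32)  # three identical channels
    h, w = rgb.shape[:2]
    scale = min(long_side / max(h, w), short_side / min(h, w))      # keep the original proportions
    new_h, new_w = int(round(h * scale)), int(round(w * scale))
    ys = (np.arange(new_h) / scale).astype(int).clip(0, h - 1)      # nearest-neighbour resize
    xs = (np.arange(new_w) / scale).astype(int).clip(0, w - 1)
    rgb = rgb[ys][:, xs]
    rgb = (rgb - IMAGENET_MEAN) / IMAGENET_STD                      # x' = (x - mu) / sigma
    pad_h, pad_w = -new_h % 32, -new_w % 32                         # pad each side to a multiple of 32
    return np.pad(rgb, ((0, pad_h), (0, pad_w), (0, 0)))

print(preprocess(np.random.randint(0, 255, (600, 1000), dtype=np.uint8)).shape)  # (800, 1344, 3)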
step 3-2, setting the classification loss function of the prediction frame head in the instance segmentation network: the classification task uses the Focal Loss to alleviate the class imbalance problem of one-stage detectors. Since the dataset contains only one class of labels, only a single binary classifier needs to be trained. Considering that the network treats each location as a training sample rather than an anchor box, let the predicted value obtained by the classification branch at position (x_i, y_i) of feature map F_i (i = 3, 4, ..., 7) be p_(x_i, y_i), and let Y_(x_i, y_i) indicate whether the sample is positive or negative: Y_(x_i, y_i) = 1 indicates that position (x_i, y_i) is a positive sample, Y_(x_i, y_i) = 0 indicates that position (x_i, y_i) is a negative sample, and p_(x_i, y_i) is the predicted probability that position (x_i, y_i) is a positive sample. Position (x_i, y_i) is mapped to the corresponding position (x', y') of the input image as follows:
(x', y') = ( floor(s_i / 2) + x_i · s_i , floor(s_i / 2) + y_i · s_i )
where s_i is the output stride of feature map F_i relative to the input image scale. If (x', y') falls into any Ground Truth (GT) frame, the position is a positive sample with Y_(x_i, y_i) = 1; otherwise Y_(x_i, y_i) = 0. The α-balanced focal loss can be expressed as:
FL(p_(x_i, y_i)) = −α · Y_(x_i, y_i) · (1 − p_(x_i, y_i))^γ · log(p_(x_i, y_i)) − (1 − α) · (1 − Y_(x_i, y_i)) · p_(x_i, y_i)^γ · log(1 − p_(x_i, y_i))
where α is the weighting factor, γ ≥ 0 is the adjustable focusing parameter, and α and γ are set to 0.25 and 2.0 respectively in the experiments. For feature map F_i, the overall classification loss function is:
L_cls(F_i) = Σ_(x_i, y_i) FL(p_(x_i, y_i))
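A minimal PyTorch-style sketch of this α-balanced focal loss for the single-class case is given below (the function name is hypothetical; p is assumed to be the sigmoid output and y the 0/1 label of each location):

import torch

def focal_loss(p: torch.Tensor, y: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """Alpha-balanced focal loss, summed over all locations of a feature map."""
    eps = 1e-6
    p = p.clamp(eps, 1.0 - eps)
    pos = -alpha * (1.0 - p) ** gamma * torch.log(p)        # term for positive samples (y = 1)
    neg = -(1.0 - alpha) * p ** gamma * torch.log(1.0 - p)  # term for negative samples (y = 0)
    return (y * pos + (1.0 - y) * neg).sum()

# Usage: predictions and labels for all positions of one feature map
p = torch.rand(1000)
y = (torch.rand(1000) > 0.95).float()
print(focal_loss(p, y))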
step 3-3, setting a centrality loss function: for a positive-sample feature point (x_i, y_i) in feature map F_i, let the distance frame regression target be d* = (l*, t*, r*, b*), where l*, t*, r*, b* denote the distances from feature point (x_i, y_i) to the left, upper, right and lower sides of the true frame. The centrality target of feature point (x_i, y_i) is then defined as:
ctn*_(x_i, y_i) = sqrt( (min(l*, r*) / max(l*, r*)) × (min(t*, b*) / max(t*, b*)) )
Obviously ctn*_(x_i, y_i) lies in [0, 1], so during training the centrality branch adopts a binary cross entropy loss function. Let the predicted value obtained by the centrality branch at position (x_i, y_i) of feature map F_i be ctn_(x_i, y_i); the total centrality loss function of feature map F_i is then:
L_ctn(F_i) = Σ_(x_i, y_i) 1{Y_(x_i, y_i) = 1} · BCE( ctn_(x_i, y_i), ctn*_(x_i, y_i) )
where Y_(x_i, y_i) indicates whether the position is a positive or negative sample (Y_(x_i, y_i) = 1 indicates position (x_i, y_i) is a positive sample, Y_(x_i, y_i) = 0 indicates it is a negative sample), and 1{·} is the indicator function, whose value is 1 if the condition in parentheses holds and 0 otherwise;
step 3-4, setting a distance regression loss function: the regression task adopts the GIoU loss function. Let the predicted distances be d = (l, t, r, b), where l, t, r, b denote the distances from feature point (x_i, y_i) to the left, upper, right and lower sides of the predicted frame, and let the regression target be d* = (l*, t*, r*, b*), where l*, t*, r*, b* denote the distances from feature point (x_i, y_i) to the left, upper, right and lower sides of the true frame. If position (x_i, y_i) falls into several GT frames, the frame with the smallest area is selected as the distance regression target. Regarding the distances d and d* as the corresponding bounding frames B and B*, the GIoU loss function can be expressed as:
IoU = |B ∩ B*| / |B ∪ B*|
GIoU = IoU − |C \ (B ∪ B*)| / |C|
L_GIoU(d, d*) = 1 − GIoU
where C is the smallest enclosing frame of B and B*, |·| denotes the area of a region, |C \ (B ∪ B*)| denotes the area of the part of C that does not contain B ∪ B*, IoU is the intersection-over-union of B and B*, and GIoU is the generalized intersection-over-union of B and B*. The total distance regression loss function of feature map F_i is:
L_reg(F_i) = Σ_(x_i, y_i) 1{Y_(x_i, y_i) = 1} · ctn*_(x_i, y_i) · L_GIoU(d, d*)
where Y_(x_i, y_i) and 1{·} have the same meaning as in step 3-3, and the element-level weighting coefficient of the loss function is the centrality regression target ctn*_(x_i, y_i) of position (x_i, y_i);
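A minimal Python sketch of the GIoU loss computed directly from the regressed distances and their targets is given below (the function name is hypothetical; both frames share the same sample point, which is what makes the intersection computation below valid):

def giou_loss(pred, target):
    """GIoU loss for distance-to-edge regression; pred and target are (l, t, r, b) tuples."""
    l, t, r, b = pred
    lg, tg, rg, bg = target
    area_p = (l + r) * (t + b)
    area_g = (lg + rg) * (tg + bg)
    iw = min(l, lg) + min(r, rg)          # intersection width
    ih = min(t, tg) + min(b, bg)          # intersection height
    inter = iw * ih
    union = area_p + area_g - inter
    iou = inter / union
    cw = max(l, lg) + max(r, rg)          # smallest enclosing frame C
    ch = max(t, tg) + max(b, bg)
    area_c = cw * ch
    giou = iou - (area_c - union) / area_c
    return 1.0 - giou

print(giou_loss((10, 10, 10, 10), (8, 12, 9, 11)))  # small loss for two similar frames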
step 3-5, setting a Mask loss function: the prediction frame head, before the Mask Head, produces a number of proposal frames, and at most 100 RoIs per image are obtained with a score threshold of 0.05 and a non-maximum suppression IoU threshold of 0.6. To improve convergence speed and detection performance, the GT frames are also added to the RoIs for network training. Suppose there are N RoIs in total after adding the GT frames and K GT frames; IoU values between the RoIs and the GTs are computed, the RoIs are divided into positive and negative samples according to an IoU threshold of 0.5 (positive sample label 1, negative sample label 0), a dictionary D is obtained that maps the i-th RoI (i ∈ [0, N)) to the index j of its best-matched GT frame together with the related GT information, and finally sampling is performed so that positive samples account for 1/4 of all samples, yielding the training samples X for the Mask Head.
Since the Mask loss is defined only on positive samples, foreground screening of the training samples X is also required. Suppose M positive-sample RoIs are obtained; the i-th RoI_i is passed through the region-of-interest alignment RoIAlign to yield a feature map F_i of size 14 × 14, and F_i passes through the Mask Head to give a prediction feature map pred_i of size 28 × 28. The mask target gt_mask_i is obtained by first finding the GT information (including class, frame and polygon mask) according to the RoI index in D, then cropping the frame region on the original mask according to the frame information, and finally resizing the cropped part to 28 × 28 to obtain the final gt_mask_i. L_mask is computed using an average binary cross entropy loss function:
L_mask = Σ_i (1 / (28 × 28)) Σ_(x, y) BCE( pred_i(x, y), gt_mask_i(x, y) )
where pred_i(x, y) and gt_mask_i(x, y) are the values of the feature maps pred_i and gt_mask_i at position (x, y);
step 3-6, setting a MaskIoU loss function: the predicted value obtained by passing the i-th concatenated feature map through the MaskIoU Head is pred_maskiou_i; the MaskIoU target gt_maskiou_i is obtained, following step 3-5, as the IoU between the predicted mask and the mask information of the corresponding GT. L_maskiou is computed with an l2 loss function:
L_maskiou = Σ_i ( pred_maskiou_i − gt_maskiou_i )²
step 3-7, setting a multi-task loss function: the classification loss, centrality loss, distance regression loss, Mask loss and MaskIoU loss are added to obtain the total multi-task loss of a mini-batch:
L = λ_cls · L_cls / N_cls + λ_ctn · L_ctn / N_ctn + λ_reg · L_reg / N_reg + λ_mask · L_mask / N_mask + λ_maskIoU · L_maskIoU / N_maskIoU
where N_cls, N_ctn, N_reg, N_mask and N_maskIoU are the normalization coefficients of the corresponding loss functions; N_cls = N_ctn = N_reg is the number of positive samples in the prediction frame head, and N_mask = N_maskIoU is the number of positive samples obtained from the proposals predicted by the prediction frame head according to the IoU threshold and the sampling ratio of positive to negative samples. λ_cls, λ_ctn, λ_reg, λ_mask and λ_maskIoU are the balance coefficients of each loss and are all set to 1;
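For illustration, a minimal Python sketch of this weighted combination is given below (the function and argument names are assumptions; all λ coefficients are taken as 1 as stated above):

def multitask_loss(l_cls, l_ctn, l_reg, l_mask, l_maskiou, n_pos_box, n_pos_mask):
    """Mini-batch multi-task loss; each term is normalized by its positive-sample count."""
    n_box = max(n_pos_box, 1)    # N_cls = N_ctn = N_reg
    n_mask = max(n_pos_mask, 1)  # N_mask = N_maskIoU
    return (l_cls / n_box + l_ctn / n_box + l_reg / n_box
            + l_mask / n_mask + l_maskiou / n_mask)

print(multitask_loss(12.0, 3.5, 4.2, 6.1, 0.8, n_pos_box=40, n_pos_mask=25))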
step 3-8, setting the relevant parameters for network learning and training: transfer learning is performed using weights of VoVNetV2-99 pre-trained on ImageNet. The 1000 proposals with the highest confidence are selected from the frame prediction results, and 100 proposals are retained after non-maximum suppression (NMS) with an IoU threshold of 0.6. Proposals whose IoU with a ground-truth frame exceeds 0.5 are regarded as positive samples in the Mask Head, otherwise as negative samples, and the positive samples used for training account for 1/4 of all training samples. The network model is optimized by mini-batch stochastic gradient descent with a batch size of 2 and a learning rate of 0.0025; a warmup strategy is adopted for the first 1000 iterations, and momentum with a coefficient of 0.9 is used. For regularization, a weight decay coefficient of 10^-4 is adopted. Training lasts 24 epochs in total, with a step learning-rate decay of factor 0.1 at the 16th and 22nd epochs. After the learning parameters are set, the constructed network is trained with the training data enhanced in step 3-1;
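A minimal PyTorch-style sketch of these optimization settings is given below (the stand-in model and the exact warmup implementation are assumptions; the training loop itself is omitted):

import torch

model = torch.nn.Conv2d(3, 1, 3)  # stand-in for the instance segmentation network
optimizer = torch.optim.SGD(model.parameters(), lr=0.0025,
                            momentum=0.9, weight_decay=1e-4)

def warmup_lr(iteration: int, base_lr: float = 0.0025, warmup_iters: int = 1000) -> float:
    """Linear warmup over the first 1000 iterations, then the base learning rate."""
    if iteration < warmup_iters:
        return base_lr * (iteration + 1) / warmup_iters
    return base_lr

# Step decay by a factor of 0.1 at epochs 16 and 22 of 24 total (stepped once per epoch)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[16, 22], gamma=0.1)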
step 4, predicting the segmentation result by using the trained example segmentation network, which specifically comprises the following steps:
step 4-1, performing data enhancement on the images requiring instance segmentation in essentially the same way as in step 3-1, except that only single-scale resizing is performed at inference time, i.e., the long side is 1333 and the short side is 800;
step 4-2, the trained network performs forward computation on the test images obtained in step 1-3; proposals with confidence lower than 0.05 are removed in the prediction frame head, the 1000 proposals with the highest confidence are screened out, and at most 50 RoIs are obtained through NMS and used for mask branch prediction. Finally, according to the class k predicted by the classification branch, the k-th mask is taken (the data used here contain only one class), the 28 × 28 mask is scaled to the size of the corresponding RoI, and binarization with a threshold of 0.5 generates the final mask;
step 4-3, the top-50 proposals are passed through the MaskIoU head to predict the MaskIoU, which is multiplied by the classification confidence to obtain the final mask confidence;
step 4-4, non-maximum suppression (NMS) with an IoU threshold of 0.5 is applied to the masks generated in step 4-2 using the corresponding mask confidences obtained in step 4-3, screening out the final predicted masks;
step 4-5, carrying out mesoscale convection system instance segmentation on the geostationary satellite infrared cloud images of the Jianghuai region of China at adjacent times, obtaining the mesoscale convection system instances shown in figure 5;
step 5, tracking the mesoscale convection system over a number of consecutive times according to the relevant target matching principle, while taking into account the ubiquitous splitting and merging phenomena; the specific steps are as follows:
step 5-1, obtaining the centroid coordinates, characteristic area and intensity P of each mesoscale convection system instance:
area = N
where N is the total number of pixel points contained in the mesoscale convection system instance, x_i and y_i are the x and y coordinates of the i-th pixel point, f(i) is the brightness temperature value of the i-th pixel point, and f_z is the accumulated sum of the brightness temperatures of the pixels in the instance; the centroid and intensity are computed from these quantities, and the image resolution (4 km) of this embodiment needs to be combined in subsequent calculations;
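A minimal Python sketch of these per-instance features is given below; where the text leaves the exact formulas implicit, the centroid is taken here as the plain pixel-coordinate mean and the intensity P as the mean brightness temperature f_z / N, both of which are assumptions for illustration:

import numpy as np

def instance_features(mask: np.ndarray, bt: np.ndarray):
    """Centroid, pixel area and mean brightness temperature of one MCS instance.

    mask: boolean array marking the instance pixels; bt: brightness temperatures, same shape.
    """
    ys, xs = np.nonzero(mask)
    n = len(xs)                         # area = N (pixel count)
    f_z = float(bt[ys, xs].sum())       # accumulated brightness temperature
    centroid = (xs.mean(), ys.mean())   # assumed unweighted centroid
    intensity = f_z / n                 # assumed P = f_z / N
    return centroid, n, intensity

mask = np.zeros((100, 100), dtype=bool)
mask[40:60, 30:70] = True
bt = np.full((100, 100), 210.0)
print(instance_features(mask, bt))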
step 5-2, in temporal order, the duration, centroid position change, area change and intensity change of the mesoscale convection systems at two adjacent times are computed. If, for two mesoscale convection systems at adjacent times, the centroid position change is no more than 50 m/s, the area change is no more than 5 km², and the intensity change is no more than 0.001 °C/s, they can be preliminarily judged to be the same target;
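A minimal Python sketch of this matching test is given below (the 4 km pixel spacing and the 30-minute interval follow the data description; treating the area threshold as a total change over the interval, and the function name, are assumptions for illustration):

import math

def is_same_target(c1, c2, area1, area2, p1, p2,
                   pixel_km: float = 4.0, dt_s: float = 1800.0) -> bool:
    """Preliminary same-target test between two MCS instances at adjacent times."""
    dist_m = math.hypot(c2[0] - c1[0], c2[1] - c1[1]) * pixel_km * 1000.0
    speed_ok = dist_m / dt_s <= 50.0                        # centroid change <= 50 m/s
    area_ok = abs(area2 - area1) * pixel_km ** 2 <= 5.0     # area change <= 5 km^2 (assumed total)
    intensity_ok = abs(p2 - p1) / dt_s <= 0.001             # intensity change <= 0.001 degC/s
    return speed_ok and area_ok and intensity_ok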
step 5-3, the splitting and merging phenomena that commonly occur in mesoscale convection systems need corresponding processing. If several mesoscale convection systems (MCSs) at a certain time, denoted X_1, X_2, ..., X_n, all satisfy the tracking matching principle of step 5-2 with one MCS, denoted P_j, recorded at the previous time, the phenomenon belongs to the splitting case: the X_m (1 ≤ m ≤ n) with the largest area after splitting continues the track of P_j, the previous-time index value of X_m is updated to j and its duration to the duration of P_j plus 1, while the other MCSs are regarded as newly appearing clouds whose previous-time index value is empty and whose duration is initialized to 1;
if several MCSs recorded at the previous time, denoted P_1, P_2, ..., P_z, all satisfy the tracking matching principle with a certain MCS P_a at the current time, the phenomenon belongs to the merging case: the P_m (1 ≤ m ≤ z) with the largest area at the previous time is selected as the previous-time track of P_a, the previous-time index value of P_a is updated to m and its duration to the duration of P_m plus 1. The life cycles of the MCSs at the previous time whose area is not the largest end there; it is then judged whether their duration is at least 1 h, and if so, the corresponding MCS path is recovered in reverse order according to the previous-time index values, thereby realizing the tracking of the MCS.
In the tracking result shown in figure 6, the interval between adjacent times is half an hour. Three clouds are recognized at the first time and, similarly, three clouds are recognized at the second time. The characteristic quantities of the clouds at the two adjacent times (centroid displacement, area change and intensity change) are computed and analysed, and the clouds at the second time that satisfy the target matching principle with clouds at the first time are tracked and marked. As can be seen from the second time in figure 6, cloud 3 identified at the first time has no cloud at the second time that meets the target matching principle, so cloud 3 disappears. Clouds 1 and 2 at the first time each have a cloud at the second time that meets the target matching principle, so the corresponding clouds at the second time are marked with the same numbers 1 and 2 in figure 6. Cloud 4 at the second time is a newly appearing cloud at that moment. From the tracked clouds identified at the first and second times, the tracked target clouds at the third and fourth times can be obtained, as shown at the third and fourth times in figure 6. Finally, according to the requirement that the duration be at least 1 h, the MCSs meeting the requirement are screened out; only cloud 2 in figure 6 is a correct MCS, appearing at the first, second and third times and lasting one hour (satisfying the requirement of at least 1 h);
in this embodiment, mesoscale convection system instance segmentation is performed on the publicly available geostationary satellite infrared brightness temperature data provided by the U.S. National Oceanic and Atmospheric Administration. The experimental configuration environment is shown in Table 1, and the experimental results shown in Table 2 are obtained by comparison, under the same configuration environment, with other currently mainstream deep learning instance segmentation methods such as Mask R-CNN, Mask Scoring R-CNN, Cascade Mask R-CNN and HTC; the evaluation adopts part of the COCO dataset standard (i.e., box AP, box AR, mask AP, mask AR and inference time).
Table 1 Experimental configuration Environment
Table 2 example segmentation comparative experiment results
Compared with the existing mainstream instance segmentation methods, the proposed method has clear advantages in accuracy and recall while keeping a detection speed similar to Mask R-CNN without extra complicated computational overhead, which effectively demonstrates its high detection precision and high speed.
The present invention provides a method for identifying and tracking a mesoscale convection system based on anchor-free image detection; there are many methods and ways to implement this technical scheme, and the above description is only a preferred embodiment of the present invention. It should be noted that, for those skilled in the art, a number of improvements and embellishments can be made without departing from the principle of the present invention, and these improvements and embellishments should also be regarded as falling within the protection scope of the present invention. All components not specified in this embodiment can be realized by the prior art.