Disclosure of Invention
The purpose of the invention is as follows: the invention aims to solve the technical problem that traditional identification methods for the mesoscale convection system (MCS) in the meteorological field rely on varied, subjective judgment criteria, and therefore provides an identification and tracking method for the mesoscale convection system based on anchor-free image detection. Considering that the shape and scale of an MCS vary greatly, the current mainstream instance segmentation methods based on heuristically preset anchor frames perform poorly in this application; the invention therefore directly predicts, in a pixel-by-pixel fully convolutional manner, the category of each position in the feature map, the distances to the four edges of the corresponding frame, and the generated Mask, and on this basis the MCS is tracked continuously and effectively over multiple time steps.
The method specifically comprises the following steps:
step 1, preprocessing an original infrared brightness temperature data file according to the relevant description of satellite data: cutting original satellite data, marking image polygon examples, and randomly dividing a training set, a verification set and a test set;
step 2, constructing an anchor-free mesoscale convection system instance segmentation convolutional neural network: the network is divided into a backbone network for feature extraction, a feature pyramid network for fusing multi-scale features, a prediction frame network head for classification, distance frame regression and centerness (center-ness) prediction, a Mask network head for generating a Mask, and a network head for predicting the Mask IoU;
step 3, performing multi-scale image enhancement on the training set, and automatically learning network parameters by using the mesoscale convective system example segmentation convolutional neural network constructed in the step 2 of migration learning supervision training: training by adopting a small batch random gradient descent method, and setting loss functions of three network heads, namely a prediction frame, a generated Mask and a prediction Mask IoU;
step 4, carrying out mesoscale convection system example segmentation on the stationary satellite infrared cloud pictures at adjacent moments by using the trained network to obtain mesoscale convection system related records at continuous moments;
step 5, on the basis of the step 4, tracking of the mesoscale convection system is realized according to a related target matching principle, and meanwhile, the commonly occurring splitting and merging conditions are considered;
the method comprises the following steps of 1:
the data set used in the present invention is full resolution (4km) light and temperature data (60N-60S) for monitoring global precipitation obtained from 11 micron infrared channels on geostationary satellites GMS-5, GOES-8, Goes-10, Metasat-7, and Metasat-5, as provided by the National Oceanic and Atmospheric Administration, NOAA. In order to reduce discontinuities between adjacent geostationary satellites, the acquired data has been corner angle correlation corrected and finally stored in the form of a rectangular grid. Each infrared bright temperature data file name format is merg _ yyymddhh _4km-pixel, where yyyy represents year (e.g. 2020), mm represents month (range 01-12), dd represents date (range 01-31), and hh represents hour (range 00-23). Each file contains 2 records of light temperature data corresponding to hours 0 and 30, respectively, each record being 9896 × 3298 in size (9896 dimensions span 0.036378335 and 3298 dimensions span 0.036383683). Meanwhile, in order to store data with the size of one byte (8 bits with the range size of 0-255), the data obtained is to subtract 75 on the basis of the real monitored brightness temperature value, and 255 (corresponding to the real brightness temperature value 330) in the obtained data is an absent value.
Step 1-1, reading an infrared brightness temperature data file of each original geostationary satellite according to the related description of satellite data, wherein the file is an array in a format of 2 x 3298 x 9896, and obtaining two gray cloud pictures of 0 minute and 30 minutes at corresponding time;
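For illustration, the following is a minimal NumPy sketch of step 1-1. It assumes the merg file stores the two records as raw unsigned bytes in row-major order and that the stored value is the real brightness temperature minus 75, with 255 marking a missing value; the file name and helper function are hypothetical.

import numpy as np

def read_merg_file(path: str) -> np.ndarray:
    """Return an array of shape (2, 3298, 9896) of real brightness temperatures."""
    raw = np.fromfile(path, dtype=np.uint8)
    records = raw.reshape(2, 3298, 9896).astype(np.float32)
    # Stored value = real brightness temperature - 75; 255 marks a missing value.
    bt = records + 75.0
    bt[records == 255] = np.nan
    return bt

bt = read_merg_file("merg_2020070112_4km-pixel")  # hypothetical file name
print(bt.shape)  # (2, 3298, 9896): the gray cloud images at minute 0 and minute 30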
In step 1-2, since the remote sensing satellite image has a size of 3298 × 9896, which is too large to be fed directly into the subsequent instance segmentation network, the gray cloud image obtained in step 1-1 is cropped. According to the latitude range 27N to 40N and the longitude range 110E to 125E of the Jianghuai region of China, combined with the relevant description of the brightness temperature data, the size of the Jianghuai region of China in the image is obtained as follows:
the result is slightly enlarged, and each gray cloud image is cut into multiple (generally 240) gray infrared cloud sub-images with a size of 420 × 360;
Step 1-3, polygon instance-level labels of the mesoscale convection system are given for each gray infrared cloud sub-image, sub-images containing no mesoscale convection system are filtered out, and a json file corresponding to each gray infrared cloud sub-image is obtained, with only one category in all labels. The gray infrared cloud sub-images are then randomly divided at a ratio of 6:2:2 into 9407 training images, 3135 verification images and 3136 test images, thereby constituting the training set, verification set and test set.
The step 2 comprises the following steps:
Step 2-1, constructing a backbone network for feature extraction: the backbone network in the segmentation network of the embodiment of the invention adopts the convolutional neural network VoVNetV2-99 (reference: An Energy and GPU-Computation Efficient Backbone Network for Real-Time Object Detection) with a total of 99 layers. The convolutional layers in the convolutional neural network VoVNetV2-99 are all in the form of Conv-BN-ReLU, i.e. a convolutional layer followed in turn by a batch normalization layer BN and a linear rectification function ReLU. Meanwhile, cross-layer connections similar to those of the residual network ResNet are adopted in the convolutional neural network VoVNetV2-99 to realize identity mapping, and a residual block containing the identity mapping is defined as:
Y = F(X, {W_i}) + X
where X and Y respectively represent the input and output feature maps of each building block, and F(X, {W_i}) is the residual mapping function that needs to be learned. Meanwhile, in order to improve the quality of the feature representation, a channel attention eSE (Effective Squeeze-Excitation) mechanism (reference: CenterMask: Real-Time Anchor-Free Instance Segmentation) is also introduced into the network residual block, so that the network focuses more on important feature map channels and suppresses irrelevant channels; the specific implementation is as follows:
A_eSE(X_div) = σ(W_C(gap(X_div)))
where X_div is the diversified feature map obtained by dimensionality reduction after the concatenation of the network feature maps, gap(X) = (1/(W × H)) Σ_{i,j} X_{i,j} (W and H are respectively the width and height of feature map X) is the channel-level global average pooling, X_{i,j} is the feature value of feature map X at (i, j), W_C is the fully connected layer weight, σ denotes the Sigmoid activation function, and A_eSE(X_div) is the computed channel attention feature descriptor; element-level multiplication (denoted by the symbol ⊗) of this descriptor with X_div finally yields the refined feature map X_refine, i.e. X_refine = A_eSE(X_div) ⊗ X_div.
The specific construction steps of the convolutional neural network VoVNetV2-99 comprise:
Step 2-1-1, constructing the network Stem stage 1: this stage contains three convolutional layers: first, a convolutional layer with a convolution kernel size of 3 × 3, a step size of 2, padding of 1 and 64 output channels (unless otherwise stated, a convolutional layer with a 3 × 3 kernel defaults to a step size of 1 and padding of 1) is used to downsample the input image; it is followed by a convolutional layer with a 3 × 3 kernel and 64 output channels and a convolutional layer with a 3 × 3 kernel, a step size of 2 and 128 output channels. After the input image passes through this stage, the first scale feature map C1 is generated, and feature map C1 has an Output Stride of 4 with respect to the input image;
Step 2-1-2, constructing the network One-Shot Aggregation (OSA) module stage 2: this stage contains one residual block, which contains 5 convolutional layers with a convolution kernel size of 3 × 3 and 128 output channels. After the input passes through the 5 convolutional layers in turn, diversified feature maps with 128 channels each are obtained; at the last convolutional layer, the feature map of this layer, the feature maps of the first 4 layers and the input first scale feature map C1 are concatenated to obtain a feature map with 128 × 6 = 768 channels. Dimensionality reduction is then performed through a convolutional layer with a kernel size of 1 × 1, a step size of 1, padding of 0 and 256 output channels to obtain the diversified feature map X_div, and the channel attention eSE mechanism is applied to obtain the final X_refine; X_refine also needs to be added element-wise to the input feature map to realize the identity mapping. Finally, this stage obtains the second scale feature map C2 with 256 channels, and feature map C2 has an Output Stride of 4 with respect to the input image;
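As an illustration of the structure described in step 2-1-2 (concatenation of successive 3 × 3 convolutions, 1 × 1 dimensionality reduction, eSE channel attention and identity mapping), the following is a minimal PyTorch sketch; layer sizes follow the text, but the module and variable names are our own and details such as weight initialization are omitted.

import torch
import torch.nn as nn

class ESEAttention(nn.Module):
    """Effective Squeeze-Excitation: global average pooling -> 1x1 conv (fully connected) -> sigmoid -> scale."""
    def __init__(self, channels: int):
        super().__init__()
        self.fc = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        w = torch.sigmoid(self.fc(x.mean(dim=(2, 3), keepdim=True)))  # A_eSE(X_div)
        return x * w                                                   # X_refine

class OSABlock(nn.Module):
    """One-Shot Aggregation block of stage 2: 5 conv layers, concatenation, 1x1 reduce, eSE, residual."""
    def __init__(self, in_ch=128, stage_ch=128, out_ch=256, num_conv=5):
        super().__init__()
        self.convs = nn.ModuleList()
        ch = in_ch
        for _ in range(num_conv):
            self.convs.append(nn.Sequential(
                nn.Conv2d(ch, stage_ch, 3, padding=1), nn.BatchNorm2d(stage_ch), nn.ReLU(inplace=True)))
            ch = stage_ch
        concat_ch = in_ch + stage_ch * num_conv            # 128 + 128*5 = 768 for stage 2
        self.reduce = nn.Sequential(
            nn.Conv2d(concat_ch, out_ch, 1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.ese = ESEAttention(out_ch)
        self.identity = in_ch == out_ch                     # residual addition only when shapes match

    def forward(self, x):
        feats = [x]
        y = x
        for conv in self.convs:
            y = conv(y)
            feats.append(y)
        out = self.ese(self.reduce(torch.cat(feats, dim=1)))  # X_div -> X_refine
        return out + x if self.identity else out

c2 = OSABlock()(torch.randn(1, 128, 105, 90))  # e.g. a 420x360 input after the stride-4 stem
print(c2.shape)  # torch.Size([1, 256, 105, 90])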
Step 2-1-3, constructing the network one-shot aggregation module stage 3: this stage is similar to OSA module stage 2, except that it first performs 2-fold downsampling using a 3 × 3 max pooling layer with step size 2 and padding 0, and employs 3 × 3 modulated deformable convolution (reference: Deformable ConvNets v2: More Deformable, Better Results) in the residual block instead of regular convolution. The modulated deformable convolution is defined as:
y(p) = Σ_{k=1}^{K} w_k · x(p + p_k + Δp_k) · Δm_k
where K is the total number of convolution kernel sampling locations (e.g., K is 9 for a convolution kernel size of 3 × 3), w_k is the convolution kernel weight, p_k is the predefined offset of the k-th position with respect to the center of the receptive field (e.g., when K is 9, p_k ∈ {(-1,-1), (-1,0), (-1,1), (0,-1), (0,0), (0,1), (1,-1), (1,0), (1,1)}), x(p) and y(p) are respectively the feature value of the input feature map x at position p and the feature value of the output feature map y at position p, Δp_k is the learnable offset of the k-th position, and Δm_k ∈ [0,1] is the modulation scalar of the k-th position. The modulated deformable convolution is implemented by adding, alongside the conventional convolution, an additional convolutional layer with the same spatial resolution and dilation rate as the conventional convolution to learn the offset in the x and y directions of the two-dimensional plane and the modulation scalar for each position of the feature map.
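The modulated deformable convolution described above corresponds to the operator available in torchvision; a minimal sketch is shown below, assuming torchvision ≥ 0.9 (which accepts the modulation mask). The offset/mask layer and the sigmoid on the mask follow the Deformable ConvNets v2 paper, while the class and variable names are our own.

import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class ModulatedDeformConv3x3(nn.Module):
    """3x3 modulated deformable conv: a parallel conv predicts offsets (2K values) and modulation (K values)."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        k = 9  # K sampling locations for a 3x3 kernel
        # Extra conv layer with the same resolution/dilation learns offsets and modulation scalars.
        self.offset_mask = nn.Conv2d(in_ch, 3 * k, kernel_size=3, padding=1)
        self.deform = DeformConv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x):
        om = self.offset_mask(x)
        offset, mask = om[:, :18], torch.sigmoid(om[:, 18:])  # Δp_k (x, y pairs) and Δm_k in [0, 1]
        return self.deform(x, offset, mask)

y = ModulatedDeformConv3x3(160, 160)(torch.randn(1, 160, 52, 45))
print(y.shape)  # torch.Size([1, 160, 52, 45])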
The OSA module stage 3 contains 3 residual blocks, each of which contains 5 modulated deformable convolutional layers with a convolution kernel size of 3 × 3 and 160 output channels. As in the network one-shot aggregation module stage 2 of step 2-1-2, at the last convolutional layer the feature map of this layer, the feature maps of the first 4 layers and the input second scale feature map C2 are concatenated to obtain a diversified feature map with 256 + 160 × 5 = 1056 channels (this is the diversified feature map of the first residual block; similarly, the number of diversified feature map channels obtained after the concatenation operations of the second and third residual blocks is 512 + 160 × 5 = 1312). The number of output channels of the conventional 1 × 1 convolutional layer with a step size of 1 and padding of 0 used for dimensionality reduction after the feature map concatenation is 512, and the channel attention mechanism is also included. After the operation of OSA module stage 3, the third scale feature map C3 with 512 channels is finally obtained, and feature map C3 has an Output Stride of 8 with respect to the input image;
Step 2-1-4, constructing the network one-shot aggregation module stage 4: this stage is similar to the OSA module stage 3 described above, except that it contains 9 residual blocks, each containing 5 modulated deformable convolutional layers with a convolution kernel size of 3 × 3 and 192 output channels. As in the network one-shot aggregation module stages 2 and 3, at the last convolutional layer the feature map of this layer, the feature maps of the first 4 layers and the input third scale feature map C3 are concatenated to obtain a diversified feature map with 512 + 192 × 5 = 1472 channels (this is the diversified feature map of the first residual block; similarly, the number of diversified feature map channels obtained after the concatenation operations of the second to ninth residual blocks is 768 + 192 × 5 = 1728). The number of output channels of the conventional 1 × 1 convolutional layer with a step size of 1 and padding of 0 used for dimensionality reduction after the feature map concatenation is 768, and the channel attention mechanism is also included. The fourth scale feature map C4 is finally obtained after the network one-shot aggregation module stage 4, and feature map C4 has an Output Stride of 16 with respect to the input image;
Step 2-1-5, constructing the last structure of the backbone network, the one-shot aggregation module stage 5: the structure of this stage is similar to stage 4, but it contains only 3 residual blocks, each of which contains 5 modulated deformable convolutional layers with a convolution kernel size of 3 × 3 and 224 output channels. As in the network one-shot aggregation module stages 2 and 3, at the last convolutional layer the feature map of this layer, the feature maps of the first 4 layers and the input fourth scale feature map C4 are concatenated to obtain a diversified feature map with 768 + 224 × 5 = 1888 channels (this is the diversified feature map of the first residual block; similarly, the number of diversified feature map channels obtained after the concatenation of the second and third residual blocks is 1024 + 224 × 5 = 2144). The number of output channels of the conventional 1 × 1 convolutional layer with a step size of 1 and padding of 0 used for dimensionality reduction after the feature map concatenation is 1024, and the channel attention mechanism is also included. After this stage, the fifth scale feature map C5 is finally obtained, and feature map C5 has an Output Stride of 32 with respect to the input image;
Step 2-2, fusing the multi-scale features with a Feature Pyramid Network (FPN) (reference: Feature Pyramid Networks for Object Detection). The feature pyramid network FPN combines the features {C3, C4, C5} of different scales obtained in step 2-1 in a top-down manner with lateral connections to fuse the features and obtain {M3, M4, M5}. M5 is obtained by passing the feature map C5 through a 1 × 1 convolutional layer; M4 is obtained by element-level addition of the nearest-neighbor 2-fold upsampling of M5 and the feature map obtained by passing C4 through a 1 × 1 convolutional layer; M3 is obtained by element-level addition of the nearest-neighbor 2-fold upsampling of M4 and the feature map obtained by passing C3 through a 1 × 1 convolutional layer.
Finally, for each layer in the { M3, M4 and M5}, the aliasing influence caused by nearest neighbor interpolation is relieved through convolution with the convolution kernel size of 3 multiplied by 3 and the number of output channels of 256, and the feature layer { P3, P4 and P5} is obtained.
A P6 feature layer and a P7 feature layer are additionally added to the instance segmentation network; they are obtained by 2-fold downsampling of P5 and P6, respectively, through a 3 × 3 convolutional layer with a step size of 2, finally yielding the feature layers {P3, P4, P5, P6, P7};
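As an illustration of step 2-2 (1 × 1 lateral connections, top-down nearest-neighbor upsampling, 3 × 3 smoothing, and the extra P6/P7 layers), here is a minimal PyTorch sketch; the channel counts for C3-C5 follow the backbone description above, and the class name is our own.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FPN(nn.Module):
    """Builds P3-P7 from backbone features C3 (512 ch), C4 (768 ch), C5 (1024 ch)."""
    def __init__(self, in_channels=(512, 768, 1024), out_ch=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_ch, 1) for c in in_channels)   # 1x1 lateral convs
        self.smooth = nn.ModuleList(nn.Conv2d(out_ch, out_ch, 3, padding=1) for _ in in_channels)
        self.p6 = nn.Conv2d(out_ch, out_ch, 3, stride=2, padding=1)                  # downsample P5 -> P6
        self.p7 = nn.Conv2d(out_ch, out_ch, 3, stride=2, padding=1)                  # downsample P6 -> P7

    def forward(self, c3, c4, c5):
        m5 = self.lateral[2](c5)
        m4 = self.lateral[1](c4) + F.interpolate(m5, scale_factor=2, mode="nearest")
        m3 = self.lateral[0](c3) + F.interpolate(m4, scale_factor=2, mode="nearest")
        p3, p4, p5 = (s(m) for s, m in zip(self.smooth, (m3, m4, m5)))  # 3x3 convs reduce aliasing
        p6 = self.p6(p5)
        p7 = self.p7(p6)
        return p3, p4, p5, p6, p7

fpn = FPN()
outs = fpn(torch.randn(1, 512, 56, 48), torch.randn(1, 768, 28, 24), torch.randn(1, 1024, 14, 12))
print([o.shape[-2:] for o in outs])  # strides 8, 16, 32, 64, 128 relative to the input image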
Step 2-3, constructing the instance segmentation frame prediction head, which comprises 2 branches: a classification branch, and a multi-task branch in which centerness (reference: FCOS: Fully Convolutional One-Stage Object Detection) prediction and distance frame regression are parallel. The construction specifically comprises the following steps:
Step 2-3-1, constructing the classification branch: after each feature layer (i.e., P3-P7) of the feature pyramid FPN, three conventional convolutional layers with 256 input and output channels and a 3 × 3 convolution kernel and one modulated deformable convolutional layer are connected in sequence; these four convolutional layers do not use Batch Normalization (BN) but Group Normalization (GN), to avoid the influence of the batch size on the network model. Since only the single category of the mesoscale convection system is detected, a convolutional layer for prediction classification is added at the end of the classification branch, with 256 input channels, 1 output channel and a 3 × 3 convolution kernel;
Step 2-3-2, constructing the multi-task branch in which centerness and distance frame regression are parallel: after the feature pyramid network FPN feature layer of each scale, this part likewise connects in sequence three conventional convolutional layers with the same structure as in the classification branch and one modulated deformable convolutional layer. However, these four convolutional layers are followed by two parallel convolutional branches: a 3 × 3 convolutional layer with 4 output channels for frame regression, whose four output values respectively represent the distances from the current position to the four edges of the frame; and a 3 × 3 convolutional layer with 1 output channel for predicting centerness, whose output is a one-dimensional centerness value;
Assuming that the distance d from a certain sample point (x, y) on the feature map F to the four sides of the target frame to which the point belongs is d = (l, t, r, b), where l, t, r, b respectively represent the distances from the sample point (x, y) to the left, upper, right and lower sides of the rectangular frame, the centerness is defined as:
centerness = sqrt( (min(l, r) / max(l, r)) × (min(t, b) / max(t, b)) )
where min () and max () are minimum and maximum taking functions, respectively. When the example segmented convolutional neural network is used for detection, the predicted centrality value is multiplied by the classification score to obtain the final confidence. The centrality branch mainly suppresses low-quality frames farther from the center point of the target object, and it can be found from the centrality definition that if the sample point (x, y) is closer to the center of the target frame, the centrality value is closer to 1, otherwise, the centrality value is closer to 0, so that the classification of the farther points is multiplied by a smaller centrality value to obtain a lower confidence, so that the low-quality frames regressed by the farther points are more easily filtered in the Non-Maximum Suppression (NMS) stage.
Step 2-4: similarly to the region proposal network RPN in anchor-based detectors such as Mask R-CNN, the frame prediction head predicts a number of frame regions, and 100 regions of interest (RoI) are obtained after part of the regions are eliminated through non-maximum suppression with an IoU threshold of 0.6. Following the implementation principle of Mask R-CNN, the RoI head of the anchor-free instance segmentation network is constructed as follows:
Step 2-4-1, constructing the region-of-interest alignment RoI Align layer (reference: Mask R-CNN): the RoI Align layer first maps each region of interest RoI, according to its size, to the corresponding feature pyramid network FPN feature layer P_k, implemented as:
k = Ceil(k_max - log2(A_input / A_RoI))
where A_input and A_RoI respectively represent the area of the input image and the area of the RoI, k_max is the level number of the last layer of the backbone network and is set to 5, Ceil() is the ceiling (round-up) function, and k is the level number of the feature pyramid network FPN feature layer to which the RoI is mapped;
the RoI Align layer then divides the corresponding region on the mapped feature map into 14 × 14 = 196 small areas of equal size, samples each area adaptively by taking the center point of each small area and computing its value with bilinear interpolation, and finally obtains a feature map with a fixed size of 14 × 14;
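A minimal sketch of the RoI-to-FPN-level assignment and the RoIAlign pooling described above, using torchvision's roi_align operator; k_max = 5 and the 14 × 14 output size follow the text, while the function name and the example box are our own.

import math
import torch
from torchvision.ops import roi_align

def fpn_level(box, input_area, k_min=3, k_max=5):
    """Assign an RoI (x1, y1, x2, y2 in input-image coordinates) to an FPN level P_k."""
    x1, y1, x2, y2 = box
    roi_area = max((x2 - x1) * (y2 - y1), 1e-6)
    k = math.ceil(k_max - math.log2(input_area / roi_area))
    return min(max(k, k_min), k_max)            # clamp to the available levels P3-P5

# Example: pool a 14x14 feature from level P3 (stride 8) for one box.
p3 = torch.randn(1, 256, 56, 48)                     # hypothetical P3 feature map of a 448x384 input
boxes = torch.tensor([[0., 40., 32., 180., 140.]])   # (batch_idx, x1, y1, x2, y2) in image coordinates
roi_feat = roi_align(p3, boxes, output_size=(14, 14), spatial_scale=1.0 / 8, sampling_ratio=1)
print(fpn_level((40., 32., 180., 140.), input_area=448 * 384), roi_feat.shape)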
step 2-4-2, constructing a Mask head comprising a spatial attention mechanism: the Mask header contains four convolutional layers with a convolutional kernel size of 3 × 3 and 256 input/output channels. In order to make the branch focus more on the meaningful pixel points, a spatial attention module is introduced after the fourth convolution layer, and the implementation mechanism is as follows:
A_sag(X_i) = σ(F_3×3(P_max(X_i) ⊕ P_avg(X_i)))
X_sag = A_sag(X_i) ⊗ X_i
where X_i is the input feature map, A_sag(X_i) is the spatial attention feature descriptor, P_max and P_avg represent the feature maps obtained by max pooling and average pooling along the channel dimension, ⊕ represents the concatenation operation, F_3×3 is a 3 × 3 convolutional layer, σ is the Sigmoid function, ⊗ is element-level multiplication, and X_sag is the feature map finally combined with spatial attention;
The resulting X_sag is then upsampled by a deconvolution layer with a convolution kernel size of 2 × 2 and a step size of 2 to obtain a feature map of size 28 × 28 with the same number of channels. The last convolutional layer of the Mask head is a per-category Mask prediction layer; since the detection target is the single category of the mesoscale convection system, the Mask prediction layer has a convolution kernel size of 1 × 1, a step size of 1 and 1 output channel;
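The spatial attention module and the tail of the mask head described in step 2-4-2 can be sketched as follows in PyTorch (channel-wise max/average pooling, concatenation, 3 × 3 convolution, sigmoid gate, then a 2 × 2 stride-2 deconvolution and a 1 × 1 mask prediction layer); names are our own, the ReLU after the deconvolution is an assumption, and the preceding four 3 × 3 convolutions are omitted for brevity.

import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """A_sag(X) = sigmoid(Conv3x3([maxpool_c(X); avgpool_c(X)])); output X_sag = A_sag(X) * X."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=3, padding=1)

    def forward(self, x):
        pooled = torch.cat([x.max(dim=1, keepdim=True).values, x.mean(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.conv(pooled))

class MaskHeadTail(nn.Module):
    """Spatial attention, 2x2 stride-2 deconv to 28x28, then the 1x1 single-class mask predictor."""
    def __init__(self, channels=256):
        super().__init__()
        self.sam = SpatialAttention()
        self.deconv = nn.ConvTranspose2d(channels, channels, kernel_size=2, stride=2)
        self.predict = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, x):                    # x: RoI feature of shape (N, 256, 14, 14)
        return self.predict(torch.relu(self.deconv(self.sam(x))))

mask_logits = MaskHeadTail()(torch.randn(8, 256, 14, 14))
print(mask_logits.shape)  # torch.Size([8, 1, 28, 28])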
Step 2-4-3, constructing the maskIoU head, which uses Mask Scoring (reference: Mask Scoring R-CNN) to re-express the Mask quality: on the basis of step 2-4-2, the output feature map of the Mask prediction layer is first 2-fold downsampled with a 2 × 2 max pooling layer and then concatenated with the 14 × 14, 256-channel RoI feature map output by the RoI Align layer, obtaining a feature map of size 14 × 14 with 257 channels.
The maskIoU header contains four successive convolutional layers with convolutional kernel size of 3 × 3 and output channel number of 256, where the step size of the last convolutional layer is 2. Then, 2 full-connection layers with the output channel number of 1024 and 1 full-connection layer with the output channel number of 1 are also connected;
the step 3 of the invention comprises:
step 3-1, because the infrared cloud atlas of the data set is a single-channel gray image, the gray image in the training set is converted into three channel images of RGB (red, green and blue) for subsequent transfer learning, the values of the three channels of RGB are the same and are the gray values of the gray image, and data enhancement is performed on the converted images in the training set and corresponding example labels: the image is first scaled to multiple scales, with the long side at 1333 and the short side at a random one of {640,672,704,736,768,800}, and the original scale of the image is preserved while also data enhancement is performed using random horizontal flipping.
The pixel values are then centered: since training uses a model pre-trained on the ImageNet dataset, the images need to be normalized according to the ImageNet statistics. The normalization is carried out per channel; the means of the R, G, B channels are 123.675, 116.28 and 103.53 respectively, and the standard deviations are 58.395, 57.12 and 57.375 respectively. The three RGB channel means are combined into a vector μ and the three RGB channel standard deviations into a vector σ; with the input image denoted x, the normalized image data x' is
x' = (x - μ) / σ
The image size is then padded to a multiple of 32 to avoid feature loss in the subsequent convolution operations;
correspondingly, the instance labels of the input image undergo the same transformations as the scaling and horizontal flipping data enhancement, so that correct labels are obtained;
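A minimal sketch of the preprocessing in step 3-1 (gray-to-RGB conversion, ImageNet mean/std normalization, and padding to a multiple of 32); the function name is our own and the data augmentation (multi-scale resizing, random flipping) is omitted.

import numpy as np

IMAGENET_MEAN = np.array([123.675, 116.28, 103.53], dtype=np.float32)
IMAGENET_STD = np.array([58.395, 57.12, 57.375], dtype=np.float32)

def preprocess(gray: np.ndarray) -> np.ndarray:
    """gray: (H, W) uint8 cloud image -> normalized, padded (H', W', 3) float32 array."""
    rgb = np.repeat(gray[:, :, None], 3, axis=2).astype(np.float32)  # same value in R, G, B
    x = (rgb - IMAGENET_MEAN) / IMAGENET_STD                          # x' = (x - mu) / sigma
    pad_h = (32 - x.shape[0] % 32) % 32                               # pad to a multiple of 32
    pad_w = (32 - x.shape[1] % 32) % 32
    return np.pad(x, ((0, pad_h), (0, pad_w), (0, 0)))

out = preprocess(np.random.randint(0, 255, (360, 420), dtype=np.uint8))
print(out.shape)  # (384, 448, 3)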
Step 3-2, setting the classification loss function of the prediction frame head in the instance segmentation network: the classification task uses the Focal Loss function to alleviate the class imbalance problem in one-stage detectors. Since the data set contains only one class of labels, only a single binary classifier needs to be trained. Since the network treats each location as a training sample rather than an anchor box, let the predicted value obtained by the classification branch at position (x_i, y_i) of feature map F_i (i = 3, 4, ..., 7) be p(x_i, y_i), and let the label c*(x_i, y_i) ∈ {0, 1} indicate whether the position is a positive or negative sample: c*(x_i, y_i) = 1 indicates that position (x_i, y_i) is a positive sample, c*(x_i, y_i) = 0 indicates that position (x_i, y_i) is a negative sample, and p(x_i, y_i) is the predicted probability that position (x_i, y_i) is a positive sample;
Position (x_i, y_i) is mapped back to the corresponding position (x', y') on the input image (following the FCOS convention) as:
(x', y') = (floor(s_i / 2) + x_i · s_i, floor(s_i / 2) + y_i · s_i)
where s_i is the Output Stride of feature map F_i with respect to the input image scale. If (x', y') falls into any Ground Truth (GT) frame, c*(x_i, y_i) is 1 and the position is a positive sample; otherwise c*(x_i, y_i) is 0. The α-balanced focal loss function FL is expressed as:
FL(p, c*) = -α (1 - p)^γ log(p) when c* = 1, and -(1 - α) p^γ log(1 - p) when c* = 0
where α is the weighting factor, γ ≥ 0 is the adjustable focusing parameter, and typically α and γ are set to 0.25 and 2.0, respectively. The total classification loss function L_cls for feature map F_i is:
L_cls = Σ_{(x_i, y_i) ∈ F_i} FL(p(x_i, y_i), c*(x_i, y_i));
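A compact sketch of the α-balanced focal loss over one feature level, assuming sigmoid classification outputs; the target tensor encodes c* (1 for positive locations, 0 otherwise) and the function and variable names are our own.

import torch

def focal_loss(pred_prob: torch.Tensor, target: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """Sum of the alpha-balanced focal loss over all locations of one feature map."""
    eps = 1e-6
    p = pred_prob.clamp(eps, 1 - eps)
    pos = -alpha * (1 - p) ** gamma * torch.log(p)            # c* = 1 term
    neg = -(1 - alpha) * p ** gamma * torch.log(1 - p)        # c* = 0 term
    return torch.where(target == 1, pos, neg).sum()

probs = torch.sigmoid(torch.randn(1, 1, 56, 48))   # classification branch output for F3
labels = (torch.rand(1, 1, 56, 48) < 0.05).long()  # sparse positive locations
print(focal_loss(probs, labels))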
Step 3-3, setting the centerness loss function: given a positive sample feature point (x_i, y_i) in feature map F_i with the distance frame regression target d* = (l*, t*, r*, b*), where l*, t*, r*, b* respectively represent the distances from the feature point (x_i, y_i) to the left, upper, right and lower sides of the true frame, the centerness target centerness*(x_i, y_i) of the feature point (x_i, y_i) is defined as:
centerness*(x_i, y_i) = sqrt( (min(l*, r*) / max(l*, r*)) × (min(t*, b*) / max(t*, b*)) )
Obviously centerness*(x_i, y_i) ∈ [0, 1], so during training the centerness branch adopts a binary cross-entropy loss function. Let the predicted centerness obtained by the centerness branch at position (x_i, y_i) of feature map F_i be ô(x_i, y_i); then the total centerness loss function L_ctn of feature map F_i is:
L_ctn = Σ_{(x_i, y_i) ∈ F_i} 1{c*(x_i, y_i) = 1} · BCE(ô(x_i, y_i), centerness*(x_i, y_i))
where c*(x_i, y_i) indicates whether position (x_i, y_i) is a positive (c* = 1) or negative (c* = 0) sample, 1{·} is the indicator function, whose value is 1 if the condition in parentheses holds and 0 otherwise, and BCE(·, ·) denotes the binary cross-entropy;
Step 3-4, setting the distance regression loss function: the regression task adopts the GIoU loss function. Let the predicted distance be d̂ = (l̂, t̂, r̂, b̂), where l̂, t̂, r̂, b̂ respectively represent the distances from the feature point (x_i, y_i) to the left, upper, right and lower sides of the predicted frame, and let the distance regression target be d* = (l*, t*, r*, b*), where l*, t*, r*, b* respectively represent the distances from the feature point (x_i, y_i) to the left, upper, right and lower sides of the true frame. If position (x_i, y_i) falls into multiple GT frames, the frame with the smallest area is selected as the distance regression target. Regarding the frames corresponding to d̂ and d* as bounding boxes B̂ and B*, the GIoU loss function is expressed as:
L_GIoU(d̂, d*) = 1 - GIoU(B̂, B*) = 1 - ( IoU(B̂, B*) - |C \ (B̂ ∪ B*)| / |C| )
where C is the smallest enclosing box of B̂ and B*, |·| represents the area of a region, |C \ (B̂ ∪ B*)| represents the area of the part of region C that contains neither B̂ nor B*, IoU(B̂, B*) is the intersection-over-union of B̂ and B*, and GIoU(B̂, B*) is the generalized intersection-over-union of B̂ and B*.
The total distance regression loss function for feature map F_i is:
L_reg = Σ_{(x_i, y_i) ∈ F_i} 1{c*(x_i, y_i) = 1} · centerness*(x_i, y_i) · L_GIoU(d̂(x_i, y_i), d*(x_i, y_i))
where 1{·} and c*(x_i, y_i) have the same meaning as in step 3-3, and the element-level loss weighting coefficient is centerness*(x_i, y_i), the centerness regression target of position (x_i, y_i);
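A minimal sketch of the GIoU loss for a single location, reconstructing boxes from the (l, t, r, b) distances around a point; the function names are our own.

def giou_loss(point, d_pred, d_target):
    """point: (x, y); d_pred/d_target: (l, t, r, b) distances to the box edges. Returns 1 - GIoU."""
    (x, y) = point
    def to_box(d):
        l, t, r, b = d
        return (x - l, y - t, x + r, y + b)   # (x1, y1, x2, y2)
    def area(box):
        return max(box[2] - box[0], 0.0) * max(box[3] - box[1], 0.0)
    bp, bt = to_box(d_pred), to_box(d_target)
    inter = (max(min(bp[2], bt[2]) - max(bp[0], bt[0]), 0.0) *
             max(min(bp[3], bt[3]) - max(bp[1], bt[1]), 0.0))
    union = area(bp) + area(bt) - inter
    # Smallest enclosing box C of both boxes.
    c = (min(bp[0], bt[0]), min(bp[1], bt[1]), max(bp[2], bt[2]), max(bp[3], bt[3]))
    iou = inter / union
    giou = iou - (area(c) - union) / area(c)
    return 1.0 - giou

print(giou_loss((50, 50), (10, 10, 10, 10), (12, 8, 8, 12)))  # ~0.34 for these overlapping boxes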
Step 3-5, setting the Mask loss function: the prediction frame head in front of the Mask Head yields a number of proposed bounding boxes, and at most 100 RoIs per image are obtained according to a score threshold of 0.05 and a non-maximum suppression IoU threshold of 0.6. To speed up convergence and improve detection performance, the GT frames are also added to the RoIs for network training. Let there be N RoIs in total after adding the GT frames and K GT frames in total. The IoU between each RoI and each GT is calculated, and the RoIs are divided into positive and negative samples according to an IoU threshold of 0.5, with positive samples labeled 1 and negative samples labeled 0; a dictionary D is obtained that maps the i-th RoI (i ∈ [0, N)) to the j-th GT frame that best matches it together with the related GT frame information. Finally, sampling is performed so that positive samples account for 1/4 of all samples, obtaining the training samples X for the Mask Head.
Since the Mask loss is defined only on positive samples, foreground screening of the training samples X is also required. Suppose the prediction frame head yields M positive-sample RoIs; the i-th RoI_i passes through the region-of-interest alignment RoIAlign to yield a feature map F_i of size 14 × 14, and F_i passes through the Mask Head to obtain a predicted feature map pred_i of size 28 × 28. The Mask target gt_mask_i is obtained by first finding the GT information (including category, frame and polygon Mask) according to the RoI index in D, then cropping the frame region on the original Mask according to the frame information, and finally resizing the cropped part to 28 × 28 to obtain the final gt_mask_i. L_mask is computed using the average binary cross-entropy loss function (averaged over the 28 × 28 mask pixels):
L_mask = -Σ_{i=1}^{M} (1 / (28 × 28)) Σ_{(x, y)} [ gt_mask_i(x, y) · log(pred_i(x, y)) + (1 - gt_mask_i(x, y)) · log(1 - pred_i(x, y)) ]
where pred_i(x, y) and gt_mask_i(x, y) are the feature values of pred_i and gt_mask_i at position (x, y);
Step 3-6, setting the maskIoU loss function: let the predicted value obtained by passing the i-th concatenated feature map through the MaskIoU Head, which predicts the mask intersection-over-union, be pred_maskiou_i, and let the maskIoU target be gt_maskiou_i, obtained in step 3-5 as the IoU between the predicted Mask and the Mask information in the corresponding GT. L_maskiou is calculated with an ℓ2 (squared error) loss function, following Mask Scoring R-CNN:
L_maskiou = Σ_{i=1}^{M} (pred_maskiou_i - gt_maskiou_i)²
Step 3-7, setting the multi-task loss function: the classification loss function, centerness loss function, distance regression loss function, Mask loss function and maskIoU loss function are added to obtain the total multi-task loss function L of the mini-batch:
L = (λ_cls / N_cls) L_cls + (λ_ctn / N_ctn) L_ctn + (λ_reg / N_reg) L_reg + (λ_mask / N_mask) L_mask + (λ_maskIoU / N_maskIoU) L_maskIoU
where N_cls, N_ctn, N_reg, N_mask and N_maskIoU are the normalization coefficients of the corresponding loss functions, N_cls = N_ctn = N_reg is the number of positive samples in the prediction frame head, and N_mask = N_maskIoU is the number of positive samples obtained from the proposals predicted by the prediction frame head according to the IoU threshold and the sampling ratio of positive and negative samples. The balance coefficients λ_cls, λ_ctn, λ_reg, λ_mask and λ_maskIoU of each loss are all 1;
Step 3-8, setting the relevant parameters of network learning and training.
The steps 3-8 comprise: transfer learning is performed using the weights of VoVNetV2-99 pre-trained on ImageNet. On the prediction frame head results, the 1000 proposals with the highest confidence are selected, and 100 proposals are retained after non-maximum suppression (NMS) with an IoU threshold of 0.6. Proposals whose IoU with a ground-truth frame is greater than 0.5 are regarded as positive samples in the Mask Head, otherwise as negative samples, and the number of positive samples used for training accounts for 1/4 of all training samples. Meanwhile, the network training adopts mini-batch stochastic gradient descent to optimize the network model, with a batch size of 2 and a learning rate of 0.0025; a Warmup strategy is adopted in the initial 1000 iterations, and Momentum with a coefficient of 0.9 is used. For regularization, a weight decay coefficient of 10^-4 is adopted. The whole training lasts 24 epochs, with a step learning-rate decay strategy applied at the 16th and 22nd epochs with a decay factor of 0.1. After the learning parameters are set, the constructed network is trained with the training set processed in step 3-1;
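The optimizer and schedule described in steps 3-8 can be written, for instance, as the following PyTorch sketch (SGD with momentum 0.9, learning rate 0.0025, weight decay 10^-4, warmup over the first 1000 iterations, and step decay by 0.1 at epochs 16 and 22); the model variable is a placeholder for the constructed instance segmentation network and the warmup ratio is an assumption.

import torch

model = torch.nn.Conv2d(3, 1, 3)  # placeholder for the constructed instance segmentation network
optimizer = torch.optim.SGD(model.parameters(), lr=0.0025, momentum=0.9, weight_decay=1e-4)

# Step decay by 0.1 at epochs 16 and 22 (24 epochs in total).
epoch_scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[16, 22], gamma=0.1)

def warmup_factor(iteration: int, warmup_iters: int = 1000, warmup_ratio: float = 0.001) -> float:
    """Linear warmup of the learning rate over the first 1000 iterations (ratio is an assumption)."""
    if iteration >= warmup_iters:
        return 1.0
    return warmup_ratio + (1.0 - warmup_ratio) * iteration / warmup_iters

for it in (0, 500, 1000):
    print(it, 0.0025 * warmup_factor(it))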
the step 4 of the invention comprises:
step 4-1, performing the same processing as in step 3-1 on the image needing instance segmentation, except that only single-scale scaling is performed on the image during inference, namely the long side is 1333 and the short side is 800;
Step 4-2, the trained network performs forward computation on the test image obtained in step 1-3: in the prediction frame head, proposals with a confidence lower than 0.05 are removed, the 1000 proposals with the highest confidence are screened out, and at most 50 RoIs are obtained through non-maximum suppression (NMS) for Mask branch prediction. Finally, the k-th Mask is taken according to the category k predicted by the classification branch (the data used herein contain only one category), the 28 × 28 Mask is scaled to the size of the corresponding RoI, and binarization with a threshold of 0.5 generates the final Mask;
Step 4-3, the Top-50 proposals pass through the maskIoU head to predict the Mask IoU, which is multiplied by the classification confidence to obtain the Mask confidence;
Step 4-4, according to the Masks generated in step 4-2 and the corresponding Mask confidences obtained in step 4-3, non-maximum suppression (NMS) with an IoU threshold of 0.5 is applied to screen out the final predicted Masks;
and 4-5, carrying out mesoscale convection system example segmentation on the stationary satellite infrared cloud pictures of the Jianghuai region of China at adjacent moments to obtain mesoscale convection system examples at continuous moments.
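A small sketch of the mask post-processing in steps 4-2 and 4-3: rescaling the 28 × 28 mask to the RoI, binarizing at 0.5, and rescoring with the predicted mask IoU; box coordinates and function names are illustrative.

import torch
import torch.nn.functional as F

def paste_mask(mask_28, box, score_cls, score_maskiou, image_size, threshold=0.5):
    """mask_28: (28, 28) probabilities; box: (x1, y1, x2, y2). Returns binary mask and mask confidence."""
    x1, y1, x2, y2 = [int(round(v)) for v in box]
    h, w = max(y2 - y1, 1), max(x2 - x1, 1)
    resized = F.interpolate(mask_28[None, None], size=(h, w), mode="bilinear", align_corners=False)[0, 0]
    full = torch.zeros(image_size)
    full[y1:y2, x1:x2] = (resized > threshold).float()      # binarize at 0.5 inside the RoI
    return full, score_cls * score_maskiou                   # mask confidence = cls score x predicted mask IoU

mask, conf = paste_mask(torch.rand(28, 28), (100, 60, 220, 180), 0.9, 0.8, (360, 420))
print(mask.sum().item(), conf)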
The step 5 of the invention comprises:
Step 5-1, for each mesoscale convection system instance, obtain the centroid coordinate (x̄, ȳ), the characteristic area, and the intensity P:
x̄ = (1/N) Σ_{i=1}^{N} x_i,  ȳ = (1/N) Σ_{i=1}^{N} y_i,  area = N
where N is the total number of pixels covered by the mesoscale convection system instance, x_i and y_i are the abscissa and ordinate of the i-th pixel, f(i) is the brightness temperature value of the i-th pixel, and the intensity P is computed from f_z = Σ_{i=1}^{N} f(i), the accumulated brightness temperature of the pixels in the instance; the image resolution needs to be combined in the subsequent calculations;
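A minimal NumPy sketch of step 5-1, computing the centroid, pixel-count area and accumulated brightness temperature from a binary instance mask; taking the intensity as the mean brightness temperature of the instance is our assumption.

import numpy as np

def mcs_properties(mask: np.ndarray, brightness_temp: np.ndarray):
    """mask: (H, W) bool instance mask; brightness_temp: (H, W) brightness temperatures."""
    ys, xs = np.nonzero(mask)
    n = xs.size                                   # area in pixels (convert with the 4 km resolution later)
    centroid = (xs.mean(), ys.mean())             # (x_bar, y_bar)
    f_z = brightness_temp[mask].sum()             # accumulated brightness temperature
    intensity = f_z / n                           # assumed: mean brightness temperature of the instance
    return centroid, n, intensity

m = np.zeros((360, 420), dtype=bool)
m[100:140, 200:260] = True
print(mcs_properties(m, np.full((360, 420), 210.0)))  # ((229.5, 119.5), 2400, 210.0)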
Step 5-2, for mesoscale convection systems at two adjacent times, the duration, centroid position change, area change and intensity change are computed in time order; if, for two mesoscale convection systems at adjacent times, the centroid position change is no more than 50 m/s, the area change is no more than 5 km², and the intensity change is no more than 0.001 °C/s, they are initially judged to be the same target;
Step 5-3, the splitting and merging phenomena that commonly occur in mesoscale convection systems also need to be handled. If n mesoscale convection systems MCS at a certain time, recorded as X_1, X_2, ..., X_n, all satisfy the tracking matching principle of step 5-2 with one MCS at the previous time, recorded as P_j, the phenomenon belongs to the splitting case, and the X_m (1 ≤ m ≤ n) with the largest area after splitting is selected to continue P_j; the previous-time index value of the track of X_m is updated to j and its duration to the duration of P_j plus 1, while the other MCSs are regarded as newly generated, with an empty previous-time index value and a duration initialized to 1;
If more than two MCSs at the previous time, recorded as P_1, P_2, ..., P_z, all satisfy the MCS tracking matching principle with one MCS at the current time, denoted X_a, the phenomenon belongs to the merging case; the P_m (1 ≤ m ≤ z) with the largest area at the previous time is selected as the previous-time track of X_a, the previous-time index value of the track of X_a is updated to m and its duration to the duration of P_m plus 1, and the life cycles of the previous-time MCSs whose areas are not the largest are ended. When the life cycle of an MCS ends, it is judged whether its duration is no less than 1 h; if so, the corresponding MCS path is recovered in reverse order according to the previous-time index values, thereby realizing the tracking of the MCS.
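The matching and split handling of steps 5-2 and 5-3 can be sketched as follows; the MCS record fields, the 30-minute time step, and the threshold units are our assumptions for illustration.

from dataclasses import dataclass
from math import hypot
from typing import Optional

@dataclass
class MCS:
    centroid: tuple            # (x, y) in km
    area: float                # km^2
    intensity: float           # brightness-temperature based intensity (assumed: degrees C)
    duration: int = 1
    prev_index: Optional[int] = None   # index of the matched MCS at the previous time

def is_same_target(prev: MCS, cur: MCS, dt_s: float = 1800.0) -> bool:
    """Step 5-2 criteria: centroid speed <= 50 m/s, area change <= 5 km^2, intensity change <= 0.001 C/s."""
    speed = hypot(cur.centroid[0] - prev.centroid[0], cur.centroid[1] - prev.centroid[1]) * 1000.0 / dt_s
    return (speed <= 50.0 and abs(cur.area - prev.area) <= 5.0
            and abs(cur.intensity - prev.intensity) / dt_s <= 0.001)

def link(prev_list, cur_list):
    """For each previous MCS, let the largest matching current MCS continue its track (split handling);
    unmatched current MCSs start new tracks with duration 1."""
    for j, prev in enumerate(prev_list):
        children = [c for c in cur_list if is_same_target(prev, c)]
        if children:
            largest = max(children, key=lambda c: c.area)
            largest.prev_index, largest.duration = j, prev.duration + 1
    return cur_list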
Beneficial effects: MCS identification and tracking are core concerns of meteorological disaster forecasting. In the past, MCS identification methods were usually based on traditional image characteristics; such methods depend on the selection of judgment thresholds, and the whole process involves many image processing techniques and is rather complicated. The invention adopts a method based on a deep convolutional neural network to identify the MCS and, through an anchor-free fully convolutional formulation, avoids the sensitivity of detection methods based on heuristically preset anchor frames to parameters such as anchor frame size, aspect ratio and number. Meanwhile, the invention further improves the modeling capability for the geometric deformation of the MCS by incorporating deformable convolution, and focuses more on important channel and spatial information. Compared with other deep learning segmentation methods, the method not only achieves better segmentation performance on MCS identification, but also has fewer network parameters and faster training and inference.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
As shown in fig. 1, the workflow of identification and tracking of the mesoscale convection system constructed by the method of the present invention can be roughly divided into four stages: the first stage, the data of original satellite data is preprocessed and labeled; the second stage, constructing an instance segmentation network model; in the third stage, training and deducing a network model; and a fourth stage, tracking the detected continuous time mesoscale convection system example. The method for identifying and tracking the mesoscale convection system in the embodiment of the invention specifically comprises the following construction steps:
Step 1: since mesoscale convection systems mostly occur in summer, geostationary satellite infrared brightness temperature data from June to September of the years 2000 to 2017, provided by the National Oceanic and Atmospheric Administration, are partially and randomly screened to serve as the working data of the embodiment of the invention. The screened original satellite data then need to be preprocessed:
(1) reading an array of each original geostationary satellite infrared brightness temperature data file in a format of 2 x 3298 x 9896 according to the related description of the satellite data to obtain a gray cloud chart as shown in fig. 2 at corresponding time points of 0 and 30, wherein a blank part with a pixel value of 255 in the gray cloud chart is regarded as a missing value;
(2) Since the remote sensing satellite image has a size of 3298 × 9896, which is too large to be fed directly into the subsequent instance segmentation network, the gray cloud image obtained in (1) is cropped. According to the latitude range 27N to 40N and the longitude range 110E to 125E of the Jianghuai region of China, combined with the relevant description of the brightness temperature data, the size of the Jianghuai region of China in the image is obtained as follows:
the result is slightly enlarged, and the whole gray cloud image is cut into a number of 420 × 360 sub-images;
(3) and giving polygon example level labels of the mesoscale convection system for each gray level infrared cloud subgraph, filtering out subgraphs without the mesoscale convection system, and obtaining json files corresponding to each image, wherein the whole label only has one category. And randomly dividing the sub-image into 9407 training set images, 3135 verification set images and 3136 test set images at a ratio of 6:2: 2;
step 2, constructing an anchor-frame-free mesoscale convection system example segmentation convolutional neural network, wherein the example segmentation network structure is shown in fig. 3, and the example segmentation network structure in fig. 3 comprises: a backbone network for extracting deep abstract features of the image, wherein the backbone network comprises five stage feature maps of C1, C2, C3, C4 and C5; the feature pyramid network fusing the multi-scale features also comprises five feature layers P3, P4, P5, P6 and P7 with different scales; a network header for a predicted bounding box, comprising two large branches: classifying large Classication branches, large branches with the centrality Center-Ness branch and the distance frame Regression branch in parallel, wherein the head is shared by feature pyramid layers with different scales; a network head of a prediction Mask, which contains a space attention mechanism SAM and needs to perform region-of-interest alignment RoIAlign before entering the head; the input feature map is obtained by performing maximum pooling Maxpooling downsampling on a Mask head output feature map and performing cascade operation Concat on a feature map obtained by RoIAlign;
and 2-1, constructing a backbone network for feature extraction. The backbone network in the split network of the embodiment of the invention adopts a convolutional neural network VoVNetV2-99 with a total of 99 layers. The convolutional layers in the backbone network are all in the form of Conv-BN-ReLU, namely, the combination of the convolutional layers, the batch normalization layer BN and the linear rectification function ReLU is sequentially carried out. Meanwhile, cross-layer connection in a similar residual error network ResNet is adopted in the network structure block to realize identity mapping, and the residual error block containing the identity mapping is defined as follows:
Y = F(X, {W_i}) + X
where X and Y respectively represent the input and output feature maps of each building block, and F(X, {W_i}) is the residual mapping function that needs to be learned. Meanwhile, in order to improve the quality of the feature representation, a channel attention eSE (Effective Squeeze-Excitation) mechanism is also introduced into the network residual block, so that the network focuses more on important feature map channels and suppresses irrelevant channels; the specific implementation is as follows:
A_eSE(X_div) = σ(W_C(gap(X_div)))
where X_div is the diversified feature map obtained by dimensionality reduction after the concatenation of the network feature maps, gap(X) = (1/(W × H)) Σ_{i,j} X_{i,j} (W and H are the width and height of feature map X) is the channel-level global average pooling, X_{i,j} is the feature value of feature map X at (i, j), W_C is the fully connected layer weight, σ denotes the Sigmoid activation function, and A_eSE(X_div) is the computed channel attention feature descriptor; element-level multiplication (denoted ⊗) of this descriptor with X_div finally yields the refined feature map X_refine, i.e. X_refine = A_eSE(X_div) ⊗ X_div. The residual block structure combining the identity mapping and the channel attention mechanism is shown in fig. 4. The structure of the whole backbone network VoVNetV2-99 specifically includes the following:
a. Stem stage 1: this stage contains three convolutional layers: first, a convolutional layer with a convolution kernel size of 3 × 3, a step size of 2, padding of 1 and 64 output channels (unless otherwise stated, a convolutional layer with a 3 × 3 kernel defaults to a step size of 1 and padding of 1) is used to downsample the input image; it is followed by a convolutional layer with a 3 × 3 kernel and 64 output channels and a convolutional layer with a 3 × 3 kernel, a step size of 2 and 128 output channels. After the input image passes through this stage, the first scale feature map C1 is generated, and feature map C1 has an Output Stride of 4 with respect to the input image;
b. One-Shot Aggregation (OSA) module stage 2: this stage comprises one residual block, which contains 5 convolutional layers with a convolution kernel size of 3 × 3 and 128 output channels. After the input passes through the five convolutional layers in turn, diversified feature maps with 128 channels each are obtained, and at the last convolutional layer the feature map of this layer, the feature maps of the first 4 layers and the input first scale feature map C1 are concatenated to obtain a feature map with 128 × 6 = 768 channels. Dimensionality reduction is then carried out through a convolutional layer with a kernel size of 1 × 1, a step size of 1, padding of 0 and 256 output channels to obtain the diversified feature map X_div, and the channel attention mechanism eSE is applied to obtain the final X_refine; X_refine also needs to be added element-wise to the input feature map to realize the identity mapping. Finally, this stage yields the second scale feature map C2, and feature map C2 still has an Output Stride of 4 with respect to the input image;
c. OSA module stage 3: this stage is similar to OSA module stage 2, except that it first performs 2-fold downsampling using a 3 × 3 max pooling layer with step size 2 and padding 0, and employs 3 × 3 modulated deformable convolution in the residual block instead of conventional convolution; the modulated deformable convolution can be defined as:
y(p) = Σ_{k=1}^{K} w_k · x(p + p_k + Δp_k) · Δm_k
where K is the total number of convolution kernel sampling locations (e.g., K is 9 for a convolution kernel size of 3 × 3), w_k is the convolution kernel weight, p_k is the predefined offset of the k-th position with respect to the center of the receptive field (e.g., when K is 9, p_k ∈ {(-1,-1), (-1,0), (-1,1), (0,-1), (0,0), (0,1), (1,-1), (1,0), (1,1)}), x(p) and y(p) are respectively the feature values of the input feature map x and the output feature map y at position p, Δp_k is the learnable offset of the k-th position, and Δm_k ∈ [0,1] is the modulation scalar of the k-th position. The modulated deformable convolution is implemented by adding, alongside the conventional convolution, an additional convolutional layer with the same spatial resolution and dilation rate as the conventional convolution to learn the offset in the x and y directions of the two-dimensional plane and the modulation scalar for each position of the feature map.
The OSA module stage 3 contains 3 residual blocks, each of which contains 5 modulated deformable convolutional layers with a convolution kernel size of 3 × 3 and 160 output channels. As in the OSA module stage 2 of step 2-1-2, at the last convolutional layer the feature map of this layer, the feature maps of the first 4 layers and the input second scale feature map C2 are concatenated to obtain a diversified feature map with 256 + 160 × 5 = 1056 channels (this is the diversified feature map of the first residual block; similarly, the number of diversified feature map channels obtained after the concatenation operations of the second and third residual blocks is 512 + 160 × 5 = 1312). The number of output channels of the conventional 1 × 1 convolutional layer with a step size of 1 and padding of 0 used for dimensionality reduction after the feature map concatenation is 512, and the channel attention mechanism is also included. After the operation of this stage, the third scale feature map C3 is finally obtained, and feature map C3 has an Output Stride of 8 with respect to the input image;
d. OSA module stage 4: this stage is similar to the OSA module stage 3 described above, except that it contains 9 residual blocks, each of which contains 5 modulated deformable convolutional layers with a convolution kernel size of 3 × 3 and 192 output channels. As in OSA module stages 2 and 3, at the last convolutional layer the feature map of this layer, the feature maps of the first 4 layers and the input third scale feature map C3 are concatenated to obtain a diversified feature map with 512 + 192 × 5 = 1472 channels (this is the diversified feature map of the first residual block; similarly, the number of diversified feature map channels obtained after the concatenation operations of the second to ninth residual blocks is 768 + 192 × 5 = 1728). The number of output channels of the conventional 1 × 1 convolutional layer with a step size of 1 and padding of 0 used for dimensionality reduction after the feature map concatenation is 768, and the channel attention mechanism is also included. The fourth scale feature map C4 is finally obtained after these convolutional layers, and feature map C4 has an Output Stride of 16 with respect to the input image;
e. OSA module stage 5: the structure of this stage is similar to stage 4, but it contains only 3 residual blocks, each of which contains 5 modulated deformable convolutional layers with a convolution kernel size of 3 × 3 and 224 output channels. As in OSA module stages 2 and 3, at the last convolutional layer the feature map of this layer, the feature maps of the first 4 layers and the input fourth scale feature map C4 are concatenated to obtain a diversified feature map with 768 + 224 × 5 = 1888 channels (this is the diversified feature map of the first residual block; similarly, the number of diversified feature map channels obtained after the concatenation operations of the second and third residual blocks is 1024 + 224 × 5 = 2144). The number of output channels of the conventional 1 × 1 convolutional layer with a step size of 1 and padding of 0 used for dimensionality reduction after the feature map concatenation is 1024, and the channel attention mechanism is also included. After this stage, the fifth scale feature map C5 is finally obtained, and feature map C5 has an Output Stride of 32 with respect to the input image;
Step 2-2, fusing the multi-scale features with a Feature Pyramid Network (FPN). The FPN combines the different scale features {C3, C4, C5} obtained in step 2-1 in a top-down manner with lateral connections to fuse the features and obtain {M3, M4, M5}. M5 is obtained by passing the feature map C5 through a 1 × 1 convolutional layer; M4 is obtained by element-level addition of the nearest-neighbor 2-fold upsampling of M5 and the feature map obtained by passing C4 through a 1 × 1 convolutional layer; M3 is obtained by element-level addition of the nearest-neighbor 2-fold upsampling of M4 and the feature map obtained by passing C3 through a 1 × 1 convolutional layer. Finally, for each layer of {M3, M4, M5}, a convolution with a 3 × 3 kernel and 256 output channels alleviates the aliasing caused by nearest-neighbor interpolation, giving the feature layers {P3, P4, P5}. In the segmentation network of this example, P6 and P7 feature layers are additionally added; they are obtained by 2-fold downsampling of P5 and P6, respectively, through a 3 × 3 convolutional layer with a step size of 2, finally yielding the feature layers {P3, P4, P5, P6, P7};
Step 2-3, constructing the instance segmentation frame prediction head; this part comprises 2 branches in total: a classification branch, and a multi-task branch in which centerness (center-ness) prediction and distance frame regression are parallel:
Step 2-3-1, constructing the classification branch: after each feature layer (i.e., P3-P7) of the FPN, three conventional convolutional layers with 256 input and output channels and a 3 × 3 convolution kernel and one modulated deformable convolutional layer are connected in sequence; these four convolutional layers do not use Batch Normalization (BN) but Group Normalization (GN), to avoid the influence of the batch size on the network model. Since only the single category of the mesoscale convection system is detected, a convolutional layer for prediction classification is added at the end of the branch, with 256 input channels, 1 output channel and a 3 × 3 convolution kernel;
Step 2-3-2, constructing the multi-task branch in which centerness and distance frame regression are parallel: after the FPN feature layer of each scale, this part likewise connects in sequence three conventional convolutional layers with the same structure as in the classification branch and one modulated deformable convolutional layer. However, these four convolutional layers are followed by two parallel convolutional branches: a 3 × 3 convolutional layer with 4 output channels for frame regression, whose four output values respectively represent the distances from the current position to the four edges of the frame; and a 3 × 3 convolutional layer with 1 output channel for predicting centerness, whose output is a one-dimensional centerness value;
Assuming that the distance d from a certain sample point (x, y) on the feature map F to the four sides of the target frame to which the point belongs is d = (l, t, r, b), where l, t, r, b respectively represent the distances to the left, upper, right and lower sides of the rectangular frame, the centerness is defined as:
centerness = sqrt( (min(l, r) / max(l, r)) × (min(t, b) / max(t, b)) )
where min () and max () are minimum and maximum taking functions, respectively. When the example segmented convolutional neural network is used for detection, the predicted centrality value is multiplied by the classification score to obtain the final confidence. The centrality branch mainly suppresses low-quality frames farther from the center point of the target object, and it can be found from the centrality definition that if the sample point (x, y) is closer to the center of the target frame, the centrality value is closer to 1, otherwise, the centrality value is closer to 0, so that the classification of the farther points is multiplied by a smaller centrality value to obtain a lower confidence, so that the low-quality frames regressed by the farther points are more easily filtered in the Non-Maximum Suppression (NMS) stage.
Step 2-4: similarly to the region proposal network RPN in anchor-based detectors such as Mask R-CNN, the frame prediction head predicts a number of frame regions, and 100 RoIs (regions of interest) are obtained after part of the regions are eliminated through non-maximum suppression with an IoU threshold of 0.6. Following the implementation principle of Mask R-CNN, the RoI head of the anchor-free instance segmentation network is constructed as follows:
Step 2-4-1, constructing the region-of-interest alignment RoI Align layer: this layer first maps each region of interest RoI, according to its size, to the corresponding FPN feature layer P_k, implemented as:
k = Ceil(k_max - log2(A_input / A_RoI))
where A_input and A_RoI respectively represent the areas of the input image and the RoI, k_max is the level number of the last layer of the backbone network and is set to 5, Ceil() is the ceiling (round-up) function, and k is the level number of the FPN feature layer to which the RoI is mapped;
then the layer divides the corresponding region on the mapped feature map into 14 × 14 = 196 small areas of equal size, samples each area adaptively by taking its center point position, computes the value by bilinear interpolation, and finally obtains a feature map of fixed size 14 × 14;
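As a minimal illustration of the above RoI-to-FPN-layer assignment, the following Python sketch computes k for a given RoI (the clamping of k to the P3-P7 range is an assumption for illustration and is not stated explicitly in this step):

import math

def assign_fpn_level(roi_area: float, input_area: float,
                     k_max: int = 5, k_min: int = 3, k_top: int = 7) -> int:
    """Map an RoI to the FPN layer k = Ceil(k_max - log2(A_input / A_RoI))."""
    k = math.ceil(k_max - math.log2(input_area / roi_area))
    return max(k_min, min(k_top, k))  # clamp to the available FPN layers (assumed P3-P7)

# Usage example: a 128 x 128 RoI inside an 800 x 1333 input image
print(assign_fpn_level(roi_area=128 * 128, input_area=800 * 1333))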
step 2-4-2, constructing a Mask head containing a spatial attention mechanism: this part contains four consecutive convolutional layers with 3 × 3 kernels and 256 input/output channels. To make the branch focus more on meaningful pixels, a spatial attention module is introduced after the fourth convolutional layer, implemented as follows:
A_sag(X_i) = σ( F_3×3( [P_max(X_i); P_avg(X_i)] ) )
X_sag = A_sag(X_i) ⊗ X_i
where X_i is the input feature map, A_sag(X_i) is the spatial attention feature descriptor, P_max and P_avg denote the feature maps obtained by maximum pooling and average pooling over the channel dimension, [ ; ] denotes the concatenation (cascade) operation, F_3×3 is a 3 × 3 convolutional layer, σ is the Sigmoid function, ⊗ is element-wise multiplication, and X_sag is the final feature map incorporating spatial attention;
then the obtained X_sag is upsampled by a deconvolution layer with a 2 × 2 kernel and stride 2 to obtain a feature map of size 28 × 28 with the same number of channels. The last convolutional layer of the Mask head is a class-specific mask prediction layer; since the detection target is the single mesoscale convection system class, this prediction layer has a 1 × 1 kernel, stride 1 and 1 output channel;
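A minimal PyTorch-style sketch of the spatial attention module described above is given below (the class name and exact module arrangement are illustrative assumptions, not the literal network definition):

import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """A_sag(X) = sigmoid(Conv3x3([maxpool_c(X); avgpool_c(X)])); X_sag = A_sag(X) * X."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        p_max = x.max(dim=1, keepdim=True).values  # max pooling over the channel dimension
        p_avg = x.mean(dim=1, keepdim=True)        # average pooling over the channel dimension
        attn = torch.sigmoid(self.conv(torch.cat([p_max, p_avg], dim=1)))
        return x * attn                            # element-wise re-weighting of the input

# Usage: re-weight a 256-channel 14 x 14 RoI feature map
x = torch.randn(1, 256, 14, 14)
print(SpatialAttention()(x).shape)  # torch.Size([1, 256, 14, 14])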
step 2-4-3, constructing a MaskIoU head, which re-scores the mask quality using Mask Scoring: first, on the basis of step 2-4-2, the output feature map of the mask prediction layer is downsampled by a factor of 2 with a 2 × 2 max pooling layer and then concatenated with the 14 × 14, 256-channel RoI feature map output by the RoI Align layer, giving a feature map of size 14 × 14 with 257 channels. The MaskIoU head contains four consecutive convolutional layers with 3 × 3 kernels and 256 output channels, where the last convolutional layer has stride 2. These are followed by 2 fully connected layers with 1024 output channels and 1 fully connected layer with 1 output channel;
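A minimal PyTorch-style sketch of such a MaskIoU head is given below (the ReLU placement and the 14 × 14 input size, which becomes 7 × 7 after the final stride-2 convolution, are assumptions for illustration):

import torch
import torch.nn as nn

class MaskIoUHead(nn.Module):
    """Predict the IoU between the predicted mask and its ground-truth mask."""
    def __init__(self, in_channels: int = 257):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_channels, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.fcs = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256 * 7 * 7, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fcs(self.convs(x))

# Usage: concatenation of the 256-channel RoI feature and the pooled mask (257 channels)
print(MaskIoUHead()(torch.randn(2, 257, 14, 14)).shape)  # torch.Size([2, 1])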
step 3, training the instance segmentation network model, specifically comprising the following steps:
step 3-1, since the infrared cloud images in the dataset are single-channel grayscale images, they are converted into RGB three-channel images for subsequent transfer learning, where the three channels share the same value, namely the original gray value. Data enhancement is then performed on the converted training set images and the corresponding instance labels: the images are first scaled to multiple scales, with the long side fixed at 1333 and the short side chosen at random from {640, 672, 704, 736, 768, 800}; the original proportions of the image are preserved, and random horizontal flipping is also used for data enhancement.
The pixel values are then centered: since training uses a model pre-trained on the ImageNet dataset, the images are normalized according to the ImageNet statistics. Normalization is performed per channel; the means of the R, G, B channels are 123.675, 116.28 and 103.53, and the standard deviations are 58.395, 57.12 and 57.375. Writing the three channel means as a vector μ and the three channel standard deviations as a vector σ, for an input image x the normalized image data x' is:
x' = (x − μ) / σ
Then, the image is padded so that its sides are multiples of 32, to avoid feature loss caused by subsequent convolution operations;
correspondingly, the instance labels of the input image are transformed with the same scaling and horizontal flipping, so that the labels remain correct after data enhancement;
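A minimal Python sketch of this preprocessing is given below; for brevity it performs single-scale resizing only (multi-scale selection and random flipping are omitted), and the nearest-neighbour resize and function name are illustrative assumptions:

import numpy as np

IMAGENET_MEAN = np.array([123.675, 116.28, 103.53], dtype=np.float32)
IMAGENET_STD = np.array([58.395, 57.12, 57.375], dtype=np.float32)

def preprocess(gray: np.ndarray, short_side: int = 800, long_side: int = 1333) -> np.ndarray:
    """Convert a single-channel IR image to a normalized, padded three-channel array."""
    rgb = np.repeat(gray[..., None], 3, axis=2).astype(np.float32)  # three identical channels
    h, w = rgb.shape[:2]
    scale = min(long_side / max(h, w), short_side / min(h, w))      # keep the original proportions
    new_h, new_w = int(round(h * scale)), int(round(w * scale))
    ys = (np.arange(new_h) / scale).astype(int).clip(0, h - 1)      # nearest-neighbour resize
    xs = (np.arange(new_w) / scale).astype(int).clip(0, w - 1)
    rgb = rgb[ys][:, xs]
    rgb = (rgb - IMAGENET_MEAN) / IMAGENET_STD                      # x' = (x - mu) / sigma
    pad_h, pad_w = -new_h % 32, -new_w % 32                         # pad each side to a multiple of 32
    return np.pad(rgb, ((0, pad_h), (0, pad_w), (0, 0)))

print(preprocess(np.random.randint(0, 255, (600, 1000), dtype=np.uint8)).shape)  # (800, 1344, 3)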
step 3-2, setting the classification loss function of the prediction frame head in the instance segmentation network: the classification task uses the Focal Loss to alleviate the class imbalance problem of one-stage detectors. Since the dataset contains only one class of labels, only a single binary classifier needs to be trained. Considering that the network treats each location as a training sample rather than an anchor box, let the predicted value obtained by the classification branch at position (x_i, y_i) of feature map F_i (i = 3, 4, ..., 7) be p_(x_i, y_i), and let Y_(x_i, y_i) indicate whether the sample is positive or negative: Y_(x_i, y_i) = 1 indicates that position (x_i, y_i) is a positive sample, Y_(x_i, y_i) = 0 indicates that position (x_i, y_i) is a negative sample, and p_(x_i, y_i) is the predicted probability that position (x_i, y_i) is a positive sample. Position (x_i, y_i) is mapped to the corresponding position (x', y') of the input image as follows:
(x', y') = ( floor(s_i / 2) + x_i · s_i , floor(s_i / 2) + y_i · s_i )
where s_i is the output stride of feature map F_i relative to the input image scale. If (x', y') falls into any Ground Truth (GT) frame, the position is a positive sample with Y_(x_i, y_i) = 1; otherwise Y_(x_i, y_i) = 0. The α-balanced focal loss can be expressed as:
FL(p_(x_i, y_i)) = −α · Y_(x_i, y_i) · (1 − p_(x_i, y_i))^γ · log(p_(x_i, y_i)) − (1 − α) · (1 − Y_(x_i, y_i)) · p_(x_i, y_i)^γ · log(1 − p_(x_i, y_i))
where α is the weighting factor, γ ≥ 0 is the adjustable focusing parameter, and α and γ are set to 0.25 and 2.0 respectively in the experiments. For feature map F_i, the overall classification loss function is:
L_cls(F_i) = Σ_(x_i, y_i) FL(p_(x_i, y_i))
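A minimal PyTorch-style sketch of this α-balanced focal loss for the single-class case is given below (the function name is hypothetical; p is assumed to be the sigmoid output and y the 0/1 label of each location):

import torch

def focal_loss(p: torch.Tensor, y: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """Alpha-balanced focal loss, summed over all locations of a feature map."""
    eps = 1e-6
    p = p.clamp(eps, 1.0 - eps)
    pos = -alpha * (1.0 - p) ** gamma * torch.log(p)        # term for positive samples (y = 1)
    neg = -(1.0 - alpha) * p ** gamma * torch.log(1.0 - p)  # term for negative samples (y = 0)
    return (y * pos + (1.0 - y) * neg).sum()

# Usage: predictions and labels for all positions of one feature map
p = torch.rand(1000)
y = (torch.rand(1000) > 0.95).float()
print(focal_loss(p, y))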
step 3-3, setting a centrality loss function: for a positive-sample feature point (x_i, y_i) in feature map F_i, let the distance frame regression target be d* = (l*, t*, r*, b*), where l*, t*, r*, b* denote the distances from feature point (x_i, y_i) to the left, upper, right and lower sides of the true frame. The centrality target of feature point (x_i, y_i) is then defined as:
ctn*_(x_i, y_i) = sqrt( (min(l*, r*) / max(l*, r*)) × (min(t*, b*) / max(t*, b*)) )
Obviously ctn*_(x_i, y_i) lies in [0, 1], so during training the centrality branch adopts a binary cross entropy loss function. Let the predicted value obtained by the centrality branch at position (x_i, y_i) of feature map F_i be ctn_(x_i, y_i); the total centrality loss function of feature map F_i is then:
L_ctn(F_i) = Σ_(x_i, y_i) 1{Y_(x_i, y_i) = 1} · BCE( ctn_(x_i, y_i), ctn*_(x_i, y_i) )
where Y_(x_i, y_i) indicates whether the position is a positive or negative sample (Y_(x_i, y_i) = 1 indicates position (x_i, y_i) is a positive sample, Y_(x_i, y_i) = 0 indicates it is a negative sample), and 1{·} is the indicator function, whose value is 1 if the condition in parentheses holds and 0 otherwise;
step 3-4, setting a distance regression loss function: the regression task adopts the GIoU loss function. Let the predicted distances be d = (l, t, r, b), where l, t, r, b denote the distances from feature point (x_i, y_i) to the left, upper, right and lower sides of the predicted frame, and let the regression target be d* = (l*, t*, r*, b*), where l*, t*, r*, b* denote the distances from feature point (x_i, y_i) to the left, upper, right and lower sides of the true frame. If position (x_i, y_i) falls into several GT frames, the frame with the smallest area is selected as the distance regression target. Regarding the distances d and d* as the corresponding bounding frames B and B*, the GIoU loss function can be expressed as:
IoU = |B ∩ B*| / |B ∪ B*|
GIoU = IoU − |C \ (B ∪ B*)| / |C|
L_GIoU(d, d*) = 1 − GIoU
where C is the smallest enclosing frame of B and B*, |·| denotes the area of a region, |C \ (B ∪ B*)| denotes the area of the part of C that does not contain B ∪ B*, IoU is the intersection-over-union of B and B*, and GIoU is the generalized intersection-over-union of B and B*. The total distance regression loss function of feature map F_i is:
L_reg(F_i) = Σ_(x_i, y_i) 1{Y_(x_i, y_i) = 1} · ctn*_(x_i, y_i) · L_GIoU(d, d*)
where Y_(x_i, y_i) and 1{·} have the same meaning as in step 3-3, and the element-level weighting coefficient of the loss function is the centrality regression target ctn*_(x_i, y_i) of position (x_i, y_i);
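A minimal Python sketch of the GIoU loss computed directly from the regressed distances and their targets is given below (the function name is hypothetical; both frames share the same sample point, which is what makes the intersection computation below valid):

def giou_loss(pred, target):
    """GIoU loss for distance-to-edge regression; pred and target are (l, t, r, b) tuples."""
    l, t, r, b = pred
    lg, tg, rg, bg = target
    area_p = (l + r) * (t + b)
    area_g = (lg + rg) * (tg + bg)
    iw = min(l, lg) + min(r, rg)          # intersection width
    ih = min(t, tg) + min(b, bg)          # intersection height
    inter = iw * ih
    union = area_p + area_g - inter
    iou = inter / union
    cw = max(l, lg) + max(r, rg)          # smallest enclosing frame C
    ch = max(t, tg) + max(b, bg)
    area_c = cw * ch
    giou = iou - (area_c - union) / area_c
    return 1.0 - giou

print(giou_loss((10, 10, 10, 10), (8, 12, 9, 11)))  # small loss for two similar frames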
step 3-5, setting a Mask loss function: the prediction frame head, before the Mask Head, produces a number of proposal frames, and at most 100 RoIs per image are obtained with a score threshold of 0.05 and a non-maximum suppression IoU threshold of 0.6. To improve convergence speed and detection performance, the GT frames are also added to the RoIs for network training. Suppose there are N RoIs in total after adding the GT frames and K GT frames; IoU values between the RoIs and the GTs are computed, the RoIs are divided into positive and negative samples according to an IoU threshold of 0.5 (positive sample label 1, negative sample label 0), a dictionary D is obtained that maps the i-th RoI (i ∈ [0, N)) to the index j of its best-matched GT frame together with the related GT information, and finally sampling is performed so that positive samples account for 1/4 of all samples, yielding the training samples X for the Mask Head.
Since the Mask loss is defined only on positive samples, foreground screening of the training samples X is also required. Suppose M positive-sample RoIs are obtained; the i-th RoI_i is passed through the region-of-interest alignment RoIAlign to yield a feature map F_i of size 14 × 14, and F_i passes through the Mask Head to give a prediction feature map pred_i of size 28 × 28. The mask target gt_mask_i is obtained by first finding the GT information (including class, frame and polygon mask) according to the RoI index in D, then cropping the frame region on the original mask according to the frame information, and finally resizing the cropped part to 28 × 28 to obtain the final gt_mask_i. L_mask is computed using an average binary cross entropy loss function:
L_mask = Σ_i (1 / (28 × 28)) Σ_(x, y) BCE( pred_i(x, y), gt_mask_i(x, y) )
where pred_i(x, y) and gt_mask_i(x, y) are the values of the feature maps pred_i and gt_mask_i at position (x, y);
step 3-6, setting a MaskIoU loss function: the predicted value obtained by passing the i-th concatenated feature map through the MaskIoU Head is pred_maskiou_i; the MaskIoU target gt_maskiou_i is obtained, following step 3-5, as the IoU between the predicted mask and the mask information of the corresponding GT. L_maskiou is computed with an l2 loss function:
L_maskiou = Σ_i ( pred_maskiou_i − gt_maskiou_i )²
step 3-7, setting a multi-task loss function: the classification loss, centrality loss, distance regression loss, Mask loss and MaskIoU loss are added to obtain the total multi-task loss of a mini-batch:
L = λ_cls · L_cls / N_cls + λ_ctn · L_ctn / N_ctn + λ_reg · L_reg / N_reg + λ_mask · L_mask / N_mask + λ_maskIoU · L_maskIoU / N_maskIoU
where N_cls, N_ctn, N_reg, N_mask and N_maskIoU are the normalization coefficients of the corresponding loss functions; N_cls = N_ctn = N_reg is the number of positive samples in the prediction frame head, and N_mask = N_maskIoU is the number of positive samples obtained from the proposals predicted by the prediction frame head according to the IoU threshold and the sampling ratio of positive to negative samples. λ_cls, λ_ctn, λ_reg, λ_mask and λ_maskIoU are the balance coefficients of each loss and are all set to 1;
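For illustration, a minimal Python sketch of this weighted combination is given below (the function and argument names are assumptions; all λ coefficients are taken as 1 as stated above):

def multitask_loss(l_cls, l_ctn, l_reg, l_mask, l_maskiou, n_pos_box, n_pos_mask):
    """Mini-batch multi-task loss; each term is normalized by its positive-sample count."""
    n_box = max(n_pos_box, 1)    # N_cls = N_ctn = N_reg
    n_mask = max(n_pos_mask, 1)  # N_mask = N_maskIoU
    return (l_cls / n_box + l_ctn / n_box + l_reg / n_box
            + l_mask / n_mask + l_maskiou / n_mask)

print(multitask_loss(12.0, 3.5, 4.2, 6.1, 0.8, n_pos_box=40, n_pos_mask=25))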
step 3-8, setting the relevant parameters for network learning and training: transfer learning is performed using weights of VoVNetV2-99 pre-trained on ImageNet. The 1000 proposals with the highest confidence are selected from the frame prediction results, and 100 proposals are retained after non-maximum suppression (NMS) with an IoU threshold of 0.6. Proposals whose IoU with a ground-truth frame exceeds 0.5 are regarded as positive samples in the Mask Head, otherwise as negative samples, and the positive samples used for training account for 1/4 of all training samples. The network model is optimized by mini-batch stochastic gradient descent with a batch size of 2 and a learning rate of 0.0025; a warmup strategy is adopted for the first 1000 iterations, and momentum with a coefficient of 0.9 is used. For regularization, a weight decay coefficient of 10^-4 is adopted. Training lasts 24 epochs in total, with a step learning-rate decay of factor 0.1 at the 16th and 22nd epochs. After the learning parameters are set, the constructed network is trained with the training data enhanced in step 3-1;
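A minimal PyTorch-style sketch of these optimization settings is given below (the stand-in model and the exact warmup implementation are assumptions; the training loop itself is omitted):

import torch

model = torch.nn.Conv2d(3, 1, 3)  # stand-in for the instance segmentation network
optimizer = torch.optim.SGD(model.parameters(), lr=0.0025,
                            momentum=0.9, weight_decay=1e-4)

def warmup_lr(iteration: int, base_lr: float = 0.0025, warmup_iters: int = 1000) -> float:
    """Linear warmup over the first 1000 iterations, then the base learning rate."""
    if iteration < warmup_iters:
        return base_lr * (iteration + 1) / warmup_iters
    return base_lr

# Step decay by a factor of 0.1 at epochs 16 and 22 of 24 total (stepped once per epoch)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[16, 22], gamma=0.1)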
step 4, predicting the segmentation result by using the trained example segmentation network, which specifically comprises the following steps:
step 4-1, performing data enhancement on the images requiring instance segmentation in essentially the same way as in step 3-1, except that only single-scale resizing is performed at inference time, i.e., the long side is 1333 and the short side is 800;
step 4-2, the trained network performs forward computation on the test images obtained in step 1-3; proposals with confidence lower than 0.05 are removed in the prediction frame head, the 1000 proposals with the highest confidence are screened out, and at most 50 RoIs are obtained through NMS and used for mask branch prediction. Finally, according to the class k predicted by the classification branch, the k-th mask is taken (the data used here contain only one class), the 28 × 28 mask is scaled to the size of the corresponding RoI, and binarization with a threshold of 0.5 generates the final mask;
step 4-3, the top-50 proposals are passed through the MaskIoU head to predict the MaskIoU, which is multiplied by the classification confidence to obtain the final mask confidence;
step 4-4, non-maximum suppression (NMS) with an IoU threshold of 0.5 is applied to the masks generated in step 4-2 using the corresponding mask confidences obtained in step 4-3, screening out the final predicted masks;
step 4-5, carrying out mesoscale convection system instance segmentation on the geostationary satellite infrared cloud images of the Jianghuai region of China at adjacent times, obtaining the mesoscale convection system instances shown in figure 5;
step 5, tracking the mesoscale convection system over a number of consecutive times according to the relevant target matching principle, while taking into account the ubiquitous splitting and merging phenomena; the specific steps are as follows:
step 5-1, obtaining the centroid coordinates, characteristic area and intensity P of each mesoscale convection system instance:
area = N
where N is the total number of pixel points contained in the mesoscale convection system instance, x_i and y_i are the x and y coordinates of the i-th pixel point, f(i) is the brightness temperature value of the i-th pixel point, and f_z is the accumulated sum of the brightness temperatures of the pixels in the instance; the centroid and intensity are computed from these quantities, and the image resolution (4 km) of this embodiment needs to be combined in subsequent calculations;
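A minimal Python sketch of these per-instance features is given below; where the text leaves the exact formulas implicit, the centroid is taken here as the plain pixel-coordinate mean and the intensity P as the mean brightness temperature f_z / N, both of which are assumptions for illustration:

import numpy as np

def instance_features(mask: np.ndarray, bt: np.ndarray):
    """Centroid, pixel area and mean brightness temperature of one MCS instance.

    mask: boolean array marking the instance pixels; bt: brightness temperatures, same shape.
    """
    ys, xs = np.nonzero(mask)
    n = len(xs)                         # area = N (pixel count)
    f_z = float(bt[ys, xs].sum())       # accumulated brightness temperature
    centroid = (xs.mean(), ys.mean())   # assumed unweighted centroid
    intensity = f_z / n                 # assumed P = f_z / N
    return centroid, n, intensity

mask = np.zeros((100, 100), dtype=bool)
mask[40:60, 30:70] = True
bt = np.full((100, 100), 210.0)
print(instance_features(mask, bt))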
step 5-2, in temporal order, the duration, centroid position change, area change and intensity change of the mesoscale convection systems at two adjacent times are computed. If, for two mesoscale convection systems at adjacent times, the centroid position change is no more than 50 m/s, the area change is no more than 5 km², and the intensity change is no more than 0.001 °C/s, they can be preliminarily judged to be the same target;
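A minimal Python sketch of this matching test is given below (the 4 km pixel spacing and the 30-minute interval follow the data description; treating the area threshold as a total change over the interval, and the function name, are assumptions for illustration):

import math

def is_same_target(c1, c2, area1, area2, p1, p2,
                   pixel_km: float = 4.0, dt_s: float = 1800.0) -> bool:
    """Preliminary same-target test between two MCS instances at adjacent times."""
    dist_m = math.hypot(c2[0] - c1[0], c2[1] - c1[1]) * pixel_km * 1000.0
    speed_ok = dist_m / dt_s <= 50.0                        # centroid change <= 50 m/s
    area_ok = abs(area2 - area1) * pixel_km ** 2 <= 5.0     # area change <= 5 km^2 (assumed total)
    intensity_ok = abs(p2 - p1) / dt_s <= 0.001             # intensity change <= 0.001 degC/s
    return speed_ok and area_ok and intensity_ok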
step 5-3, the splitting and merging phenomena that commonly occur in mesoscale convection systems need corresponding processing. If several mesoscale convection systems (MCSs) at a certain time, denoted X_1, X_2, ..., X_n, all satisfy the tracking matching principle of step 5-2 with one MCS, denoted P_j, recorded at the previous time, the phenomenon belongs to the splitting case: the X_m (1 ≤ m ≤ n) with the largest area after splitting continues the track of P_j, the previous-time index value of X_m is updated to j and its duration to the duration of P_j plus 1, while the other MCSs are regarded as newly appearing clouds whose previous-time index value is empty and whose duration is initialized to 1;
if several MCSs recorded at the previous time, denoted P_1, P_2, ..., P_z, all satisfy the tracking matching principle with a certain MCS P_a at the current time, the phenomenon belongs to the merging case: the P_m (1 ≤ m ≤ z) with the largest area at the previous time is selected as the previous-time track of P_a, the previous-time index value of P_a is updated to m and its duration to the duration of P_m plus 1. The life cycles of the MCSs at the previous time whose area is not the largest end there; it is then judged whether their duration is at least 1 h, and if so, the corresponding MCS path is recovered in reverse order according to the previous-time index values, thereby realizing the tracking of the MCS.
In the tracking result shown in figure 6, the interval between adjacent times is half an hour. Three clouds are recognized at the first time and, similarly, three clouds are recognized at the second time. The characteristic quantities of the clouds at the two adjacent times (centroid displacement, area change and intensity change) are computed and analysed, and the clouds at the second time that satisfy the target matching principle with clouds at the first time are tracked and marked. As can be seen from the second time in figure 6, cloud 3 identified at the first time has no cloud at the second time that meets the target matching principle, so cloud 3 disappears. Clouds 1 and 2 at the first time each have a cloud at the second time that meets the target matching principle, so the corresponding clouds at the second time are marked with the same numbers 1 and 2 in figure 6. Cloud 4 at the second time is a newly appearing cloud at that moment. From the tracked clouds identified at the first and second times, the tracked target clouds at the third and fourth times can be obtained, as shown at the third and fourth times in figure 6. Finally, according to the requirement that the duration be at least 1 h, the MCSs meeting the requirement are screened out; only cloud 2 in figure 6 is a correct MCS, appearing at the first, second and third times and lasting one hour (satisfying the requirement of at least 1 h);
in this embodiment, mesoscale convection system instance segmentation is performed on the publicly available geostationary satellite infrared brightness temperature data provided by the U.S. National Oceanic and Atmospheric Administration. The experimental configuration environment is shown in Table 1, and the experimental results shown in Table 2 are obtained by comparison, under the same configuration environment, with other currently mainstream deep learning instance segmentation methods such as Mask R-CNN, Mask Scoring R-CNN, Cascade Mask R-CNN and HTC; the evaluation adopts part of the COCO dataset standard (i.e., box AP, box AR, mask AP, mask AR and inference time).
Table 1 Experimental configuration Environment
Table 2 example segmentation comparative experiment results
Compared with the existing mainstream instance segmentation methods, the proposed method has clear advantages in accuracy and recall while keeping a detection speed similar to Mask R-CNN without extra complicated computational overhead, which effectively demonstrates its high detection precision and high speed.
The present invention provides a method for identifying and tracking a mesoscale convection system based on anchor-free image detection; there are many methods and ways to implement this technical scheme, and the above description is only a preferred embodiment of the present invention. It should be noted that, for those skilled in the art, a number of improvements and embellishments can be made without departing from the principle of the present invention, and these improvements and embellishments should also be regarded as falling within the protection scope of the present invention. All components not specified in this embodiment can be realized by the prior art.