
CN116797628B - Multi-scale unmanned aerial vehicle aerial photographing target tracking method and device - Google Patents

Multi-scale unmanned aerial vehicle aerial photographing target tracking method and device

Info

Publication number
CN116797628B
CN116797628B CN202310429983.XA
Authority
CN
China
Prior art keywords
branch
feature map
convolution
weighted
sub
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310429983.XA
Other languages
Chinese (zh)
Other versions
CN116797628A (en)
Inventor
金国栋
薛远亮
谭力宁
高晶
龙江雄
田思远
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Rocket Force University of Engineering of PLA
Original Assignee
Rocket Force University of Engineering of PLA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Rocket Force University of Engineering of PLA filed Critical Rocket Force University of Engineering of PLA
Priority to CN202310429983.XA priority Critical patent/CN116797628B/en
Publication of CN116797628A publication Critical patent/CN116797628A/en
Application granted granted Critical
Publication of CN116797628B publication Critical patent/CN116797628B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract


The present invention discloses a multi-scale unmanned aerial vehicle (UAV) aerial target tracking method and device in the field of image tracking. The method comprises: acquiring a UAV aerial video; inputting the initial frame and the current frame of the video into the template branch and the search branch of a twin tracking network constructed on the G-ResNet network; outputting three groups of first weighted feature maps and second weighted feature maps from the three convolution blocks layer2, layer3 and layer4 of the G-ResNet network; and performing weighted fusion of the three groups of first and second weighted feature maps using multiple anchor-free region proposal networks to obtain the target tracking result of the current frame. The method addresses the problem that existing UAV tracking algorithms cannot achieve a good balance between accuracy and speed.

Description

Multi-scale unmanned aerial vehicle aerial photographing target tracking method and device
Technical Field
The invention relates to the technical field of image tracking, in particular to a multi-scale unmanned aerial vehicle aerial photographing target tracking method and device.
Background
Object tracking estimates the state of a tracked object in each frame of a video sequence, with information about the object given only in the first frame. During UAV tracking of a target, video images are transmitted to a ground station for display through a data link system. An operator steers the stabilized UAV platform and camera system through control-stick and other commands to search for a reconnaissance target. When a target of interest appears in the picture, it is selected, and a ground computer extracts a set of features of the target as a template. The computer then confirms the position of the target of interest in subsequent images by computing the similarity between the template image and each subsequent image, thereby achieving continuous tracking of the target.
The main problems of the unmanned aerial vehicle tracking algorithm are divided into two aspects:
Tracking precision. Video shot by a UAV covers a large field of view over a wide range, so the imaged target is relatively small: it contains few pixels, its features are sparse and indistinct, and the frame carries a great deal of background information. Multiple similar objects often interfere with tracking, making it difficult for the algorithm to separate the target from the background and easy to lock onto a wrong target. Camera shake and changes in flight speed readily occur during UAV flight, causing motion blur and appearance changes that further impair the algorithm's ability to characterize and distinguish small targets. Moreover, a UAV is highly maneuverable; its flight generally has a high degree of freedom with few constraints, so fast motion and large scale changes occur easily. When the tracking algorithm's scale adaptability is insufficient, excessive background information is included and the target information is contaminated.
Tracking speed. The imaging equipment carried by a UAV may include visible-light, thermal-infrared and SAR sensors, so a single mission collects a large amount of data. Missions are usually executed by several cooperating UAVs, which further increases the amount of information to be processed, so the tracking algorithm must process large volumes of data in real time.
Traditional correlation-filter tracking algorithms are fast, but they represent the target with hand-crafted features, whose representation capability is insufficient, so tracking precision is hard to improve. Most twin tracking algorithms pursue tracking accuracy with a series of complicated operations and neglect the requirement on tracking speed; as a result, they fail to meet the speed requirement and are difficult to deploy on a UAV platform in real time. Existing UAV tracking algorithms therefore cannot reach a good balance between precision and speed.
Disclosure of Invention
The present invention aims to solve at least one of the technical problems existing in the prior art. Therefore, the first aspect of the present invention provides a multi-scale unmanned aerial vehicle aerial photographing target tracking method, which comprises:
Acquiring an unmanned aerial vehicle aerial video;
Inputting an initial frame and a current frame of the UAV aerial video into the template branch and the search branch of a twin tracking network constructed on the G-ResNet network, and outputting three groups of first weighted feature maps and second weighted feature maps from the three convolution blocks layer2, layer3 and layer4 of the G-ResNet network respectively, wherein the G-ResNet network is obtained by replacing the 3×3 convolution kernel of the residual module of each Bottleneck in the ResNet network with multiple parallel-stacked convolution layer groups of identical topology, and adding a dual multi-scale attention module after each Bottleneck;
Performing weighted fusion on the three groups of first weighted feature maps and second weighted feature maps using a plurality of anchor-free region proposal networks, and tracking the target of the current frame according to the predicted box and the predicted position in the weighted fusion result.
Further, replacing the 3×3 convolution kernel of the residual module of each Bottleneck in the ResNet network with multiple parallel-stacked convolution layer groups of identical topology comprises:
In layer1, the 64-channel 3×3 convolution kernel in the residual modules of the 3 Bottlenecks is divided, by group convolution, into 32 parallel-stacked groups of 3×3 convolution kernels with 4 channels each;
In layer2, the 128-channel 3×3 convolution kernels in the 4 Bottleneck residual modules are divided, by group convolution, into 32 parallel-stacked groups of 3×3 convolution kernels with 8 channels each;
In layer3, the 256-channel 3×3 convolution kernels in the 6 Bottleneck residual modules are divided, by group convolution, into 32 parallel-stacked groups of 3×3 convolution kernels with 16 channels each;
In layer4, the 512-channel 3×3 convolution kernels in the 3 Bottleneck residual modules are divided, by group convolution, into 32 parallel-stacked groups of 3×3 convolution kernels with 32 channels each.
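The grouped replacement above trades one wide 3×3 convolution for many narrow parallel ones. A minimal Python sketch (illustrative only, not from the patent) shows why grouping keeps the parameter count down while cardinality rises:

```python
def conv_params(c_in, c_out, k, groups=1):
    # A grouped 2-D convolution maps c_in/groups input channels to
    # c_out/groups output channels within each of `groups` groups.
    assert c_in % groups == 0 and c_out % groups == 0
    return groups * (c_in // groups) * (c_out // groups) * k * k

# A 64-channel 3x3 convolution versus the same width split into 32 groups.
standard = conv_params(64, 64, 3)            # 36864 parameters
grouped = conv_params(64, 64, 3, groups=32)  # 1/32 of the parameters
print(standard, grouped)
```

With the parameter budget freed by grouping, the network's width (cardinality) can be raised, which is the route described here to richer features without a deeper backbone.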
Further, outputting the first weighted feature map and the second weighted feature map from the three convolution blocks layer2, layer3 and layer4 of the G-ResNet network respectively comprises:
Extracting, through the dual multi-scale attention module, a first feature map and a second feature map output by the first Bottleneck of layer2, layer3 and layer4 of the template branch and of the search branch respectively;
Grouping the first feature map and the second feature map respectively to obtain a plurality of grouped feature maps corresponding to each;
Decomposing each grouped feature map into a first sub-feature map and a second sub-feature map;
Processing the first sub-feature map with the position attention module and the second sub-feature map with the channel attention module, obtaining a third sub-feature map carrying the position attention response and a fourth sub-feature map carrying the channel attention response;
Performing channel fusion on the third sub-feature map and the fourth sub-feature map to obtain a fifth sub-feature map corresponding to the grouped feature map;
Acquiring the plurality of fifth sub-feature maps corresponding to the plurality of grouped feature maps;
Shuffling the fifth sub-feature maps to obtain the weighted feature map output by the first Bottleneck of the template branch and of the search branch;
Propagating the weighted feature maps output by the first Bottleneck of the template branch and the search branch backward in sequence, and outputting the first weighted feature map and the second weighted feature map from the last Bottleneck of layer2, layer3 and layer4 respectively.
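The shuffling step above can be sketched with the usual reshape-transpose-reshape trick. This NumPy toy (an illustrative sketch, not the patent's implementation) interleaves channels so the grouped sub-feature maps exchange information:

```python
import numpy as np

def channel_shuffle(x, groups):
    """Shuffle channels across groups (reshape-transpose-flatten),
    so information mixes between the grouped sub-feature maps."""
    c, h, w = x.shape
    assert c % groups == 0
    x = x.reshape(groups, c // groups, h, w)
    x = x.transpose(1, 0, 2, 3)      # swap group and per-group axes
    return x.reshape(c, h, w)

# 4 channels in 2 groups: channels [0,1 | 2,3] interleave to [0,2,1,3]
x = np.arange(4).reshape(4, 1, 1).astype(float)
shuffled = channel_shuffle(x, groups=2)
print(shuffled.ravel())  # [0. 2. 1. 3.]
```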
Further, the expression of the position attention response is:
X'_k1 = σ(W1·IN(X_k1) + b1) ⊗ X_k1;
wherein X_k1 represents the first sub-feature map; IN(X_k1) represents the spatial information statistics of X_k1 completed using instance normalization; W1 and b1 are parameters used to strengthen the representation of IN(X_k1); σ is the sigmoid nonlinear activation function.
Further, the expression of the channel attention response is:
X'_k2 = σ(W2·F_gap(X_k2) + b2) ⊗ X_k2, with s = F_gap(X_k2) = (1/(H×W)) Σ_{i=1..H} Σ_{j=1..W} X_k2(i, j);
wherein H and W represent the height and width of the second sub-feature map X_k2, respectively; F_gap represents the global average pooling function; W2 and b2 perform the scaling and shifting operations on s; σ represents the sigmoid nonlinear activation function.
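The channel attention response above reduces each channel to a single statistic via global average pooling, then gates the channel with a sigmoid. A hedged NumPy sketch follows (scalar w2 and b2 are simplifications; in the network they are learnable parameters):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x, w2=1.0, b2=0.0):
    """Channel attention in the style of the expression above:
    global average pooling per channel, a scale/shift, sigmoid gating."""
    s = x.mean(axis=(1, 2), keepdims=True)   # F_gap: (C, 1, 1) statistics
    gate = sigmoid(w2 * s + b2)              # per-channel weights in (0, 1)
    return gate * x                          # reweight each channel

x = np.random.default_rng(0).normal(size=(8, 4, 4))
y = channel_attention(x)
assert y.shape == x.shape
```

Because the gate lies in (0, 1), each channel is attenuated in proportion to how uninformative its pooled statistic is.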
Further, performing weighted fusion on the three groups of first weighted feature maps and second weighted feature maps using a plurality of anchor-free region proposal networks comprises:
Arranging an RPN module with an anchor-free strategy between each of the three convolution blocks layer2, layer3 and layer4 of the template branch and the search branch of the G-ResNet network, wherein the anchor-free RPN module comprises a classification branch and a regression branch, and the regression branch predicts the offset between a target pixel point and the real box;
Inputting the first weighted feature map and the second weighted feature map respectively into the convolution networks of the regression branch and the classification branch of the anchor-free RPN module, each input yielding a regression map from the regression branch and a classification map from the classification branch;
Performing a depthwise cross-correlation operation on the two regression maps to obtain the regression result;
Performing a depthwise cross-correlation operation on the two classification maps to obtain the classification result;
Taking the position of the maximum value of the classification result as the predicted position of the target;
Obtaining, from the regression result, the predicted bounding box corresponding to the predicted position as the target prediction box.
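The last two steps, taking the argmax of the classification result and reading the box at that position from the regression result, can be sketched as follows. This is a toy NumPy illustration; the (l, t, r, b) offset layout is an assumption for demonstration, not stated in the patent:

```python
import numpy as np

def select_prediction(cls_map, reg_map):
    """cls_map: (H, W) per-pixel target scores (classification result).
    reg_map: (4, H, W) per-pixel offsets to the box sides (l, t, r, b)."""
    h, w = cls_map.shape
    y, x = divmod(int(np.argmax(cls_map)), w)  # position of maximum score
    l, t, r, b = reg_map[:, y, x]              # offsets at that position
    # Box as (x1, y1, x2, y2) in feature-map coordinates.
    return (x - l, y - t, x + r, y + b), (y, x)

cls_map = np.zeros((5, 5)); cls_map[2, 3] = 1.0
reg_map = np.ones((4, 5, 5))
box, pos = select_prediction(cls_map, reg_map)
print(pos, box)
```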
The invention also provides a multi-scale unmanned aerial vehicle aerial photographing target tracking device, which comprises:
The acquisition module is used for acquiring the aerial video of the unmanned aerial vehicle;
The processing module is used for inputting an initial frame and a current frame of the UAV aerial video into the template branch and the search branch of a twin tracking network constructed on the G-ResNet network, and outputting three groups of first weighted feature maps and second weighted feature maps from the three convolution blocks layer2, layer3 and layer4 of the G-ResNet network respectively, wherein the G-ResNet network is obtained by replacing the 3×3 convolution kernel of the residual module of each Bottleneck in the ResNet network with multiple parallel-stacked convolution layer groups of identical topology and adding a dual multi-scale attention module after each Bottleneck;
The output module is used for performing weighted fusion on the three groups of first weighted feature maps and second weighted feature maps using a plurality of anchor-free region proposal networks, and tracking the target of the current frame according to the predicted box and the predicted position in the weighted fusion result.
The invention also provides an electronic device comprising a processor and a memory, wherein at least one instruction, at least one program, code set or instruction set is stored in the memory, and the at least one instruction, the at least one program, code set or instruction set is loaded and executed by the processor to implement the multi-scale unmanned aerial vehicle aerial photographing target tracking method according to any one of the first aspect.
The present invention also provides a computer readable storage medium having stored therein at least one instruction, at least one program, code set or instruction set, the at least one instruction, at least one program, code set or instruction set being loaded and executed by a processor to implement the multi-scale unmanned aerial vehicle aerial target tracking method according to any of the first aspects.
The embodiment of the invention provides a multi-scale unmanned aerial vehicle aerial photographing target tracking method and device, which have the following beneficial effects compared with the prior art:
1) Using the split-transform-merge subspace learning idea, a grouped residual network G-ResNet is designed. It extracts deep semantic features and diversified features of the target, effectively copes with challenges such as target appearance change and motion blur, and enhances the representation capability for small targets.
2) A dual multi-scale attention module DMSAM is designed. Feature maps are grouped to extract target feature information at different scales; dual attention then extracts the local features of targets in the spatial and channel dimensions respectively and establishes the global dependence between target and background; finally, information exchange between different channels is established, enhancing the scale adaptation and anti-interference capability of the invention.
3) A region proposal module AF-RPN based on an anchor-free strategy is provided to replace predefined anchor boxes, distinguishing target from background pixel by pixel and achieving adaptive perception of the target scale. Multiple AF-RPNs are cascaded on G-ResNet, so that complementary detail and semantic information are effectively used for robust tracking and accurate positioning of the tracked target. Meanwhile, the speed reaches 40.5 FPS, meeting the real-time requirement.
Drawings
In order to more clearly illustrate the technical solution of the present invention, the following description will make a brief introduction to the drawings used in the description of the embodiments or the prior art. It should be apparent that the drawings in the following description are only some embodiments of the present invention, and that other drawings can be obtained from these drawings without inventive effort to those of ordinary skill in the art.
Fig. 1 is a flowchart of an aerial target tracking method of a multi-scale unmanned aerial vehicle provided by an embodiment of the present invention;
Fig. 2 is a network model diagram of an unmanned aerial vehicle target tracking method based on a dual multi-scale attention module;
Fig. 3 is a layer1 replacement example diagram of a multi-scale unmanned aerial vehicle aerial target tracking method according to an embodiment of the present invention;
fig. 4 is a DMSAM schematic diagram of an aerial target tracking method of a multi-scale unmanned aerial vehicle according to an embodiment of the present invention;
fig. 5 is a schematic shuffle diagram of a multi-scale unmanned aerial vehicle aerial target tracking method according to an embodiment of the present invention;
FIG. 6 is an AF-RPN schematic diagram of an aerial target tracking method for a multi-scale unmanned aerial vehicle according to an embodiment of the present invention;
fig. 7 is a block diagram of an aerial target tracking device of a multi-scale unmanned aerial vehicle according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The present specification provides method operational steps as described in the examples or flowcharts, but may include more or fewer operational steps based on conventional or non-inventive labor. When implemented in a real system or server product, the methods illustrated in the embodiments or figures may be performed sequentially or in parallel (e.g., in a parallel processor or multithreaded environment).
At present, target tracking algorithms mainly divide into tracking algorithms based on correlation filtering and tracking algorithms based on deep learning. The correlation-filter tracking algorithm uses a correlation filter from the signal processing field to calculate the similarity between the template and the search image, and uses the Fourier transform to accelerate computation in the frequency domain, greatly reducing the amount of computation and raising the speed to hundreds of frames per second. However, most correlation filtering algorithms represent the tracked target with traditional hand-crafted feature extraction, so robustness and accuracy are insufficient and target tracking tasks in complex scenes cannot be handled effectively.
Due to its great potential in both precision and speed, the twin tracking algorithm has gradually become the mainstream in the field of target tracking, and most subsequent tracking algorithms are built on the twin structure. The working principle of the twin tracking algorithm can be expressed as formula (1); it mainly consists of a feature extraction part φ, a similarity calculation part (★) and a tracking result generation part.
f(z, x) = φ(z) ★ φ(x) + b·𝟙    (1)
wherein f(z, x) is the similarity response map; φ is the feature extraction part; ★ is the cross-correlation operation; b is the bias at each position; 𝟙 denotes a matrix taking the value 1 at every position.
1) Feature extraction part: features are extracted with a twin neural network whose two branches are the template branch and the search branch. The template branch takes the target image z of the initial frame as the template and outputs the template feature map φ(z); the search branch takes the search image x of a subsequent frame and outputs the search feature map φ(x).
2) Similarity calculation part (★): integrates the feature information of the feature maps of the two branches, calculates the similarity between the search feature map and the template feature map, and generates the similarity response map f(z, x).
3) Tracking result generation part: predicts the target position on the search image from the obtained response map; the position of maximum response is generally taken as the predicted target position, followed by target scale estimation and bounding box regression.
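The response-map idea of formula (1) can be illustrated with a naive single-channel cross-correlation. This is an illustrative NumPy toy; real twin trackers correlate multi-channel deep features:

```python
import numpy as np

def xcorr(template, search, b=0.0):
    """Naive single-channel cross-correlation: slide the template over
    the search feature map and sum elementwise products (plus bias b),
    a minimal stand-in for f(z, x) = phi(z) * phi(x) + b."""
    th, tw = template.shape
    sh, sw = search.shape
    out = np.empty((sh - th + 1, sw - tw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(template * search[i:i+th, j:j+tw]) + b
    return out

search = np.zeros((6, 6)); search[3:5, 2:4] = 1.0   # "target" patch
template = np.ones((2, 2))
resp = xcorr(template, search)
peak = tuple(int(v) for v in np.unravel_index(np.argmax(resp), resp.shape))
print(peak)  # top-left of the best match: (3, 2)
```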
The process of online tracking by the twin tracking algorithm mainly comprises the following steps:
1) Input the video sequence frame by frame into the feature extraction part;
2) If the frame is the first frame, the template branch extracts the target features as the template features;
3) If the frame is not the first frame, the search branch extracts the features of the current frame as the search features;
4) The similarity calculation part calculates the similarity between the feature maps and generates a response map;
5) The tracking result generation part predicts the target position in the current frame using the similarity response map;
6) Repeat steps 3-5 until the last frame of the video sequence.
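The online tracking loop above can be sketched as a small skeleton. This is illustrative only; `extract`, `similarity` and `locate` are hypothetical stand-ins for the network parts, not the patent's code:

```python
def track(frames, extract, similarity, locate):
    """Skeleton of the online twin-tracking loop: first frame builds
    the template, later frames are searched against it."""
    template = None
    results = []
    for i, frame in enumerate(frames):
        if i == 0:
            template = extract(frame)       # step 2: template features
            continue
        search = extract(frame)             # step 3: search features
        response = similarity(template, search)  # step 4: response map
        results.append(locate(response))    # step 5: predicted position
    return results

# Toy run with trivial stand-ins for the three stages.
out = track([1, 2, 3],
            extract=lambda f: f,
            similarity=lambda t, s: t * s,
            locate=lambda r: r)
print(out)  # responses for frames 2 and 3
```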
Fig. 1 is a flowchart of a multi-scale unmanned aerial vehicle aerial target tracking method provided by an embodiment of the present invention, where, as shown in fig. 1, the method includes:
Step 101, acquiring an unmanned aerial vehicle aerial video;
Step 102, inputting the initial frame and the current frame of the UAV aerial video into the template branch and the search branch of a twin tracking network constructed on the G-ResNet network, and outputting three groups of first weighted feature maps and second weighted feature maps from layer2, layer3 and layer4 of the G-ResNet network respectively, wherein the G-ResNet network is obtained by replacing the 3×3 convolution kernel of the residual module of each Bottleneck in the ResNet network with multiple parallel-stacked convolution layer groups of identical topology and adding a dual multi-scale attention module after each Bottleneck;
And step 103, performing weighted fusion on the three groups of first weighted feature maps and second weighted feature maps using a plurality of anchor-free region proposal networks, and tracking the target of the current frame according to the predicted box and the predicted position in the weighted fusion result.
Fig. 2 is a network model diagram of the unmanned aerial vehicle target tracking method based on the dual multi-scale attention module. As shown in fig. 2, first, a grouped residual network (Group Residual Network, G-ResNet) is designed: convolution blocks with the same topology are stacked in parallel to extract diversified features of the target, enhancing the characterization of the tracked target without increasing network depth. Second, to better screen features, a dual multi-scale attention module (Dual Multi Scale Attention Module, DMSAM) is used to extract multi-scale feature information of the target, suppressing interference information in both the channel and spatial dimensions. In the final tracking-box generation stage, several anchor-free region proposal networks (Anchor Free Region Proposal Network, AF-RPN) adaptively sense the scale change of the target, effectively solving the scale-change problem. Experiments show that the method copes more effectively with problems such as scale change, small targets, motion blur and partial occlusion, improves the tracking of aerial targets, reaches a speed of 40.5 FPS and meets the real-time requirement.
In one possible embodiment, replacing the 3×3 convolution kernel of the residual module of each Bottleneck in the ResNet network with multiple parallel-stacked convolution layer groups of identical topology comprises:
In layer1, the 64-channel 3×3 convolution kernel in the residual modules of the 3 Bottlenecks is divided, by group convolution, into 32 parallel-stacked groups of 3×3 convolution kernels with 4 channels each;
In layer2, the 128-channel 3×3 convolution kernels in the 4 Bottleneck residual modules are divided, by group convolution, into 32 parallel-stacked groups of 3×3 convolution kernels with 8 channels each;
In layer3, the 256-channel 3×3 convolution kernels in the 6 Bottleneck residual modules are divided, by group convolution, into 32 parallel-stacked groups of 3×3 convolution kernels with 16 channels each;
In layer4, the 512-channel 3×3 convolution kernels in the 3 Bottleneck residual modules are divided, by group convolution, into 32 parallel-stacked groups of 3×3 convolution kernels with 32 channels each.
In the embodiment provided by the invention, the cardinality is increased on ResNet-50, which has a deeper number of network layers, to improve network performance. Increasing the cardinality of the network raises its feature description capability more effectively than increasing the number of network layers, while not increasing the number of network parameters. Based on the split-transform-merge design concept, and considering that the 3×3 convolution in the residual block is the main extraction part of the feature information, the 3×3 convolution in the residual block is replaced by multiple parallel-stacked convolutions of the same topology; fig. 3 shows the replacement for layer1. In ordinary convolution, one channel of the output feature map requires all channels of the input feature map to participate in the calculation. In the parallel stacking operation, group convolution divides the 64-channel 3×3 convolution into 32 groups of 4-channel 3×3 convolutions. Different convolution groups can be regarded as different subspaces, and the feature information learned by each subspace differs in emphasis, i.e., diversified feature information of the target is extracted.
In one possible implementation, outputting three groups of first and second weighted feature maps from the three convolution blocks layer2, layer3 and layer4 of the G-ResNet network respectively comprises:
Extracting, through the dual multi-scale attention module, a first feature map and a second feature map output by the first Bottleneck of layer2, layer3 and layer4 of the template branch and of the search branch respectively;
Grouping the first feature map and the second feature map respectively to obtain a plurality of grouped feature maps corresponding to each;
Decomposing each grouped feature map into a first sub-feature map and a second sub-feature map;
Processing the first sub-feature map with the position attention module and the second sub-feature map with the channel attention module, obtaining a third sub-feature map carrying the position attention response and a fourth sub-feature map carrying the channel attention response;
Performing channel fusion on the third sub-feature map and the fourth sub-feature map to obtain a fifth sub-feature map corresponding to the grouped feature map;
Acquiring the plurality of fifth sub-feature maps corresponding to the plurality of grouped feature maps;
Shuffling the fifth sub-feature maps to obtain the weighted feature map output by the first Bottleneck of the template branch and of the search branch;
Propagating the weighted feature maps output by the first Bottleneck of the template branch and the search branch backward in sequence, and outputting the first weighted feature map and the second weighted feature map from the last Bottleneck of layer2, layer3 and layer4 respectively.
In the embodiment provided by the invention, the attention module can adaptively allocate weights and selectively screen the feature map information, so that the network is helped to pay attention to the interested target better, and the defect of G-ResNet can be effectively overcome. Thus, to enhance the discrimination capabilities of the present invention, a dual multiscale attention module (DMSAM) was introduced on G-ResNet. As shown in FIG. 4, in order for the network to learn the feature information of different scales, the DMSAM firstly extracts and groups the features of various scales, then uses the position and channel attention module in parallel to capture local features and global dependencies adaptively, and finally fuses and shuffles the feature graphs of all channels to strengthen the information exchange among different channels.
First, assume that the input feature map is X ∈ R^(C×H×W), where C, H and W represent the number of channels, the height and the width of the feature map, respectively. To reduce the computational cost, X is divided along the channel dimension into G groups of sub-feature maps X_k ∈ R^((C/G)×H×W), k = 1, 2, …, G. Because the sub-feature maps are divided by channel, each sub-feature map can capture specific semantic information during training. Each X_k is further divided into two parts, X_k1 and X_k2: X_k2 uses channel attention to capture the interrelationship between channels, and X_k1 uses position attention to find the spatial relationship between features. Thus, through the weight allocation of the attention modules, the network knows better what to attend to (what) and where it is meaningful to attend (where).
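As a rough illustration of the grouping step just described, the following NumPy sketch splits a feature map X of shape (C, H, W) into G channel groups and halves each group into the two attention inputs X_k1 and X_k2. The function and variable names (`split_groups`, `g`, etc.) are ours, not the patent's.

```python
import numpy as np

def split_groups(x, g):
    # Divide the C channels into g groups, then halve each group
    # into the position-attention and channel-attention inputs.
    c, h, w = x.shape
    groups = x.reshape(g, c // g, h, w)        # G sub-feature maps X_k
    half = (c // g) // 2
    return groups[:, :half], groups[:, half:]  # X_k1, X_k2

x = np.random.rand(64, 8, 8)       # toy feature map, C=64
xk1, xk2 = split_groups(x, 8)      # G=8 groups of 8 channels each
print(xk1.shape, xk2.shape)        # (8, 4, 8, 8) (8, 4, 8, 8)
```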
In one possible embodiment, the expression of the position attention response includes:

X'_k1 = σ(W1 · IN(X_k1) + b1) · X_k1 (2)

wherein X_k1 represents the first sub-feature map, IN(X_k1) represents the spatial information statistics of X_k1 completed using instance normalization, W1 and b1 are the weight and bias used to strengthen IN(X_k1), and σ is the sigmoid nonlinear activation function.
In the embodiment provided by the invention, objects similar to the tracked target are often present during UAV tracking, so the feature map contains feature information of the tracked target as well as feature information of the similar objects. Position attention is intended to enhance the discrimination of similar objects and to give a greater degree of attention to the position of the target. The invention uses instance normalization (Instance Normalization, IN) to complete the spatial information statistics of X_k1; the position attention response X'_k1 is then obtained from formula (3):

X'_k1 = σ(W1 · IN(X_k1) + b1) · X_k1 (3)

wherein W1 and b1 strengthen the representation capability of IN(X_k1). By designing a weight for each position of the feature map, the position attention response effectively suppresses the interference of similar objects and makes clear to the network which positions (where) on the image to focus on.
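Under our reading of the position attention formula, a minimal NumPy sketch looks as follows. W1 and b1 stand in for trained parameters (scalars here for simplicity); in the real network they would be learned per channel.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def position_attention(xk1, w1, b1, eps=1e-5):
    # IN(X_k1): instance normalization over the spatial dimensions
    mu = xk1.mean(axis=(-2, -1), keepdims=True)
    var = xk1.var(axis=(-2, -1), keepdims=True)
    xin = (xk1 - mu) / np.sqrt(var + eps)
    # scale/shift, squash with sigmoid, and reweight the input
    return sigmoid(w1 * xin + b1) * xk1

xk1 = np.random.rand(4, 8, 8)                  # toy X_k1
out = position_attention(xk1, w1=1.0, b1=0.0)  # placeholder parameters
print(out.shape)                               # (4, 8, 8)
```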
In one possible implementation, the channel attention response expression includes:

s = F_gap(X_k2) = (1/(H×W)) Σ_{i=1..H} Σ_{j=1..W} X_k2(i, j) (4)

X'_k2 = σ(W2 · s + b2) · X_k2 (5)

wherein H and W represent the height and width of the second sub-feature map, respectively, X_k2 represents the second sub-feature map, F_gap represents the global average pooling function, W2 and b2 perform the scaling and shifting operations on s, and σ represents the sigmoid nonlinear activation function.
Different channels on the feature map of a deep network represent different semantic information, so the process by which channel attention allocates weights can be seen as a process of selecting semantic attributes for the different channels. The invention uses global average pooling (GAP) to compress the feature layer of X_k2 on each channel and obtain the result s:

s = F_gap(X_k2) = (1/(H×W)) Σ_{i=1..H} Σ_{j=1..W} X_k2(i, j) (6)

To learn the nonlinear relationship between channels, s is then passed through the sigmoid nonlinear activation function σ to obtain weight coefficients that adaptively guide the network in selecting appropriate feature maps; the channel attention response X'_k2 is obtained from formula (7):

X'_k2 = σ(W2 · s + b2) · X_k2 (7)

wherein W2 and b2 perform the scaling and shifting operations on s. Weights are allocated to the feature maps according to their different semantic information, with the channel where the target is located receiving the largest weight. In the cross-correlation operation, the responses on the other channels are suppressed, which makes clear to the network which class (what) it should pay attention to.
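The GAP-based channel attention just described can be sketched in a few lines of NumPy. As above, W2 and b2 are placeholder scalars for what would be trained parameters.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(xk2, w2, b2):
    # F_gap: compress each channel to a scalar via global average pooling
    s = xk2.mean(axis=(-2, -1), keepdims=True)  # shape (C, 1, 1)
    # scale/shift + sigmoid turns s into per-channel weights
    return sigmoid(w2 * s + b2) * xk2

xk2 = np.random.rand(4, 8, 8)                # toy X_k2
out = channel_attention(xk2, w2=1.0, b2=0.0) # placeholder parameters
print(out.shape)                             # (4, 8, 8)
```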
Before shuffling, the attention responses X'_k1 and X'_k2 are connected to obtain a new sub-feature map X'_k, and all the new sub-feature maps are stacked by channel and combined into the feature map X', as shown in formula (8). The operation of formula (9) then applies channel shuffling (channel_shuffle), as shown in FIG. 5: X' is first unfolded into a four-dimensional matrix of shape (G, C/G, H, W); keeping the H and W dimensions unchanged, the G and C/G dimensions are transposed, and the matrix dimensions are then compressed to obtain the output feature map. The shuffling operation can effectively integrate the feature information on each channel and strengthen the information exchange between channels.

X' = concat(X'_1, X'_2, …, X'_G) (8)

X_out = channel_shuffle(X') = reshape(transpose_{G↔C/G}(reshape(X', (G, C/G, H, W)))) (9)
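The channel shuffle operation described above (expand, transpose the two channel dimensions, compress) can be written directly with NumPy reshapes; the function name is ours. With G=2, channels 0–7 are interleaved into the order 0,4,1,5,2,6,3,7, which is what routes information between groups.

```python
import numpy as np

def channel_shuffle(x, g):
    # (C,H,W) -> (G, C/G, H, W) -> transpose group axes -> (C,H,W)
    c, h, w = x.shape
    return (x.reshape(g, c // g, h, w)
             .transpose(1, 0, 2, 3)
             .reshape(c, h, w))

# each original channel k is filled with values k*4 .. k*4+3,
# so ch[0,0] // 4 recovers the original channel index
x = np.arange(8 * 2 * 2, dtype=float).reshape(8, 2, 2)
y = channel_shuffle(x, 2)
print([int(ch[0, 0] // 4) for ch in y])  # [0, 4, 1, 5, 2, 6, 3, 7]
```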
In the DMSAM, target feature information of different scales is first extracted from the grouped feature maps; the dual attention then extracts local features of the target in the channel and spatial dimensions and establishes the global dependency between the target and the background; finally, information exchange is established between the different channels, increasing the difference between the target and interference information and improving the scale adaptability and discrimination capability of the invention.
In one possible implementation, using a plurality of anchor-free region proposal networks to perform weighted fusion on the three groups of first weighted feature maps and second weighted feature maps includes:

An RPN module with an anchor-free strategy is arranged between each of the three convolution blocks layer2, layer3 and layer4 of the template branch and the search branch of the G-ResNet network; the anchor-free RPN module comprises a classification branch and a regression branch, and the regression branch predicts the offset between a target pixel point and the real frame;

respectively inputting the first weighted feature map and the second weighted feature map into the convolutional networks in the regression branch and the classification branch of the anchor-free RPN module, outputting a regression map and a classification map from the regression branch, and outputting a regression map and a classification map from the classification branch;
Performing deep cross-correlation operation on the two regression graphs output by the classification branch and the regression branch to obtain a regression result;
performing deep cross-correlation operation on the two classification graphs output by the classification branch and the regression branch to obtain a classification result;
acquiring the position of the maximum value of the classification result as the predicted position of the target;
and obtaining a prediction boundary frame corresponding to the prediction position from the regression result as a target prediction frame.
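The steps above end by reading a predicted bounding box out of the regression result. As a hypothetical illustration of how such offsets decode into an image-space box: at the predicted cell (i, j) the regression branch gives the distances from that pixel to the four sides of the real frame (named l, t, r, b here). The stride value and all names are our assumptions, not the patent's.

```python
def decode_box(point, offsets, stride=8):
    # map the response-map cell (i, j) back to image coordinates,
    # then expand by the four side offsets to get (x1, y1, x2, y2)
    i, j = point
    l, t, r, b = offsets
    cx, cy = j * stride, i * stride
    return (cx - l, cy - t, cx + r, cy + b)

box = decode_box((10, 12), (16.0, 12.0, 16.0, 12.0))
print(box)  # (80.0, 68.0, 112.0, 92.0)
```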
In one possible implementation manner, a set of anchor boxes of different scales is predefined in the RPN module to perform scale estimation. The prior information of the anchor boxes is obtained by analyzing the video; such prior information runs counter to the starting point of the tracking task, and tracking performance is sensitive to the anchor parameters, which must be set manually and carefully. Therefore, to get rid of excessive dependence on target prior information, the adaptive estimation of the target scale is completed in the RPN module using an anchor-free strategy. In the RPN module based on the anchor-free strategy (AF-RPN), the bounding-box regression branch no longer regresses the size of the anchor (length, width, center point position) but predicts the offsets l, t, b, r between the target pixel point and the real frame (ground-truth). In the anchor-based approach, whether the target in an anchor is a positive sample is judged by calculating the area intersection over union (Intersection over Union, IoU) of the anchor and the real frame; the anchor-free strategy therefore requires a new positive and negative sample discrimination method: the pixel points of the similarity response map are mapped back into the search image, those falling outside ellipse E1 are negative samples, and those falling inside ellipse E2 are positive samples, as shown in FIG. 6.
[S, R] = φ(Z) ⋆ φ(X) (10)

wherein S and R denote the classification result and the regression result, obtained by the depth cross-correlation (⋆) of the template-branch and search-branch feature maps extracted by the feature extraction network φ; w, h and c are the width, height and number of channels of the feature map.
The maximum value is then found on the classification result S; its position is taken as the predicted position of the target, and the corresponding position in the regression result holds a predicted bounding box, which is used as the prediction box of the target.
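To make the depth cross-correlation and the argmax step concrete, the following NumPy sketch implements depthwise cross-correlation with a direct sliding-window loop: each channel of the template kernel z slides over the matching channel of the search feature x. Shapes are toy values and all names are ours, not the tracker's real configuration.

```python
import numpy as np

def depthwise_xcorr(x, z):
    # valid-mode, per-channel cross-correlation of z over x
    c, hx, wx = x.shape
    _, hz, wz = z.shape
    ho, wo = hx - hz + 1, wx - wz + 1
    out = np.zeros((c, ho, wo))
    for k in range(c):
        for i in range(ho):
            for j in range(wo):
                out[k, i, j] = np.sum(x[k, i:i+hz, j:j+wz] * z[k])
    return out

x = np.random.rand(4, 8, 8)   # search-branch features
z = np.random.rand(4, 3, 3)   # template-branch features
s = depthwise_xcorr(x, z)
# the predicted position is the argmax over the (channel-summed) response map
pos = np.unravel_index(np.argmax(s.sum(axis=0)), s.shape[1:])
print(s.shape, pos)
```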
The invention also provides a multi-scale unmanned aerial vehicle aerial photographing target tracking device 200, as shown in fig. 7, comprising:
an acquisition module 201, configured to acquire an aerial video of the unmanned aerial vehicle;
The processing module 202 is configured to input the initial frame and the current frame of the UAV aerial video into the template branch and the search branch of a twin tracking network constructed based on the G-ResNet network, and to output three groups of first weighted feature maps and second weighted feature maps from the three convolution blocks layer2, layer3 and layer4 of the G-ResNet network, respectively, where the G-ResNet network is obtained by stacking in parallel a plurality of convolutional layer groups of the same topological structure to replace the 3×3 convolution kernel of the residual module of each Bottleneck in the ResNet50 network, and adding a dual multi-scale attention module after each Bottleneck;

and the output module 203 is configured to perform weighted fusion on the three groups of first weighted feature maps and second weighted feature maps using a plurality of anchor-free region proposal networks, and to track the target of the current frame according to the prediction box and the predicted position in the weighted fusion result.
In yet another embodiment of the present invention, there is further provided an apparatus, including a processor and a memory, where the memory stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by the processor to implement the multi-scale unmanned aerial vehicle aerial target tracking method described in the embodiments of the present invention.
In yet another embodiment of the present invention, a computer readable storage medium is provided, where at least one instruction, at least one section of program, a code set, or an instruction set is stored, where the at least one instruction, the at least one section of program, the code set, or the instruction set is loaded and executed by a processor to implement the multi-scale unmanned aerial vehicle aerial target tracking method described in the embodiments of the present invention.
In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes a plurality of computer instructions. When loaded and executed on a computer, produces a flow or function in accordance with embodiments of the present invention, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in or transmitted from one computer-readable storage medium to another, for example, by wired (e.g., coaxial cable, optical fiber, digital Subscriber Line (DSL)), or wireless (e.g., infrared, wireless, microwave, etc.). The computer readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that contains an integration of a plurality of available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk Solid STATE DISK (SSD)), etc.
It is noted that relational terms such as first and second are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises", "comprising", or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a(n) ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
In this specification, each embodiment is described in a related manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for system embodiments, since they are substantially similar to method embodiments, the description is relatively simple, as relevant to see a section of the description of method embodiments.
The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention are included in the protection scope of the present invention.

Claims (6)

1. A multi-scale UAV aerial photography target tracking method, characterized by comprising:
acquiring a UAV aerial video;
inputting the initial frame and the current frame of the UAV aerial video into the template branch and the search branch of a twin tracking network constructed based on the G-ResNet network, and outputting three groups of first weighted feature maps and second weighted feature maps from the three convolution blocks layer2, layer3 and layer4 of the G-ResNet network respectively, wherein the G-ResNet network is obtained by stacking in parallel a plurality of convolutional layer groups of the same topological structure to replace the 3×3 convolution kernel of the residual module of each Bottleneck in the ResNet50 network, and adding a dual multi-scale attention module after each Bottleneck;
using a plurality of anchor-free region proposal networks to perform weighted fusion on the three groups of first weighted feature maps and second weighted feature maps, and tracking the target of the current frame according to the predicted box and the predicted position in the weighted fusion result;
wherein replacing the 3×3 convolution kernel of the residual module of each Bottleneck in the ResNet50 network by stacking in parallel a plurality of convolutional layer groups of the same topological structure comprises:
in layer1, dividing, by group convolution, the 3×3 convolution kernels with 64 channels in the 3 Bottleneck residual modules into 32 parallel-stacked convolutional layer groups with 4 channels and a kernel size of 3×3;
in layer2, dividing, by group convolution, the 3×3 convolution kernels with 128 channels in the 4 Bottleneck residual modules into 32 parallel-stacked convolutional layer groups with 8 channels and a kernel size of 3×3;
in layer3, dividing, by group convolution, the 3×3 convolution kernels with 256 channels in the 6 Bottleneck residual modules into 32 parallel-stacked convolutional layer groups with 16 channels and a kernel size of 3×3;
in layer4, dividing, by group convolution, the 3×3 convolution kernels with 512 channels in the 3 Bottleneck residual modules into 32 parallel-stacked convolutional layer groups with 32 channels and a kernel size of 3×3;
wherein outputting three groups of first weighted feature maps and second weighted feature maps from the three convolution blocks layer2, layer3 and layer4 of the G-ResNet network respectively comprises:
extracting, through the dual multi-scale attention module, the first feature map and the second feature map output by the first Bottleneck in layer2, layer3 and layer4 of the template branch and the search branch respectively;
grouping the first feature map and the second feature map respectively to obtain a plurality of grouped feature maps corresponding to the first feature map and the second feature map respectively;
decomposing each grouped feature map into a first sub-feature map and a second sub-feature map;
processing the first sub-feature map and the second sub-feature map with the position attention module and the channel attention module respectively, to obtain a third sub-feature map with a position attention response and a fourth sub-feature map with a channel attention response;
performing channel fusion on the third sub-feature map and the fourth sub-feature map to obtain a fifth sub-feature map corresponding to the grouped feature map;
obtaining a plurality of fifth sub-feature maps corresponding to the plurality of grouped feature maps;
shuffling the plurality of fifth sub-feature maps to obtain the weighted feature maps output by the first Bottleneck of the template branch and of the search branch;
propagating the weighted feature maps output by the first Bottleneck of the template branch and the search branch backward in sequence, and outputting the first weighted feature map and the second weighted feature map from the last Bottleneck of layer2, layer3 and layer4 respectively;
wherein using a plurality of anchor-free region proposal networks to perform weighted fusion on the three groups of first weighted feature maps and second weighted feature maps comprises:
providing an RPN module with an anchor-free strategy between each of the layer2, layer3 and layer4 convolution blocks of the template branch and the search branch of the G-ResNet network, the anchor-free RPN module comprising a classification branch and a regression branch, the regression branch being used to predict the offset between a target pixel point and the real frame;
inputting the first weighted feature map and the second weighted feature map into the convolutional networks in the regression branch and the classification branch of the anchor-free RPN module respectively, outputting a regression map and a classification map from the regression branch, and outputting a regression map and a classification map from the classification branch;
performing a depth cross-correlation operation on the two regression maps output by the classification branch and the regression branch to obtain a regression result;
performing a depth cross-correlation operation on the two classification maps output by the classification branch and the regression branch to obtain a classification result;
taking the position of the maximum value of the classification result as the predicted position of the target;
obtaining, from the regression result, the predicted bounding box corresponding to the predicted position as the prediction box of the target.

2. The multi-scale UAV aerial photography target tracking method according to claim 1, wherein the expression of the position attention response comprises:
X'_k1 = σ(W1 · IN(X_k1) + b1) · X_k1;
wherein X_k1 represents the first sub-feature map, IN(X_k1) represents the spatial information statistics of X_k1 completed using instance normalization, W1 and b1 respectively represent the weight and bias used to strengthen IN(X_k1), and σ represents the sigmoid nonlinear activation function.

3. The multi-scale UAV aerial photography target tracking method according to claim 1, wherein the channel attention response expression comprises:
s = F_gap(X_k2) = (1/(H×W)) Σ_{i=1..H} Σ_{j=1..W} X_k2(i, j);
X'_k2 = σ(W2 · s + b2) · X_k2;
wherein H and W represent the height and width of the second sub-feature map respectively, X_k2 represents the second sub-feature map, F_gap represents the global average pooling function, W2 and b2 perform the scaling and shifting operations on s, and σ represents the sigmoid nonlinear activation function.

4. A multi-scale UAV aerial photography target tracking device, characterized by comprising:
an acquisition module, configured to acquire a UAV aerial video;
a processing module, configured to input the initial frame and the current frame of the UAV aerial video into the template branch and the search branch of a twin tracking network constructed based on the G-ResNet network, and to output three groups of first weighted feature maps and second weighted feature maps from the three convolution blocks layer2, layer3 and layer4 of the G-ResNet network respectively, wherein the G-ResNet network is obtained by stacking in parallel a plurality of convolutional layer groups of the same topological structure to replace the 3×3 convolution kernel of the residual module of each Bottleneck in the ResNet50 network, and adding a dual multi-scale attention module after each Bottleneck;
an output module, configured to use a plurality of anchor-free region proposal networks to perform weighted fusion on the three groups of first weighted feature maps and second weighted feature maps, and to track the target of the current frame according to the predicted box and the predicted position in the weighted fusion result;
wherein replacing the 3×3 convolution kernel of the residual module of each Bottleneck in the ResNet50 network by stacking in parallel a plurality of convolutional layer groups of the same topological structure comprises:
in layer1, dividing, by group convolution, the 3×3 convolution kernels with 64 channels in the 3 Bottleneck residual modules into 32 parallel-stacked convolutional layer groups with 4 channels and a kernel size of 3×3;
in layer2, dividing, by group convolution, the 3×3 convolution kernels with 128 channels in the 4 Bottleneck residual modules into 32 parallel-stacked convolutional layer groups with 8 channels and a kernel size of 3×3;
in layer3, dividing, by group convolution, the 3×3 convolution kernels with 256 channels in the 6 Bottleneck residual modules into 32 parallel-stacked convolutional layer groups with 16 channels and a kernel size of 3×3;
in layer4, dividing, by group convolution, the 3×3 convolution kernels with 512 channels in the 3 Bottleneck residual modules into 32 parallel-stacked convolutional layer groups with 32 channels and a kernel size of 3×3;
wherein outputting three groups of first weighted feature maps and second weighted feature maps from the three convolution blocks layer2, layer3 and layer4 of the G-ResNet network respectively comprises:
extracting, through the dual multi-scale attention module, the first feature map and the second feature map output by the first Bottleneck in layer2, layer3 and layer4 of the template branch and the search branch respectively;
grouping the first feature map and the second feature map respectively to obtain a plurality of grouped feature maps corresponding to the first feature map and the second feature map respectively;
decomposing each grouped feature map into a first sub-feature map and a second sub-feature map;
processing the first sub-feature map and the second sub-feature map with the position attention module and the channel attention module respectively, to obtain a third sub-feature map with a position attention response and a fourth sub-feature map with a channel attention response;
performing channel fusion on the third sub-feature map and the fourth sub-feature map to obtain a fifth sub-feature map corresponding to the grouped feature map;
obtaining a plurality of fifth sub-feature maps corresponding to the plurality of grouped feature maps;
shuffling the plurality of fifth sub-feature maps to obtain the weighted feature maps output by the first Bottleneck of the template branch and of the search branch;
propagating the weighted feature maps output by the first Bottleneck of the template branch and the search branch backward in sequence, and outputting the first weighted feature map and the second weighted feature map from the last Bottleneck of layer2, layer3 and layer4 respectively;
wherein using a plurality of anchor-free region proposal networks to perform weighted fusion on the three groups of first weighted feature maps and second weighted feature maps comprises:
providing an RPN module with an anchor-free strategy between each of the layer2, layer3 and layer4 convolution blocks of the template branch and the search branch of the G-ResNet network, the anchor-free RPN module comprising a classification branch and a regression branch, the regression branch being used to predict the offset between a target pixel point and the real frame;
inputting the first weighted feature map and the second weighted feature map into the convolutional networks in the regression branch and the classification branch of the anchor-free RPN module respectively, outputting a regression map and a classification map from the regression branch, and outputting a regression map and a classification map from the classification branch;
performing a depth cross-correlation operation on the two regression maps output by the classification branch and the regression branch to obtain a regression result;
performing a depth cross-correlation operation on the two classification maps output by the classification branch and the regression branch to obtain a classification result;
taking the position of the maximum value of the classification result as the predicted position of the target;
obtaining, from the regression result, the predicted bounding box corresponding to the predicted position as the prediction box of the target.

5. An electronic device, characterized in that the electronic device comprises a processor and a memory, wherein the memory stores at least one instruction, at least one program, a code set or an instruction set, and the at least one instruction, the at least one program, the code set or the instruction set is loaded and executed by the processor to implement the multi-scale UAV aerial photography target tracking method according to any one of claims 1-3.

6. A computer-readable storage medium, characterized in that the storage medium stores at least one instruction, at least one program, a code set or an instruction set, and the at least one instruction, the at least one program, the code set or the instruction set is loaded and executed by a processor to implement the multi-scale UAV aerial photography target tracking method according to any one of claims 1-3.
CN202310429983.XA 2023-04-21 2023-04-21 Multi-scale unmanned aerial vehicle aerial photographing target tracking method and device Active CN116797628B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310429983.XA CN116797628B (en) 2023-04-21 2023-04-21 Multi-scale unmanned aerial vehicle aerial photographing target tracking method and device

Publications (2)

Publication Number Publication Date
CN116797628A CN116797628A (en) 2023-09-22
CN116797628B true CN116797628B (en) 2025-07-25

Family

ID=88036950

Country Status (1)

Country Link
CN (1) CN116797628B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117437530A (en) * 2023-10-12 2024-01-23 中国科学院声学研究所 Synthetic aperture sonar twin matching identification method and system for small targets of interest

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112348849A (en) * 2020-10-27 2021-02-09 南京邮电大学 Twin network video target tracking method and device
CN113298850A (en) * 2021-06-11 2021-08-24 安徽大学 Target tracking method and system based on attention mechanism and feature fusion

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109272530B (en) * 2018-08-08 2020-07-21 北京航空航天大学 Target tracking method and device for space-based monitoring scene
CN113436227A (en) * 2021-06-07 2021-09-24 南京航空航天大学 Twin network target tracking method based on inverted residual error
CN113807187B (en) * 2021-08-20 2024-04-02 北京工业大学 Unmanned aerial vehicle video multi-target tracking method based on attention feature fusion
CN115690156A (en) * 2022-10-30 2023-02-03 天翼电子商务有限公司 A UAV target tracking method based on multi-scale feature fusion
CN115953431A (en) * 2022-12-24 2023-04-11 上海交通大学 Multi-target tracking method and system for UAV aerial video

Similar Documents

Publication Publication Date Title
CN110807757B (en) Image quality assessment method, device and computer equipment based on artificial intelligence
CN111667399A (en) Training method for style transfer model, method and device for video style transfer
TW202036461A (en) System for disparity estimation and method for disparity estimation of system
CN113011329A (en) Pyramid network based on multi-scale features and dense crowd counting method
CN112308200A (en) Neural network searching method and device
CN112639828A (en) Data processing method, method and equipment for training neural network model
CN112561028B (en) Method for training neural network model, method and device for data processing
CN112215332A (en) Searching method of neural network structure, image processing method and device
CN111260687B (en) Aerial video target tracking method based on semantic perception network and related filtering
CN109492596B (en) Pedestrian detection method and system based on K-means clustering and regional recommendation network
CN109214403A (en) Image-recognizing method, device and equipment, readable medium
CN114299358B (en) Image quality assessment method, device, electronic device and machine-readable storage medium
CN113256546A (en) Depth map completion method based on color map guidance
CN111027347A (en) A video recognition method, device and computer equipment
CN111126278A (en) A method for optimizing and accelerating object detection model for few-category scenes
CN114372999B (en) Object detection method, device, electronic device and storage medium
CN114358204A (en) Self-supervision-based no-reference image quality assessment method and system
CN114463223A (en) Image enhancement processing method and device, computer equipment and medium
CN113763420A (en) Target tracking method, system, storage medium and terminal equipment
Young et al. Feature-align network with knowledge distillation for efficient denoising
Liu et al. ACDnet: An action detection network for real-time edge computing based on flow-guided feature approximation and memory aggregation
Lu et al. Siamese graph attention networks for robust visual object tracking
CN116797628B (en) Multi-scale unmanned aerial vehicle aerial photographing target tracking method and device
CN113658091A (en) Image evaluation method, storage medium and terminal equipment
CN117151987A (en) Image enhancement method, device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant