
CN116797628B - Multi-scale unmanned aerial vehicle aerial photographing target tracking method and device - Google Patents

Multi-scale unmanned aerial vehicle aerial photographing target tracking method and device

Info

Publication number
CN116797628B
CN116797628B CN202310429983.XA
Authority
CN
China
Prior art keywords
branch
feature map
convolution
weighted
sub
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310429983.XA
Other languages
Chinese (zh)
Other versions
CN116797628A (en)
Inventor
金国栋
薛远亮
谭力宁
高晶
龙江雄
田思远
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Rocket Force University of Engineering of PLA
Original Assignee
Rocket Force University of Engineering of PLA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Rocket Force University of Engineering of PLA filed Critical Rocket Force University of Engineering of PLA
Priority to CN202310429983.XA priority Critical patent/CN116797628B/en
Publication of CN116797628A publication Critical patent/CN116797628A/en
Application granted granted Critical
Publication of CN116797628B publication Critical patent/CN116797628B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract


The present invention discloses a multi-scale unmanned aerial vehicle (UAV) aerial target tracking method and device in the field of image tracking. The method comprises: acquiring a UAV aerial video; inputting the initial frame and the current frame of the video into the template branch and the search branch of a twin tracking network constructed on the G-ResNet network; outputting three groups of first weighted feature maps and second weighted feature maps from the three convolution blocks layer2, layer3 and layer4 of the G-ResNet network; and performing weighted fusion of the three groups of first and second weighted feature maps using multiple anchor-free region proposal networks to obtain the target tracking result of the current frame. The method addresses the problem that existing UAV tracking algorithms cannot achieve a good balance between accuracy and speed.

Description

Multi-scale unmanned aerial vehicle aerial photographing target tracking method and device
Technical Field
The invention relates to the technical field of image tracking, in particular to a multi-scale unmanned aerial vehicle aerial photographing target tracking method and device.
Background
Object tracking estimates the state of a tracked object in each frame of a video sequence, with information about the object given only in the first frame. During UAV tracking of a target, video images are transmitted to a ground station for display through a data link system. An operator steers the stabilized UAV platform and camera system through control-stick and other commands to search for a reconnaissance target. When a target of interest appears in the picture, it is selected, and a ground computer extracts a set of features of the target as a template. The computer then confirms the position of the target of interest in subsequent images by computing the similarity between the template image and each subsequent image, thereby achieving continuous tracking of the target.
The main problems of the unmanned aerial vehicle tracking algorithm are divided into two aspects:
Tracking precision. Video shot by a UAV covers a large field of view over a wide range, so the imaged target is relatively small: it contains few pixels, its features are sparse and indistinct, and the frame carries a great deal of background information. Multiple similar objects often interfere with tracking, making it difficult for the algorithm to separate the target from the background and easy to lock onto a wrong target. Camera shake and changes in flight speed readily occur during UAV flight, causing motion blur and appearance changes that further impair the algorithm's ability to characterize and distinguish small targets. Moreover, a UAV is highly maneuverable; its flight generally has a high degree of freedom with few constraints, so fast motion and large scale changes occur easily. When the tracking algorithm's scale adaptability is insufficient, excessive background information is included and the target information is contaminated.
Tracking speed. The imaging equipment carried by a UAV may include visible-light, thermal-infrared and SAR sensors, so a single mission collects a large amount of data. Missions are usually executed by several cooperating UAVs, which further increases the amount of information to be processed, so the tracking algorithm must process large volumes of data in real time.
Traditional correlation-filter tracking algorithms are fast, but they represent the target with hand-crafted features, whose representation capability is insufficient, so tracking precision is hard to improve. Most twin tracking algorithms pursue tracking accuracy with a series of complicated operations and neglect the requirement on tracking speed; as a result, they fail to meet the speed requirement and are difficult to deploy on a UAV platform in real time. Existing UAV tracking algorithms therefore cannot reach a good balance between precision and speed.
Disclosure of Invention
The present invention aims to solve at least one of the technical problems existing in the prior art. Therefore, the first aspect of the present invention provides a multi-scale unmanned aerial vehicle aerial photographing target tracking method, which comprises:
Acquiring an unmanned aerial vehicle aerial video;
Inputting an initial frame and a current frame of the UAV aerial video into the template branch and the search branch of a twin tracking network constructed on the G-ResNet network, and outputting three groups of first weighted feature maps and second weighted feature maps from the three convolution blocks layer2, layer3 and layer4 of the G-ResNet network respectively, wherein the G-ResNet network is obtained by replacing the 3×3 convolution kernel of the residual module of each Bottleneck in the ResNet network with multiple parallel-stacked convolution layer groups of identical topology, and adding a dual multi-scale attention module after each Bottleneck;
Performing weighted fusion on the three groups of first weighted feature maps and second weighted feature maps using a plurality of anchor-free region proposal networks, and tracking the target of the current frame according to the predicted box and the predicted position in the weighted fusion result.
Further, replacing the 3×3 convolution kernel of the residual module of each Bottleneck in the ResNet network with multiple parallel-stacked convolution layer groups of identical topology comprises:
In layer1, the 64-channel 3×3 convolution kernel in the residual modules of the 3 Bottlenecks is divided, by group convolution, into 32 parallel-stacked groups of 3×3 convolution kernels with 4 channels each;
In layer2, the 128-channel 3×3 convolution kernels in the 4 Bottleneck residual modules are divided, by group convolution, into 32 parallel-stacked groups of 3×3 convolution kernels with 8 channels each;
In layer3, the 256-channel 3×3 convolution kernels in the 6 Bottleneck residual modules are divided, by group convolution, into 32 parallel-stacked groups of 3×3 convolution kernels with 16 channels each;
In layer4, the 512-channel 3×3 convolution kernels in the 3 Bottleneck residual modules are divided, by group convolution, into 32 parallel-stacked groups of 3×3 convolution kernels with 32 channels each.
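The grouped replacement above trades one wide 3×3 convolution for many narrow parallel ones. A minimal Python sketch (illustrative only, not from the patent) shows why grouping keeps the parameter count down while cardinality rises:

```python
def conv_params(c_in, c_out, k, groups=1):
    # A grouped 2-D convolution maps c_in/groups input channels to
    # c_out/groups output channels within each of `groups` groups.
    assert c_in % groups == 0 and c_out % groups == 0
    return groups * (c_in // groups) * (c_out // groups) * k * k

# A 64-channel 3x3 convolution versus the same width split into 32 groups.
standard = conv_params(64, 64, 3)            # 36864 parameters
grouped = conv_params(64, 64, 3, groups=32)  # 1/32 of the parameters
print(standard, grouped)
```

With the parameter budget freed by grouping, the network's width (cardinality) can be raised, which is the route described here to richer features without a deeper backbone.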
Further, outputting the first weighted feature map and the second weighted feature map from the three convolution blocks layer2, layer3 and layer4 of the G-ResNet network respectively comprises:
Extracting, through the dual multi-scale attention module, a first feature map and a second feature map output by the first Bottleneck of layer2, layer3 and layer4 of the template branch and of the search branch respectively;
Grouping the first feature map and the second feature map respectively to obtain a plurality of grouped feature maps corresponding to each;
Decomposing each grouped feature map into a first sub-feature map and a second sub-feature map;
Processing the first sub-feature map with the position attention module and the second sub-feature map with the channel attention module, obtaining a third sub-feature map carrying the position attention response and a fourth sub-feature map carrying the channel attention response;
Performing channel fusion on the third sub-feature map and the fourth sub-feature map to obtain a fifth sub-feature map corresponding to the grouped feature map;
Acquiring the plurality of fifth sub-feature maps corresponding to the plurality of grouped feature maps;
Shuffling the fifth sub-feature maps to obtain the weighted feature map output by the first Bottleneck of the template branch and of the search branch;
Propagating the weighted feature maps output by the first Bottleneck of the template branch and the search branch backward in sequence, and outputting the first weighted feature map and the second weighted feature map from the last Bottleneck of layer2, layer3 and layer4 respectively.
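The shuffling step above can be sketched with the usual reshape-transpose-reshape trick. This NumPy toy (an illustrative sketch, not the patent's implementation) interleaves channels so the grouped sub-feature maps exchange information:

```python
import numpy as np

def channel_shuffle(x, groups):
    """Shuffle channels across groups (reshape-transpose-flatten),
    so information mixes between the grouped sub-feature maps."""
    c, h, w = x.shape
    assert c % groups == 0
    x = x.reshape(groups, c // groups, h, w)
    x = x.transpose(1, 0, 2, 3)      # swap group and per-group axes
    return x.reshape(c, h, w)

# 4 channels in 2 groups: channels [0,1 | 2,3] interleave to [0,2,1,3]
x = np.arange(4).reshape(4, 1, 1).astype(float)
shuffled = channel_shuffle(x, groups=2)
print(shuffled.ravel())  # [0. 2. 1. 3.]
```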
Further, the expression of the position attention response is:
X'_k1 = σ(W1·IN(X_k1) + b1) ⊗ X_k1;
wherein X_k1 represents the first sub-feature map; IN(X_k1) represents the spatial information statistics of X_k1 completed using instance normalization; W1 and b1 are parameters used to strengthen the representation of IN(X_k1); σ is the sigmoid nonlinear activation function.
Further, the expression of the channel attention response is:
X'_k2 = σ(W2·F_gap(X_k2) + b2) ⊗ X_k2, with s = F_gap(X_k2) = (1/(H×W)) Σ_{i=1..H} Σ_{j=1..W} X_k2(i, j);
wherein H and W represent the height and width of the second sub-feature map X_k2, respectively; F_gap represents the global average pooling function; W2 and b2 perform the scaling and shifting operations on s; σ represents the sigmoid nonlinear activation function.
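The channel attention response above reduces each channel to a single statistic via global average pooling, then gates the channel with a sigmoid. A hedged NumPy sketch follows (scalar w2 and b2 are simplifications; in the network they are learnable parameters):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x, w2=1.0, b2=0.0):
    """Channel attention in the style of the expression above:
    global average pooling per channel, a scale/shift, sigmoid gating."""
    s = x.mean(axis=(1, 2), keepdims=True)   # F_gap: (C, 1, 1) statistics
    gate = sigmoid(w2 * s + b2)              # per-channel weights in (0, 1)
    return gate * x                          # reweight each channel

x = np.random.default_rng(0).normal(size=(8, 4, 4))
y = channel_attention(x)
assert y.shape == x.shape
```

Because the gate lies in (0, 1), each channel is attenuated in proportion to how uninformative its pooled statistic is.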
Further, performing weighted fusion on the three groups of first weighted feature maps and second weighted feature maps using a plurality of anchor-free region proposal networks comprises:
Arranging an RPN module with an anchor-free strategy between each of the three convolution blocks layer2, layer3 and layer4 of the template branch and the search branch of the G-ResNet network, wherein the anchor-free RPN module comprises a classification branch and a regression branch, and the regression branch predicts the offset between a target pixel point and the real box;
Inputting the first weighted feature map and the second weighted feature map respectively into the convolution networks of the regression branch and the classification branch of the anchor-free RPN module, each input yielding a regression map from the regression branch and a classification map from the classification branch;
Performing a depthwise cross-correlation operation on the two regression maps to obtain the regression result;
Performing a depthwise cross-correlation operation on the two classification maps to obtain the classification result;
Taking the position of the maximum value of the classification result as the predicted position of the target;
Obtaining, from the regression result, the predicted bounding box corresponding to the predicted position as the target prediction box.
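The last two steps, taking the argmax of the classification result and reading the box at that position from the regression result, can be sketched as follows. This is a toy NumPy illustration; the (l, t, r, b) offset layout is an assumption for demonstration, not stated in the patent:

```python
import numpy as np

def select_prediction(cls_map, reg_map):
    """cls_map: (H, W) per-pixel target scores (classification result).
    reg_map: (4, H, W) per-pixel offsets to the box sides (l, t, r, b)."""
    h, w = cls_map.shape
    y, x = divmod(int(np.argmax(cls_map)), w)  # position of maximum score
    l, t, r, b = reg_map[:, y, x]              # offsets at that position
    # Box as (x1, y1, x2, y2) in feature-map coordinates.
    return (x - l, y - t, x + r, y + b), (y, x)

cls_map = np.zeros((5, 5)); cls_map[2, 3] = 1.0
reg_map = np.ones((4, 5, 5))
box, pos = select_prediction(cls_map, reg_map)
print(pos, box)
```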
The invention also provides a multi-scale unmanned aerial vehicle aerial photographing target tracking device, which comprises:
The acquisition module is used for acquiring the aerial video of the unmanned aerial vehicle;
The processing module is used for inputting an initial frame and a current frame of the UAV aerial video into the template branch and the search branch of a twin tracking network constructed on the G-ResNet network, and outputting three groups of first weighted feature maps and second weighted feature maps from the three convolution blocks layer2, layer3 and layer4 of the G-ResNet network respectively, wherein the G-ResNet network is obtained by replacing the 3×3 convolution kernel of the residual module of each Bottleneck in the ResNet network with multiple parallel-stacked convolution layer groups of identical topology and adding a dual multi-scale attention module after each Bottleneck;
The output module is used for performing weighted fusion on the three groups of first weighted feature maps and second weighted feature maps using a plurality of anchor-free region proposal networks, and tracking the target of the current frame according to the predicted box and the predicted position in the weighted fusion result.
The invention also provides an electronic device comprising a processor and a memory, wherein at least one instruction, at least one program, code set or instruction set is stored in the memory, and the at least one instruction, the at least one program, code set or instruction set is loaded and executed by the processor to implement the multi-scale unmanned aerial vehicle aerial photographing target tracking method according to any one of the first aspect.
The present invention also provides a computer readable storage medium having stored therein at least one instruction, at least one program, code set or instruction set, the at least one instruction, at least one program, code set or instruction set being loaded and executed by a processor to implement the multi-scale unmanned aerial vehicle aerial target tracking method according to any of the first aspects.
The embodiment of the invention provides a multi-scale unmanned aerial vehicle aerial photographing target tracking method and device, which have the following beneficial effects compared with the prior art:
1) Using the split-transform-merge subspace learning idea, a grouped residual network G-ResNet is designed. It extracts deep semantic features and diversified features of the target, effectively copes with challenges such as target appearance change and motion blur, and enhances the representation capability for small targets.
2) A dual multi-scale attention module DMSAM is designed. Feature maps are grouped to extract target feature information at different scales; dual attention then extracts the local features of targets in the spatial and channel dimensions respectively and establishes the global dependence between target and background; finally, information exchange between different channels is established, enhancing the scale adaptation and anti-interference capability of the invention.
3) A region proposal module AF-RPN based on an anchor-free strategy is provided to replace predefined anchor boxes, distinguishing target from background pixel by pixel and achieving adaptive perception of the target scale. Multiple AF-RPNs are cascaded on G-ResNet, so that complementary detail and semantic information are effectively used for robust tracking and accurate positioning of the tracked target. Meanwhile, the speed reaches 40.5 FPS, meeting the real-time requirement.
Drawings
In order to more clearly illustrate the technical solution of the present invention, the following description will make a brief introduction to the drawings used in the description of the embodiments or the prior art. It should be apparent that the drawings in the following description are only some embodiments of the present invention, and that other drawings can be obtained from these drawings without inventive effort to those of ordinary skill in the art.
Fig. 1 is a flowchart of an aerial target tracking method of a multi-scale unmanned aerial vehicle provided by an embodiment of the present invention;
Fig. 2 is a network model diagram of an unmanned aerial vehicle target tracking method based on a dual multi-scale attention module;
Fig. 3 is a layer1 replacement example diagram of a multi-scale unmanned aerial vehicle aerial target tracking method according to an embodiment of the present invention;
fig. 4 is a DMSAM schematic diagram of an aerial target tracking method of a multi-scale unmanned aerial vehicle according to an embodiment of the present invention;
fig. 5 is a schematic shuffle diagram of a multi-scale unmanned aerial vehicle aerial target tracking method according to an embodiment of the present invention;
FIG. 6 is an AF-RPN schematic diagram of an aerial target tracking method for a multi-scale unmanned aerial vehicle according to an embodiment of the present invention;
fig. 7 is a block diagram of an aerial target tracking device of a multi-scale unmanned aerial vehicle according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The present specification provides method operational steps as described in the examples or flowcharts, but may include more or fewer operational steps based on conventional or non-inventive labor. When implemented in a real system or server product, the methods illustrated in the embodiments or figures may be performed sequentially or in parallel (e.g., in a parallel processor or multithreaded environment).
At present, target tracking algorithms mainly divide into tracking algorithms based on correlation filtering and tracking algorithms based on deep learning. The correlation-filter tracking algorithm uses a correlation filter from the signal processing field to calculate the similarity between the template and the search image, and uses the Fourier transform to accelerate computation in the frequency domain, greatly reducing the amount of computation and raising the speed to hundreds of frames per second. However, most correlation filtering algorithms represent the tracked target with traditional hand-crafted feature extraction, so robustness and accuracy are insufficient and target tracking tasks in complex scenes cannot be handled effectively.
Due to its great potential in both precision and speed, the twin tracking algorithm has gradually become the mainstream in the field of target tracking, and most subsequent tracking algorithms are built on the twin structure. The working principle of the twin tracking algorithm can be expressed as formula (1); it mainly consists of a feature extraction part φ, a similarity calculation part (★) and a tracking result generation part.
f(z, x) = φ(z) ★ φ(x) + b·𝟙    (1)
wherein f(z, x) is the similarity response map; φ is the feature extraction part; ★ is the cross-correlation operation; b is the bias at each position; 𝟙 denotes a matrix taking the value 1 at every position.
1) Feature extraction part: features are extracted with a twin neural network whose two branches are the template branch and the search branch. The template branch takes the target image z of the initial frame as the template and outputs the template feature map φ(z); the search branch takes the search image x of a subsequent frame and outputs the search feature map φ(x).
2) Similarity calculation part (★): integrates the feature information of the feature maps of the two branches, calculates the similarity between the search feature map and the template feature map, and generates the similarity response map f(z, x).
3) Tracking result generation part: predicts the target position on the search image from the obtained response map; the position of maximum response is generally taken as the predicted target position, followed by target scale estimation and bounding box regression.
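The response-map idea of formula (1) can be illustrated with a naive single-channel cross-correlation. This is an illustrative NumPy toy; real twin trackers correlate multi-channel deep features:

```python
import numpy as np

def xcorr(template, search, b=0.0):
    """Naive single-channel cross-correlation: slide the template over
    the search feature map and sum elementwise products (plus bias b),
    a minimal stand-in for f(z, x) = phi(z) * phi(x) + b."""
    th, tw = template.shape
    sh, sw = search.shape
    out = np.empty((sh - th + 1, sw - tw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(template * search[i:i+th, j:j+tw]) + b
    return out

search = np.zeros((6, 6)); search[3:5, 2:4] = 1.0   # "target" patch
template = np.ones((2, 2))
resp = xcorr(template, search)
peak = tuple(int(v) for v in np.unravel_index(np.argmax(resp), resp.shape))
print(peak)  # top-left of the best match: (3, 2)
```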
The process of online tracking by the twin tracking algorithm mainly comprises the following steps:
1) Input the video sequence frame by frame into the feature extraction part;
2) If the frame is the first frame, the template branch extracts the target features as the template features;
3) If the frame is not the first frame, the search branch extracts the features of the current frame as the search features;
4) The similarity calculation part calculates the similarity between the feature maps and generates a response map;
5) The tracking result generation part predicts the target position in the current frame using the similarity response map;
6) Repeat steps 3-5 until the last frame of the video sequence.
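The online tracking loop above can be sketched as a small skeleton. This is illustrative only; `extract`, `similarity` and `locate` are hypothetical stand-ins for the network parts, not the patent's code:

```python
def track(frames, extract, similarity, locate):
    """Skeleton of the online twin-tracking loop: first frame builds
    the template, later frames are searched against it."""
    template = None
    results = []
    for i, frame in enumerate(frames):
        if i == 0:
            template = extract(frame)       # step 2: template features
            continue
        search = extract(frame)             # step 3: search features
        response = similarity(template, search)  # step 4: response map
        results.append(locate(response))    # step 5: predicted position
    return results

# Toy run with trivial stand-ins for the three stages.
out = track([1, 2, 3],
            extract=lambda f: f,
            similarity=lambda t, s: t * s,
            locate=lambda r: r)
print(out)  # responses for frames 2 and 3
```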
Fig. 1 is a flowchart of a multi-scale unmanned aerial vehicle aerial target tracking method provided by an embodiment of the present invention, where, as shown in fig. 1, the method includes:
Step 101, acquiring an unmanned aerial vehicle aerial video;
Step 102, inputting the initial frame and the current frame of the UAV aerial video into the template branch and the search branch of a twin tracking network constructed on the G-ResNet network, and outputting three groups of first weighted feature maps and second weighted feature maps from layer2, layer3 and layer4 of the G-ResNet network respectively, wherein the G-ResNet network is obtained by replacing the 3×3 convolution kernel of the residual module of each Bottleneck in the ResNet network with multiple parallel-stacked convolution layer groups of identical topology and adding a dual multi-scale attention module after each Bottleneck;
And step 103, performing weighted fusion on the three groups of first weighted feature maps and second weighted feature maps using a plurality of anchor-free region proposal networks, and tracking the target of the current frame according to the predicted box and the predicted position in the weighted fusion result.
Fig. 2 is a network model diagram of the unmanned aerial vehicle target tracking method based on the dual multi-scale attention module. As shown in fig. 2, first, a grouped residual network (Group Residual Network, G-ResNet) is designed: convolution blocks with the same topology are stacked in parallel to extract diversified features of the target, enhancing the characterization of the tracked target without increasing network depth. Second, to better screen features, a dual multi-scale attention module (Dual Multi Scale Attention Module, DMSAM) is used to extract multi-scale feature information of the target, suppressing interference information in both the channel and spatial dimensions. In the final tracking-box generation stage, several anchor-free region proposal networks (Anchor Free Region Proposal Network, AF-RPN) adaptively sense the scale change of the target, effectively solving the scale-change problem. Experiments show that the method copes more effectively with problems such as scale change, small targets, motion blur and partial occlusion, improves the tracking of aerial targets, reaches a speed of 40.5 FPS and meets the real-time requirement.
In one possible embodiment, replacing the 3×3 convolution kernel of the residual module of each Bottleneck in the ResNet network with multiple parallel-stacked convolution layer groups of identical topology comprises:
In layer1, the 64-channel 3×3 convolution kernel in the residual modules of the 3 Bottlenecks is divided, by group convolution, into 32 parallel-stacked groups of 3×3 convolution kernels with 4 channels each;
In layer2, the 128-channel 3×3 convolution kernels in the 4 Bottleneck residual modules are divided, by group convolution, into 32 parallel-stacked groups of 3×3 convolution kernels with 8 channels each;
In layer3, the 256-channel 3×3 convolution kernels in the 6 Bottleneck residual modules are divided, by group convolution, into 32 parallel-stacked groups of 3×3 convolution kernels with 16 channels each;
In layer4, the 512-channel 3×3 convolution kernels in the 3 Bottleneck residual modules are divided, by group convolution, into 32 parallel-stacked groups of 3×3 convolution kernels with 32 channels each.
In the embodiment provided by the invention, the cardinality is increased on ResNet-50, which has a deeper number of network layers, to improve network performance. Increasing the cardinality of the network raises its feature description capability more effectively than increasing the number of network layers, while not increasing the number of network parameters. Based on the split-transform-merge design concept, and considering that the 3×3 convolution in the residual block is the main extraction part of the feature information, the 3×3 convolution in the residual block is replaced by multiple parallel-stacked convolutions of the same topology; fig. 3 shows the replacement for layer1. In ordinary convolution, one channel of the output feature map requires all channels of the input feature map to participate in the calculation. In the parallel stacking operation, group convolution divides the 64-channel 3×3 convolution into 32 groups of 4-channel 3×3 convolutions. Different convolution groups can be regarded as different subspaces, and the feature information learned by each subspace differs in emphasis, i.e., diversified feature information of the target is extracted.
In one possible implementation, outputting three groups of first and second weighted feature maps from the three convolution blocks layer2, layer3 and layer4 of the G-ResNet network respectively comprises:
Extracting, through the dual multi-scale attention module, a first feature map and a second feature map output by the first Bottleneck of layer2, layer3 and layer4 of the template branch and of the search branch respectively;
Grouping the first feature map and the second feature map respectively to obtain a plurality of grouped feature maps corresponding to each;
Decomposing each grouped feature map into a first sub-feature map and a second sub-feature map;
Processing the first sub-feature map with the position attention module and the second sub-feature map with the channel attention module, obtaining a third sub-feature map carrying the position attention response and a fourth sub-feature map carrying the channel attention response;
Performing channel fusion on the third sub-feature map and the fourth sub-feature map to obtain a fifth sub-feature map corresponding to the grouped feature map;
Acquiring the plurality of fifth sub-feature maps corresponding to the plurality of grouped feature maps;
Shuffling the fifth sub-feature maps to obtain the weighted feature map output by the first Bottleneck of the template branch and of the search branch;
Propagating the weighted feature maps output by the first Bottleneck of the template branch and the search branch backward in sequence, and outputting the first weighted feature map and the second weighted feature map from the last Bottleneck of layer2, layer3 and layer4 respectively.
In the embodiment provided by the invention, the attention module can adaptively allocate weights and selectively screen the feature map information, so that the network is helped to pay attention to the interested target better, and the defect of G-ResNet can be effectively overcome. Thus, to enhance the discrimination capabilities of the present invention, a dual multiscale attention module (DMSAM) was introduced on G-ResNet. As shown in FIG. 4, in order for the network to learn the feature information of different scales, the DMSAM firstly extracts and groups the features of various scales, then uses the position and channel attention module in parallel to capture local features and global dependencies adaptively, and finally fuses and shuffles the feature graphs of all channels to strengthen the information exchange among different channels.
First, assume that the input feature map is X ∈ R^(C×H×W), where C, H and W represent the number of channels, the height and the width of the feature map, respectively. To reduce the computational cost, X is divided along the channel dimension into G groups of sub-feature maps X_k ∈ R^((C/G)×H×W), k = 1, 2, …, G. Because the sub-feature maps are divided by channel, each sub-feature map can capture specific semantic information during training. Each X_k is further divided into two parts, X_k1 and X_k2: X_k2 uses channel attention to capture the interrelationship between channels, and X_k1 uses position attention to find the spatial relationship between features. Thus, through the weight allocation of the attention modules, the network knows better what to attend to (what) and where it is meaningful to attend (where).
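As a rough illustration of the grouping step just described, the following NumPy sketch splits a feature map X of shape (C, H, W) into G channel groups and halves each group into the two attention inputs X_k1 and X_k2. The function and variable names (`split_groups`, `g`, etc.) are ours, not the patent's.

```python
import numpy as np

def split_groups(x, g):
    # Divide the C channels into g groups, then halve each group
    # into the position-attention and channel-attention inputs.
    c, h, w = x.shape
    groups = x.reshape(g, c // g, h, w)        # G sub-feature maps X_k
    half = (c // g) // 2
    return groups[:, :half], groups[:, half:]  # X_k1, X_k2

x = np.random.rand(64, 8, 8)       # toy feature map, C=64
xk1, xk2 = split_groups(x, 8)      # G=8 groups of 8 channels each
print(xk1.shape, xk2.shape)        # (8, 4, 8, 8) (8, 4, 8, 8)
```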
In one possible embodiment, the expression of the position attention response includes:

X'_k1 = σ(W1 · IN(X_k1) + b1) · X_k1 (2)

wherein X_k1 represents the first sub-feature map, IN(X_k1) represents the spatial information statistics of X_k1 completed using instance normalization, W1 and b1 are the weight and bias used to strengthen IN(X_k1), and σ is the sigmoid nonlinear activation function.
In the embodiment provided by the invention, objects similar to the tracked target are often present during UAV tracking, so the feature map contains feature information of the tracked target as well as feature information of the similar objects. Position attention is intended to enhance the discrimination of similar objects and to give a greater degree of attention to the position of the target. The invention uses instance normalization (Instance Normalization, IN) to complete the spatial information statistics of X_k1; the position attention response X'_k1 is then obtained from formula (3):

X'_k1 = σ(W1 · IN(X_k1) + b1) · X_k1 (3)

wherein W1 and b1 strengthen the representation capability of IN(X_k1). By designing a weight for each position of the feature map, the position attention response effectively suppresses the interference of similar objects and makes clear to the network which positions (where) on the image to focus on.
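Under our reading of the position attention formula, a minimal NumPy sketch looks as follows. W1 and b1 stand in for trained parameters (scalars here for simplicity); in the real network they would be learned per channel.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def position_attention(xk1, w1, b1, eps=1e-5):
    # IN(X_k1): instance normalization over the spatial dimensions
    mu = xk1.mean(axis=(-2, -1), keepdims=True)
    var = xk1.var(axis=(-2, -1), keepdims=True)
    xin = (xk1 - mu) / np.sqrt(var + eps)
    # scale/shift, squash with sigmoid, and reweight the input
    return sigmoid(w1 * xin + b1) * xk1

xk1 = np.random.rand(4, 8, 8)                  # toy X_k1
out = position_attention(xk1, w1=1.0, b1=0.0)  # placeholder parameters
print(out.shape)                               # (4, 8, 8)
```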
In one possible implementation, the channel attention response expression includes:

s = F_gap(X_k2) = (1/(H×W)) Σ_{i=1..H} Σ_{j=1..W} X_k2(i, j) (4)

X'_k2 = σ(W2 · s + b2) · X_k2 (5)

wherein H and W represent the height and width of the second sub-feature map, respectively, X_k2 represents the second sub-feature map, F_gap represents the global average pooling function, W2 and b2 perform the scaling and shifting operations on s, and σ represents the sigmoid nonlinear activation function.
Different channels on the feature map of a deep network represent different semantic information, so the process by which channel attention allocates weights can be seen as a process of selecting semantic attributes for the different channels. The invention uses global average pooling (GAP) to compress the feature layer of X_k2 on each channel and obtain the result s:

s = F_gap(X_k2) = (1/(H×W)) Σ_{i=1..H} Σ_{j=1..W} X_k2(i, j) (6)

To learn the nonlinear relationship between channels, s is then passed through the sigmoid nonlinear activation function σ to obtain weight coefficients that adaptively guide the network in selecting appropriate feature maps; the channel attention response X'_k2 is obtained from formula (7):

X'_k2 = σ(W2 · s + b2) · X_k2 (7)

wherein W2 and b2 perform the scaling and shifting operations on s. Weights are allocated to the feature maps according to their different semantic information, with the channel where the target is located receiving the largest weight. In the cross-correlation operation, the responses on the other channels are suppressed, which makes clear to the network which class (what) it should pay attention to.
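The GAP-based channel attention just described can be sketched in a few lines of NumPy. As above, W2 and b2 are placeholder scalars for what would be trained parameters.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(xk2, w2, b2):
    # F_gap: compress each channel to a scalar via global average pooling
    s = xk2.mean(axis=(-2, -1), keepdims=True)  # shape (C, 1, 1)
    # scale/shift + sigmoid turns s into per-channel weights
    return sigmoid(w2 * s + b2) * xk2

xk2 = np.random.rand(4, 8, 8)                # toy X_k2
out = channel_attention(xk2, w2=1.0, b2=0.0) # placeholder parameters
print(out.shape)                             # (4, 8, 8)
```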
Before shuffling, the attention responses X'_k1 and X'_k2 are connected to obtain a new sub-feature map X'_k, and all the new sub-feature maps are stacked by channel and combined into the feature map X', as shown in formula (8). The operation of formula (9) then applies channel shuffling (channel_shuffle), as shown in FIG. 5: X' is first unfolded into a four-dimensional matrix of shape (G, C/G, H, W); keeping the H and W dimensions unchanged, the G and C/G dimensions are transposed, and the matrix dimensions are then compressed to obtain the output feature map. The shuffling operation can effectively integrate the feature information on each channel and strengthen the information exchange between channels.

X' = concat(X'_1, X'_2, …, X'_G) (8)

X_out = channel_shuffle(X') = reshape(transpose_{G↔C/G}(reshape(X', (G, C/G, H, W)))) (9)
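The channel shuffle operation described above (expand, transpose the two channel dimensions, compress) can be written directly with NumPy reshapes; the function name is ours. With G=2, channels 0–7 are interleaved into the order 0,4,1,5,2,6,3,7, which is what routes information between groups.

```python
import numpy as np

def channel_shuffle(x, g):
    # (C,H,W) -> (G, C/G, H, W) -> transpose group axes -> (C,H,W)
    c, h, w = x.shape
    return (x.reshape(g, c // g, h, w)
             .transpose(1, 0, 2, 3)
             .reshape(c, h, w))

# each original channel k is filled with values k*4 .. k*4+3,
# so ch[0,0] // 4 recovers the original channel index
x = np.arange(8 * 2 * 2, dtype=float).reshape(8, 2, 2)
y = channel_shuffle(x, 2)
print([int(ch[0, 0] // 4) for ch in y])  # [0, 4, 1, 5, 2, 6, 3, 7]
```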
In the DMSAM, target feature information of different scales is first extracted from the grouped feature maps; the dual attention then extracts local features of the target in the channel and spatial dimensions and establishes the global dependency between the target and the background; finally, information exchange is established between the different channels, increasing the difference between the target and interference information and improving the scale adaptability and discrimination capability of the invention.
In one possible implementation, using a plurality of anchor-free region proposal networks to perform weighted fusion on the three groups of first weighted feature maps and second weighted feature maps includes:

An RPN module with an anchor-free strategy is arranged between each of the three convolution blocks layer2, layer3 and layer4 of the template branch and the search branch of the G-ResNet network; the anchor-free RPN module comprises a classification branch and a regression branch, and the regression branch predicts the offset between a target pixel point and the real frame;

respectively inputting the first weighted feature map and the second weighted feature map into the convolutional networks in the regression branch and the classification branch of the anchor-free RPN module, outputting a regression map and a classification map from the regression branch, and outputting a regression map and a classification map from the classification branch;
Performing deep cross-correlation operation on the two regression graphs output by the classification branch and the regression branch to obtain a regression result;
performing deep cross-correlation operation on the two classification graphs output by the classification branch and the regression branch to obtain a classification result;
acquiring the position of the maximum value of the classification result as the predicted position of the target;
and obtaining a prediction boundary frame corresponding to the prediction position from the regression result as a target prediction frame.
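The steps above end by reading a predicted bounding box out of the regression result. As a hypothetical illustration of how such offsets decode into an image-space box: at the predicted cell (i, j) the regression branch gives the distances from that pixel to the four sides of the real frame (named l, t, r, b here). The stride value and all names are our assumptions, not the patent's.

```python
def decode_box(point, offsets, stride=8):
    # map the response-map cell (i, j) back to image coordinates,
    # then expand by the four side offsets to get (x1, y1, x2, y2)
    i, j = point
    l, t, r, b = offsets
    cx, cy = j * stride, i * stride
    return (cx - l, cy - t, cx + r, cy + b)

box = decode_box((10, 12), (16.0, 12.0, 16.0, 12.0))
print(box)  # (80.0, 68.0, 112.0, 92.0)
```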
In one possible implementation manner, a set of anchor boxes of different scales is predefined in the RPN module to perform scale estimation. The prior information of the anchor boxes is obtained by analyzing the video; such prior information runs counter to the starting point of the tracking task, and tracking performance is sensitive to the anchor parameters, which must be set manually and carefully. Therefore, to get rid of excessive dependence on target prior information, the adaptive estimation of the target scale is completed in the RPN module using an anchor-free strategy. In the RPN module based on the anchor-free strategy (AF-RPN), the bounding-box regression branch no longer regresses the size of the anchor (length, width, center point position) but predicts the offsets l, t, b, r between the target pixel point and the real frame (ground-truth). In the anchor-based approach, whether the target in an anchor is a positive sample is judged by calculating the area intersection over union (Intersection over Union, IoU) of the anchor and the real frame; the anchor-free strategy therefore requires a new positive and negative sample discrimination method: the pixel points of the similarity response map are mapped back into the search image, those falling outside ellipse E1 are negative samples, and those falling inside ellipse E2 are positive samples, as shown in FIG. 6.
[S, R] = φ(Z) ⋆ φ(X) (10)

wherein S and R denote the classification result and the regression result, obtained by the depth cross-correlation (⋆) of the template-branch and search-branch feature maps extracted by the feature extraction network φ; w, h and c are the width, height and number of channels of the feature map.
The maximum value is then found on the classification result S; its position is taken as the predicted position of the target, and the corresponding position in the regression result holds a predicted bounding box, which is used as the prediction box of the target.
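To make the depth cross-correlation and the argmax step concrete, the following NumPy sketch implements depthwise cross-correlation with a direct sliding-window loop: each channel of the template kernel z slides over the matching channel of the search feature x. Shapes are toy values and all names are ours, not the tracker's real configuration.

```python
import numpy as np

def depthwise_xcorr(x, z):
    # valid-mode, per-channel cross-correlation of z over x
    c, hx, wx = x.shape
    _, hz, wz = z.shape
    ho, wo = hx - hz + 1, wx - wz + 1
    out = np.zeros((c, ho, wo))
    for k in range(c):
        for i in range(ho):
            for j in range(wo):
                out[k, i, j] = np.sum(x[k, i:i+hz, j:j+wz] * z[k])
    return out

x = np.random.rand(4, 8, 8)   # search-branch features
z = np.random.rand(4, 3, 3)   # template-branch features
s = depthwise_xcorr(x, z)
# the predicted position is the argmax over the (channel-summed) response map
pos = np.unravel_index(np.argmax(s.sum(axis=0)), s.shape[1:])
print(s.shape, pos)
```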
The invention also provides a multi-scale unmanned aerial vehicle aerial photographing target tracking device 200, as shown in fig. 7, comprising:
an acquisition module 201, configured to acquire an aerial video of the unmanned aerial vehicle;
The processing module 202 is configured to input the initial frame and the current frame of the UAV aerial video into the template branch and the search branch of a twin tracking network constructed based on the G-ResNet network, and to output three groups of first weighted feature maps and second weighted feature maps from the three convolution blocks layer2, layer3 and layer4 of the G-ResNet network, respectively, where the G-ResNet network is obtained by stacking in parallel a plurality of convolutional layer groups of the same topological structure to replace the 3×3 convolution kernel of the residual module of each Bottleneck in the ResNet50 network, and adding a dual multi-scale attention module after each Bottleneck;

and the output module 203 is configured to perform weighted fusion on the three groups of first weighted feature maps and second weighted feature maps using a plurality of anchor-free region proposal networks, and to track the target of the current frame according to the prediction box and the predicted position in the weighted fusion result.
In yet another embodiment of the present invention, there is further provided an apparatus, including a processor and a memory, where the memory stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by the processor to implement the multi-scale unmanned aerial vehicle aerial target tracking method described in the embodiments of the present invention.
In yet another embodiment of the present invention, a computer readable storage medium is provided, where at least one instruction, at least one section of program, a code set, or an instruction set is stored, where the at least one instruction, the at least one section of program, the code set, or the instruction set is loaded and executed by a processor to implement the multi-scale unmanned aerial vehicle aerial target tracking method described in the embodiments of the present invention.
In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes a plurality of computer instructions. When loaded and executed on a computer, produces a flow or function in accordance with embodiments of the present invention, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in or transmitted from one computer-readable storage medium to another, for example, by wired (e.g., coaxial cable, optical fiber, digital Subscriber Line (DSL)), or wireless (e.g., infrared, wireless, microwave, etc.). The computer readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that contains an integration of a plurality of available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk Solid STATE DISK (SSD)), etc.
It is noted that relational terms such as first and second are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises", "comprising", or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a(n) ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
In this specification, each embodiment is described in a related manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for system embodiments, since they are substantially similar to method embodiments, the description is relatively simple, as relevant to see a section of the description of method embodiments.
The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention are included in the protection scope of the present invention.

Claims (6)

1. A multi-scale UAV aerial photography target tracking method, characterized by comprising:
acquiring a UAV aerial video;
inputting the initial frame and the current frame of the UAV aerial video into the template branch and the search branch of a twin tracking network constructed based on the G-ResNet network, and outputting three groups of first weighted feature maps and second weighted feature maps from the three convolution blocks layer2, layer3 and layer4 of the G-ResNet network respectively, wherein the G-ResNet network is obtained by stacking in parallel a plurality of convolutional layer groups of the same topological structure to replace the 3×3 convolution kernel of the residual module of each Bottleneck in the ResNet50 network, and adding a dual multi-scale attention module after each Bottleneck;
using a plurality of anchor-free region proposal networks to perform weighted fusion on the three groups of first weighted feature maps and second weighted feature maps, and tracking the target of the current frame according to the predicted box and the predicted position in the weighted fusion result;
wherein replacing the 3×3 convolution kernel of the residual module of each Bottleneck in the ResNet50 network by stacking in parallel a plurality of convolutional layer groups of the same topological structure comprises:
in layer1, dividing, by group convolution, the 3×3 convolution kernels with 64 channels in the 3 Bottleneck residual modules into 32 parallel-stacked convolutional layer groups with 4 channels and a kernel size of 3×3;
in layer2, dividing, by group convolution, the 3×3 convolution kernels with 128 channels in the 4 Bottleneck residual modules into 32 parallel-stacked convolutional layer groups with 8 channels and a kernel size of 3×3;
in layer3, dividing, by group convolution, the 3×3 convolution kernels with 256 channels in the 6 Bottleneck residual modules into 32 parallel-stacked convolutional layer groups with 16 channels and a kernel size of 3×3;
in layer4, dividing, by group convolution, the 3×3 convolution kernels with 512 channels in the 3 Bottleneck residual modules into 32 parallel-stacked convolutional layer groups with 32 channels and a kernel size of 3×3;
wherein outputting three groups of first weighted feature maps and second weighted feature maps from the three convolution blocks layer2, layer3 and layer4 of the G-ResNet network respectively comprises:
extracting, through the dual multi-scale attention module, the first feature map and the second feature map output by the first Bottleneck in layer2, layer3 and layer4 of the template branch and the search branch respectively;
grouping the first feature map and the second feature map respectively to obtain a plurality of grouped feature maps corresponding to the first feature map and the second feature map respectively;
decomposing each grouped feature map into a first sub-feature map and a second sub-feature map;
processing the first sub-feature map and the second sub-feature map with the position attention module and the channel attention module respectively, to obtain a third sub-feature map with a position attention response and a fourth sub-feature map with a channel attention response;
performing channel fusion on the third sub-feature map and the fourth sub-feature map to obtain a fifth sub-feature map corresponding to the grouped feature map;
obtaining a plurality of fifth sub-feature maps corresponding to the plurality of grouped feature maps;
shuffling the plurality of fifth sub-feature maps to obtain the weighted feature maps output by the first Bottleneck of the template branch and of the search branch;
propagating the weighted feature maps output by the first Bottleneck of the template branch and the search branch backward in sequence, and outputting the first weighted feature map and the second weighted feature map from the last Bottleneck of layer2, layer3 and layer4 respectively;
wherein using a plurality of anchor-free region proposal networks to perform weighted fusion on the three groups of first weighted feature maps and second weighted feature maps comprises:
providing an RPN module with an anchor-free strategy between each of the layer2, layer3 and layer4 convolution blocks of the template branch and the search branch of the G-ResNet network, the anchor-free RPN module comprising a classification branch and a regression branch, the regression branch being used to predict the offset between a target pixel point and the real frame;
inputting the first weighted feature map and the second weighted feature map into the convolutional networks in the regression branch and the classification branch of the anchor-free RPN module respectively, outputting a regression map and a classification map from the regression branch, and outputting a regression map and a classification map from the classification branch;
performing a depth cross-correlation operation on the two regression maps output by the classification branch and the regression branch to obtain a regression result;
performing a depth cross-correlation operation on the two classification maps output by the classification branch and the regression branch to obtain a classification result;
taking the position of the maximum value of the classification result as the predicted position of the target;
obtaining, from the regression result, the predicted bounding box corresponding to the predicted position as the prediction box of the target.

2. The multi-scale UAV aerial photography target tracking method according to claim 1, wherein the expression of the position attention response comprises:
X'_k1 = σ(W1 · IN(X_k1) + b1) · X_k1;
wherein X_k1 represents the first sub-feature map, IN(X_k1) represents the spatial information statistics of X_k1 completed using instance normalization, W1 and b1 respectively represent the weight and bias used to strengthen IN(X_k1), and σ represents the sigmoid nonlinear activation function.

3. The multi-scale UAV aerial photography target tracking method according to claim 1, wherein the channel attention response expression comprises:
s = F_gap(X_k2) = (1/(H×W)) Σ_{i=1..H} Σ_{j=1..W} X_k2(i, j);
X'_k2 = σ(W2 · s + b2) · X_k2;
wherein H and W represent the height and width of the second sub-feature map respectively, X_k2 represents the second sub-feature map, F_gap represents the global average pooling function, W2 and b2 perform the scaling and shifting operations on s, and σ represents the sigmoid nonlinear activation function.

4. A multi-scale UAV aerial photography target tracking device, characterized by comprising:
an acquisition module, configured to acquire a UAV aerial video;
a processing module, configured to input the initial frame and the current frame of the UAV aerial video into the template branch and the search branch of a twin tracking network constructed based on the G-ResNet network, and to output three groups of first weighted feature maps and second weighted feature maps from the three convolution blocks layer2, layer3 and layer4 of the G-ResNet network respectively, wherein the G-ResNet network is obtained by stacking in parallel a plurality of convolutional layer groups of the same topological structure to replace the 3×3 convolution kernel of the residual module of each Bottleneck in the ResNet50 network, and adding a dual multi-scale attention module after each Bottleneck;
an output module, configured to use a plurality of anchor-free region proposal networks to perform weighted fusion on the three groups of first weighted feature maps and second weighted feature maps, and to track the target of the current frame according to the predicted box and the predicted position in the weighted fusion result;
wherein replacing the 3×3 convolution kernel of the residual module of each Bottleneck in the ResNet50 network by stacking in parallel a plurality of convolutional layer groups of the same topological structure comprises:
in layer1, dividing, by group convolution, the 3×3 convolution kernels with 64 channels in the 3 Bottleneck residual modules into 32 parallel-stacked convolutional layer groups with 4 channels and a kernel size of 3×3;
in layer2, dividing, by group convolution, the 3×3 convolution kernels with 128 channels in the 4 Bottleneck residual modules into 32 parallel-stacked convolutional layer groups with 8 channels and a kernel size of 3×3;
in layer3, dividing, by group convolution, the 3×3 convolution kernels with 256 channels in the 6 Bottleneck residual modules into 32 parallel-stacked convolutional layer groups with 16 channels and a kernel size of 3×3;
in layer4, dividing, by group convolution, the 3×3 convolution kernels with 512 channels in the 3 Bottleneck residual modules into 32 parallel-stacked convolutional layer groups with 32 channels and a kernel size of 3×3;
wherein outputting three groups of first weighted feature maps and second weighted feature maps from the three convolution blocks layer2, layer3 and layer4 of the G-ResNet network respectively comprises:
extracting, through the dual multi-scale attention module, the first feature map and the second feature map output by the first Bottleneck in layer2, layer3 and layer4 of the template branch and the search branch respectively;
grouping the first feature map and the second feature map respectively to obtain a plurality of grouped feature maps corresponding to the first feature map and the second feature map respectively;
decomposing each grouped feature map into a first sub-feature map and a second sub-feature map;
processing the first sub-feature map and the second sub-feature map with the position attention module and the channel attention module respectively, to obtain a third sub-feature map with a position attention response and a fourth sub-feature map with a channel attention response;
performing channel fusion on the third sub-feature map and the fourth sub-feature map to obtain a fifth sub-feature map corresponding to the grouped feature map;
obtaining a plurality of fifth sub-feature maps corresponding to the plurality of grouped feature maps;
shuffling the plurality of fifth sub-feature maps to obtain the weighted feature maps output by the first Bottleneck of the template branch and of the search branch;
propagating the weighted feature maps output by the first Bottleneck of the template branch and the search branch backward in sequence, and outputting the first weighted feature map and the second weighted feature map from the last Bottleneck of layer2, layer3 and layer4 respectively;
wherein using a plurality of anchor-free region proposal networks to perform weighted fusion on the three groups of first weighted feature maps and second weighted feature maps comprises:
providing an RPN module with an anchor-free strategy between each of the layer2, layer3 and layer4 convolution blocks of the template branch and the search branch of the G-ResNet network, the anchor-free RPN module comprising a classification branch and a regression branch, the regression branch being used to predict the offset between a target pixel point and the real frame;
inputting the first weighted feature map and the second weighted feature map into the convolutional networks in the regression branch and the classification branch of the anchor-free RPN module respectively, outputting a regression map and a classification map from the regression branch, and outputting a regression map and a classification map from the classification branch;
performing a depth cross-correlation operation on the two regression maps output by the classification branch and the regression branch to obtain a regression result;
performing a depth cross-correlation operation on the two classification maps output by the classification branch and the regression branch to obtain a classification result;
taking the position of the maximum value of the classification result as the predicted position of the target;
obtaining, from the regression result, the predicted bounding box corresponding to the predicted position as the prediction box of the target.

5. An electronic device, characterized in that the electronic device comprises a processor and a memory, wherein the memory stores at least one instruction, at least one program, a code set or an instruction set, and the at least one instruction, the at least one program, the code set or the instruction set is loaded and executed by the processor to implement the multi-scale UAV aerial photography target tracking method according to any one of claims 1-3.

6. A computer-readable storage medium, characterized in that the storage medium stores at least one instruction, at least one program, a code set or an instruction set, and the at least one instruction, the at least one program, the code set or the instruction set is loaded and executed by a processor to implement the multi-scale UAV aerial photography target tracking method according to any one of claims 1-3.
CN202310429983.XA 2023-04-21 2023-04-21 Multi-scale unmanned aerial vehicle aerial photographing target tracking method and device Active CN116797628B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310429983.XA CN116797628B (en) 2023-04-21 2023-04-21 Multi-scale unmanned aerial vehicle aerial photographing target tracking method and device

Publications (2)

Publication Number Publication Date
CN116797628A CN116797628A (en) 2023-09-22
CN116797628B true CN116797628B (en) 2025-07-25

Family

ID=88036950

Country Status (1)

Country Link
CN (1) CN116797628B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117437530A (en) * 2023-10-12 2024-01-23 中国科学院声学研究所 Synthetic aperture sonar twin matching identification method and system for small targets of interest

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112348849A (en) * 2020-10-27 2021-02-09 南京邮电大学 Twin network video target tracking method and device
CN113298850A (en) * 2021-06-11 2021-08-24 安徽大学 Target tracking method and system based on attention mechanism and feature fusion

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109272530B (en) * 2018-08-08 2020-07-21 北京航空航天大学 Target tracking method and device for space-based monitoring scene
CN113436227A (en) * 2021-06-07 2021-09-24 南京航空航天大学 Twin network target tracking method based on inverted residual error
CN113807187B (en) * 2021-08-20 2024-04-02 北京工业大学 Unmanned aerial vehicle video multi-target tracking method based on attention feature fusion
CN115690156A (en) * 2022-10-30 2023-02-03 天翼电子商务有限公司 A UAV target tracking method based on multi-scale feature fusion
CN115953431A (en) * 2022-12-24 2023-04-11 上海交通大学 Multi-target tracking method and system for UAV aerial video

Similar Documents

Publication Publication Date Title
CN110807757B (en) Image quality assessment method, device and computer equipment based on artificial intelligence
CN111667399A (en) Training method for style transfer model, method and device for video style transfer
TW202036461A (en) System for disparity estimation and method for disparity estimation of system
CN113011329A (en) Pyramid network based on multi-scale features and dense crowd counting method
CN112308200A (en) Neural network searching method and device
CN112639828A (en) Data processing method, method and equipment for training neural network model
CN112561028B (en) Method for training neural network model, method and device for data processing
CN112215332A (en) Searching method of neural network structure, image processing method and device
CN111260687B (en) Aerial video target tracking method based on semantic perception network and related filtering
CN109492596B (en) Pedestrian detection method and system based on K-means clustering and regional recommendation network
CN109214403A (en) Image-recognizing method, device and equipment, readable medium
CN114299358B (en) Image quality assessment method, device, electronic device and machine-readable storage medium
CN113256546A (en) Depth map completion method based on color map guidance
CN111027347A (en) A video recognition method, device and computer equipment
CN111126278A (en) A method for optimizing and accelerating object detection model for few-category scenes
CN114372999B (en) Object detection method, device, electronic device and storage medium
CN114358204A (en) Self-supervision-based no-reference image quality assessment method and system
CN114463223A (en) Image enhancement processing method and device, computer equipment and medium
CN113763420A (en) Target tracking method, system, storage medium and terminal equipment
Young et al. Feature-align network with knowledge distillation for efficient denoising
Liu et al. ACDnet: An action detection network for real-time edge computing based on flow-guided feature approximation and memory aggregation
Lu et al. Siamese graph attention networks for robust visual object tracking
CN116797628B (en) Multi-scale unmanned aerial vehicle aerial photographing target tracking method and device
CN113658091A (en) Image evaluation method, storage medium and terminal equipment
CN117151987A (en) Image enhancement method, device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant