
CN110517285B - Large-scene minimum target tracking based on motion estimation ME-CNN network - Google Patents


Info

Publication number
CN110517285B
CN110517285B (application CN201910718847.6A)
Authority
CN
China
Prior art keywords
target
network
cnn
training
motion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910718847.6A
Other languages
Chinese (zh)
Other versions
CN110517285A (en)
Inventor
焦李成
杨晓岩
李阳阳
唐旭
程曦娜
刘旭
杨淑媛
冯志玺
侯彪
张丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN201910718847.6A priority Critical patent/CN110517285B/en
Publication of CN110517285A publication Critical patent/CN110517285A/en
Application granted granted Critical
Publication of CN110517285B publication Critical patent/CN110517285B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/207Analysis of motion for motion estimation over a hierarchy of resolutions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10032Satellite or aerial image; Remote sensing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention proposes a large-scene minimal target tracking method based on a motion estimation ME-CNN network, which solves the problem of tracking extremely small targets from motion parameters without image registration. The implementation steps are: obtain the initial training set D of the target motion estimation network ME-CNN; construct the network ME-CNN for estimating target motion; compute the ME-CNN loss function from the target motion parameters; judge whether the current set is the initial training set; update the training labels of the loss function; obtain the initial model for predicting the target motion position; correct the position predicted by the model; update the training data set with the corrected target position, completing one frame of tracking; and obtain the remote sensing video target tracking result. The invention uses the deep learning network ME-CNN to predict the target motion position, avoids image registration over large scenes and the difficulty of extracting features of super-blurred targets during tracking, reduces dependence on target features, and improves the accuracy of target tracking in super-blurred video.

Description

Large-scene minimum target tracking based on motion estimation ME-CNN network
Technical Field
The invention belongs to the technical field of remote sensing video processing, relates to remote sensing video target tracking of a large-scene minimum target, and particularly relates to a large-scene minimum target remote sensing video tracking method based on a motion estimation ME-CNN network. The method is used for safety monitoring, smart city construction, traffic facility monitoring and the like.
Background
Remote sensing target tracking is an important research direction in the field of computer vision. Target tracking in a large-scene, low-resolution remote sensing video with extremely small targets, shot by a moving satellite, is a particularly challenging problem. Such a video records the daily activity of an area over a period of time. Because the satellite shoots from a very high altitude and covers most of a city, the resolution of the video is low; the vehicles, ships and aircraft in the video are extremely small, a vehicle occupying only about 3×3 pixels, and their contrast with the surrounding environment is extremely low, so that the human eye perceives only a small bright spot. Tracking such ultra-low-pixel, extremely small targets therefore belongs to the problem of large-scene minimal target tracking and is especially difficult. Moreover, because the satellite keeps moving, the whole video drifts noticeably in one direction, and some regions are scaled because of terrain height, so the usual approach of first registering the images and then obtaining the target motion with a frame-difference method is difficult to apply. This poses a great challenge to remote sensing video tracking of extremely small targets in large scenes.
Video target tracking predicts the position and size of a target in subsequent video frames, given its position and size in the initial frame. Current algorithms in the video tracking field are mostly based on neural networks or correlation filters. Neural-network-based algorithms, such as the CNN-SVM method, first feed the target into a multilayer neural network to learn target features and then track with a traditional SVM; features learned from a large amount of training data are more discriminative than hand-crafted features. Correlation-filter-based algorithms, such as the KCF method, learn a filter template and convolve it with candidate search regions of the next frame; the search region with the largest response is taken as the predicted target position.
Natural optical video tracking algorithms are difficult to apply to remote sensing videos of extremely small targets in large scenes, because the targets are tiny and blurred and a neural network cannot learn effective target features from them. Traditional remote sensing tracking methods are likewise unsuitable for videos with continuous background drift and partial regional scaling: image registration and the frame-difference method cannot be carried out, the contrast between the target and its surroundings is extremely low, and the target is easily lost.
Disclosure of Invention
The invention aims to overcome the defects in the prior art and provides a large-scene small-target remote sensing video tracking method based on motion estimation, which has low computational complexity and higher precision.
The invention relates to a large-scene minimum target remote sensing video tracking method based on a motion estimation ME-CNN network, which is characterized by comprising the following steps of:
(1) obtaining an initial training set D of a minimum target motion estimation network ME-CNN:
taking the first F frames of the original remote sensing data video A, continuously marking a bounding box for the same target in each frame, and arranging the top-left vertex coordinates of the bounding boxes in frame order to form the training set D;
(2) constructing a network ME-CNN for estimating the movement of the minimum target: the network comprises three parallel convolution modules that extract different features from the training data, followed in sequence by a concatenation layer, a fully connected layer and an output layer;
(3) calculating the loss function of the network ME-CNN by using the minimum target motion parameter: calculating to obtain the motion trend of the target according to the motion rule of the target, taking the motion trend as a training label corresponding to the target, and calculating the Euclidean spatial distance between the training label and the prediction result of the ME-CNN network as a loss function of the ME-CNN network optimization training;
(4) judging whether the training set is an initial training set: judging whether the current training set is an initial training set, if not, executing the step (5) and updating the training labels in the loss function; otherwise, if the training set is the initial training set, executing the step (6) and entering the circular training of the network;
(5) updating the training labels in the loss function: when the current training set is not the initial training set, recalculating the training labels of the loss function from the data of the current training set, using the same minimal-target-motion-parameter calculation as in step (3); the recalculated training labels take part in training the motion estimation network ME-CNN; proceeding to step (6);
(6) obtaining an initial model M1 for predicting the movement position of the target: inputting the training set D into a target motion estimation network ME-CNN, training the network according to the current loss function, and obtaining an initial model M1 for predicting the motion position of the target;
(7) correcting the position result of the prediction model: calculating the auxiliary position offset of the target, and correcting the position result predicted by the motion estimation network ME-CNN with the offset;
(7a) obtaining a target grayscale image block: obtaining the target position (Px, Py) of the next frame from the initial model M1 for predicting the target motion position, taking out a grayscale image block of the target from the next frame at the obtained position (Px, Py), and normalizing it to obtain the normalized target grayscale image block;
(7b) obtaining a target position offset: carrying out brightness grading on the normalized target gray image block, determining the position of a target in the image block by using a vertical projection method, and calculating the distance between the center position of the target and the center position of the image block to obtain the offset of the target position;
(7c) obtaining a corrected target position: correcting the position of the predicted target by the motion estimation network ME-CNN by using the obtained target position offset to obtain all positions of the corrected target;
(8) updating the training data set with the corrected target position to complete target tracking of one frame: appending the obtained top-left position of the target to the last row of the training set D and removing the first row of D in a single operation, obtaining a corrected and updated training set D, completing the training of one frame and obtaining the target position result of one frame;
(9) judging whether the current video frame number is less than the total number of video frames: if it is, repeating steps (4) to (9) in a loop and continuing the tracking optimization training of the target until all video frames have been traversed; otherwise, if the frame number equals the total number of video frames, ending the training and executing step (10);
(10) obtaining a remote sensing video target tracking result: and the accumulated output is the remote sensing video target tracking result.
The invention solves the problems of high calculation complexity and low tracking precision of the existing video tracking algorithm.
Compared with the prior art, the invention has the following advantages:
(1) The ME-CNN adopted by the invention does not need the traditional pipeline of image registration followed by a frame-difference method, nor complex background modeling, to obtain the motion trajectory of the target. It analyses, through a neural network, a training set consisting of the target positions in the first F frames, and the network's own predictions drive self-cycling training without manually labeling target positions in subsequent video frames, which greatly reduces the complexity of the tracking algorithm and improves its practicality.
(2) The algorithm combines the ME-CNN network with an auxiliary position-offset method to automatically correct the target position in the remote sensing video, and modifies the loss function of the motion estimation network according to the motion law of the target, reducing the computational load of the network and improving the robustness of target tracking.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention;
fig. 2 is a schematic structural diagram of an ME-CNN network proposed by the present invention;
FIG. 3 is a graph comparing the predicted trajectory results of the present invention for very small targets in a large scene with the standard target trajectory, where the predicted results of the present invention are green curves and red is the accurate target trajectory.
Detailed Description
The invention is described in detail below with reference to the figures and the specific embodiments.
Example 1
Remote sensing video tracking of extremely small targets in large scenes plays an important role in safety monitoring, smart city construction, traffic facility monitoring and the like. The remote sensing video studied by the invention is a large-scene, low-resolution video with extremely small targets shot by a moving satellite. The tracked target is extremely blurred and extremely small, with low contrast against its surroundings; when the target is stationary the human eye can hardly tell that it is a vehicle, and the motion of the satellite and changes in the altitude of the shooting area cause overall image translation and partial zooming, which makes target tracking far more difficult than in clear video and is a core challenge of remote sensing video tracking. Existing methods fall mainly into two classes. One uses a neural network to learn target features, extracts several search boxes in the next frame, and selects the box with the highest target-feature score as the target position. The other first registers the images and applies a frame-difference method to obtain the target motion trajectory, then learns a filter template and convolves it with the next frame; the region with the largest response is the predicted target. In view of these limitations, the invention proposes a large-scene minimal target remote sensing video tracking method based on a motion estimation ME-CNN network. Referring to fig. 1, the method comprises the following steps:
(1) obtaining an initial training set D of a minimum target motion estimation network ME-CNN:
the method comprises the steps of taking front F frame images of an original remote sensing data video A, selecting only one target in each image, and continuously marking a boundary frame for the same target of each image.
(2) Constructing a network ME-CNN for estimating the movement of the minimum target: the ME-CNN network comprises three parallel convolution modules that extract different features from the training data to obtain different motion characteristics of the target; a concatenation layer then fuses the extracted motion features, followed in sequence by a fully connected layer and an output layer that produces the result, forming the ME-CNN network. Three convolution modules are used to obtain different motion characteristics of the target because a single convolution module can hardly capture the characteristics of the whole training set, and a deep network would suffer from vanishing gradients; the invention therefore widens the network and extracts training-set features under different conditions at multiple scales, which reduces the complexity of the network and speeds it up. Because the video of the invention continuously drifts and partial regions are zoomed owing to differences in terrain height, image registration with a frame-difference method, background modeling and similar methods cannot be used for this video; the target motion trajectory can instead be obtained with the ME-CNN network.
(3) Calculating the loss function of the network ME-CNN by using the minimum target motion parameter: the method comprises the steps of calculating the motion trend of a target according to the motion rule of the target, using the motion trend as a training label corresponding to the target, and calculating the Euclidean space distance between the training label and the prediction result of the ME-CNN network to be used as a loss function for optimizing the ME-CNN network.
(4) Judging whether the training set is an initial training set: judge whether the current training set is the initial training set; if not, execute step (5), updating the training labels in the loss function so that they take part in the network training. Otherwise, if the current training set is the initial training set, execute step (6) and enter the cyclic training of the network.
(5) Updating training labels in the loss function: because the training set D is continuously updated in the subsequent step (8), the training labels in the loss function need to be continuously adjusted according to the updated training set D in the training process, when the current training set is not the initial training set, the training labels of the loss function should be recalculated by using the data of the current training set, and the calculation method is the same as the method of the step (3) in that the training labels are calculated by using the minimum target motion parameters; and (5) the recalculated training label participates in the ME-CNN training of the motion estimation network, and the step (6) is entered.
(6) Obtaining an initial model M1 for predicting the movement position of the target: and inputting the training set D into the object motion estimation network ME-CNN, training the network according to the current loss function, and obtaining an initial model M1 for predicting the motion position of the object.
(7) Position result of the corrected prediction model: and calculating the auxiliary position offset of the target, and correcting the position result predicted by the motion estimation network ME-CNN by using the offset.
(7a) Obtaining a target grayscale image block: obtain the target position (Px, Py) of the next frame from the initial model M1 for predicting the target motion position, take out the grayscale image block of the target from the next frame at the obtained position (Px, Py), and normalize it to obtain the normalized target grayscale image block. Because the target is extremely small and its contrast with the surroundings is extremely low, judging the offset on the image block with a neural network works poorly; it works better to first take a smaller target box and then judge the offset within that box.
(7b) Obtaining a target position offset: perform brightness grading on the normalized target grayscale image block so that the target and the road are displayed at different brightness levels; because the contrast between the road surroundings and the target is extremely low, determine the position of the target in the image block with a vertical projection method, and compute the distance between the target center and the image-block center to obtain the target position offset.
(7c) Obtaining a corrected target position: and correcting the position of the predicted target by the motion estimation network ME-CNN by using the obtained target position offset to obtain all the corrected positions of the target, including the position of the upper left corner of the target.
(8) And updating the training data set by using the corrected target position to complete target tracking of one frame: and adding the obtained position of the upper left corner of the target into the last line of the training set D, removing the first line of the training set D, performing one-time operation to obtain a corrected and updated training set D, completing the training of one frame, and obtaining the target position result of one frame.
(9) Judge whether the current video frame number is less than the total number of video frames. If so, repeat steps (4) to (9) in a loop, updating the model parameters again to improve the model's adaptability, and continue the tracking optimization training of the target until all video frames have been traversed; otherwise, if the current frame number equals the total number of video frames, end the training and execute step (10).
(10) Obtaining a remote sensing video target tracking result: and after the training is finished, the accumulated target position output is the remote sensing video target tracking result.
The ME-CNN adopted by the invention does not need image registration followed by a frame-difference method, nor the complex background modeling of traditional methods, to obtain the target motion trajectory; the new algorithm effectively extracts the target's motion characteristics by analysing, with a neural network, the training set formed by the target positions in the first F frames. Because a network that is too deep suffers from vanishing gradients and related problems, the multi-scale ME-CNN network is used to predict the motion trend of the target; no manual labeling of target positions in subsequent frames is required, so the network trains in a self-cycling manner, which greatly reduces the complexity of the tracking algorithm, improves its practicality, and allows the target position to be found quickly and accurately by the motion estimation network without image registration. The ME-CNN network is combined with an auxiliary position-offset method to automatically determine the target position in the remote sensing video; the motion speed of the target is obtained from its motion, the likely motion trend is analysed, and the loss function of the motion estimation network is modified accordingly, improving the robustness of target tracking.
The method performs motion analysis of the super-blurred target with a deep-learning-based approach, predicts its next direction of motion, and corrects the motion estimation network with the position offset; it can track the target without labels for subsequent frames, thereby avoiding large-scene image registration during tracking and the difficulty of extracting features of super-blurred targets. It noticeably improves the accuracy of target tracking in super-blurred video and is also applicable to tracking in various other remote sensing videos.
Example 2
The method for tracking a large-scene minimum target remote sensing video based on a motion estimation ME-CNN network is the same as that in embodiment 1, and the method for constructing the network ME-CNN for estimating the minimum target motion described in step (2) comprises the following steps as shown in FIG. 2:
(2a) Overall structure of the motion estimation network: the motion estimation network ME-CNN comprises three convolution modules connected in parallel; a concatenation layer fuses the different motion features they extract, a fully connected layer refines and analyses the fused features, and the output layer produces the result.
(2b) Structure of the three parallel convolution modules: the parallel convolution modules are convolution module I, convolution module II and convolution module III, wherein
convolution module I comprises a locally connected LocallyConnected1D convolution layer with a stride of 2, which extracts the coordinate position information of the target;
convolution module II comprises a dilated (atrous) convolution with a stride of 1;
convolution module III comprises a one-dimensional convolution with a stride of 2;
convolution modules I, II and III obtain position features of the target at different scales, yielding three outputs; the outputs of the three modules are then concatenated to give the fused convolution result, which is fed to the fully connected layer and the output layer to obtain the final prediction. Three convolution modules are used to obtain different motion characteristics of the target because a single convolution module can hardly capture the characteristics of the whole training set, and a deep network would suffer from vanishing gradients; the network is therefore widened and training-set features under different conditions are extracted at multiple scales, reducing the complexity of the network and speeding it up. Because the video continuously drifts and partial regions are zoomed owing to differences in terrain height, image registration with a frame-difference method, background modeling and similar methods cannot be used; the target motion trajectory is instead obtained with the ME-CNN network.
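For concreteness, a minimal Keras sketch of this structure is given below. Only the module types and strides (a LocallyConnected1D layer with stride 2, a dilated convolution with stride 1, a one-dimensional convolution with stride 2, followed by concatenation, a fully connected layer and an output layer) are taken from the description; the window length F, the filter counts, the kernel sizes and the flattening before concatenation are illustrative assumptions.

```python
# Minimal sketch of the ME-CNN motion-estimation network (Keras 2.x / TF 2.x,
# where LocallyConnected1D is available).  Filter counts, kernel sizes and F
# are illustrative assumptions; the patent only fixes the module types, strides
# and the concatenation / fully-connected / output stacking.
from tensorflow import keras
from tensorflow.keras import layers

F = 10  # number of past frames in the training window (assumed)

inp = layers.Input(shape=(F, 2))        # F rows of (x, y) top-left coordinates

# Module I: locally connected 1-D convolution, stride 2
m1 = layers.LocallyConnected1D(8, 3, strides=2)(inp)
# Module II: dilated (atrous) convolution, stride 1
m2 = layers.Conv1D(8, 3, strides=1, dilation_rate=2, padding='same')(inp)
# Module III: ordinary 1-D convolution, stride 2
m3 = layers.Conv1D(8, 3, strides=2, padding='same')(inp)

# Fuse the three multi-scale position features, then refine and output (Px, Py)
fused = layers.Concatenate()([layers.Flatten()(m1),
                              layers.Flatten()(m2),
                              layers.Flatten()(m3)])
fc = layers.Dense(32, activation='relu')(fused)
out = layers.Dense(2)(fc)               # predicted position (Px, Py)

me_cnn = keras.Model(inp, out)
```

Flattening each branch before concatenation is one way to fuse feature maps of different lengths into a single vector for the fully connected layer; the description itself only states that the three outputs are connected in series.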
Example 3
The method for tracking the large-scene minimum target remote sensing video based on the motion estimation ME-CNN network is the same as the embodiment 1-2, the loss function of the network ME-CNN is calculated by using the minimum target motion parameters in the step 3, the motion condition of the target is roughly analyzed by processing the data of the training set D, and a certain guiding function is provided for the optimization direction of the motion estimation network ME-CNN, and the method comprises the following steps:
(3a) Acquiring the target displacement of training set D: take the data of rows F, F-2 and F-4 of the training set D and subtract the data of the first row of D from each, giving the target displacements between frame F, frame F-2 and frame F-4 and the first frame as S1, S2, S3 in sequence. S1 is the target displacement between frame F and the first frame, S2 between frame F-2 and the first frame, and S3 between frame F-4 and the first frame. If the training set is not the initial one but a training set D that has been updated i times, the frame number corresponding to each row changes accordingly to frame 1+i, frame 2+i, ..., frame F+i; taking the data of rows F, F-2 and F-4 of D and subtracting the first row of D from each then gives the displacements between frames F+i, F+i-2, F+i-4 and the first frame, again denoted S1, S2, S3 in sequence.
(3b) Obtaining the motion trend of the target:
according to the motion law of the target, the motion trend (Gx, Gy) of the target along the x and y directions of the image coordinate system is computed from the obtained target displacements with the following formulas:
V1=(S1-S2)/2
V2=(S2-S3)/2
a=(V1-V2)/2
G=V1+a/2
The invention uses an image coordinate system with its origin at the upper-left corner of the image, the x direction pointing horizontally to the right and the y direction vertically downward. In the above formulas, V1 is the target velocity between displacements S1 and S2, V2 is the target velocity between displacements S2 and S3, a is the motion acceleration, and G is the motion trend of the target.
(3c) Constructing a loss function of a motion estimation network ME-CNN:
the motion trend of the target is computed according to its motion law and used as the training label corresponding to the target; the Euclidean distance between the computed motion trend (Gx, Gy) and the predicted position (Px, Py) output by the motion estimation network ME-CNN is constructed as the loss function of the network:
loss = sqrt((Gx - Px)^2 + (Gy - Py)^2)
where Gx is the target motion trend along the x direction in the image coordinate system, Gy is the target motion trend along the y direction, Px is the prediction of the motion estimation network along the x direction, and Py is its prediction along the y direction.
A comprehensive example is given below to further illustrate the invention.
Example 4
The method for tracking the remote sensing video of the large-scene tiny target based on the motion estimation ME-CNN network is the same as the embodiment 1-3,
referring to fig. 1, a large-scene minimal target remote sensing video tracking method based on a motion estimation ME-CNN network includes the following steps:
(1) obtaining an initial training set D of a minimum target motion estimation network ME-CNN:
taking the first F frames of the original remote sensing data video A, continuously marking a bounding box for the target in each frame, and stacking the top-left vertex coordinates of the bounding boxes to form the training set D, which is a matrix of F rows and 2 columns in which each row corresponds to the target coordinates of one frame of the video; the position of the target may be represented by the top-left vertex coordinates or by the center coordinates without affecting the analysis of the target's motion; in the invention the minimal target is referred to simply as the target.
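For illustration, the training set D described above could be assembled as follows; the coordinate values and F = 10 are hypothetical, and only the F×2 layout with one row of top-left coordinates per frame comes from the description.

```python
import numpy as np

# Hypothetical top-left (x, y) coordinates of the manually marked bounding
# boxes of the same target in the first F frames of video A.
F = 10
top_left = [(412, 305), (414, 306), (416, 308), (418, 309), (420, 311),
            (422, 312), (424, 314), (426, 315), (428, 317), (430, 318)]

# Training set D: an F x 2 matrix, one row of target coordinates per frame.
D = np.asarray(top_left, dtype=np.float32)
```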
(2) Constructing a network ME-CNN for estimating the movement of the minimum target: the network comprises three parallel convolution modules that extract different features from the training data so as to obtain different motion characteristics of the target; a single convolution layer can hardly capture the characteristics of the whole training set, and a deep network would suffer from vanishing gradients, so the network is widened and training-set features under different conditions are extracted at multiple scales, which reduces the complexity of the network and speeds it up; a concatenation layer is then stacked to fuse the extracted motion features, followed by a fully connected layer for analysis and an output layer that produces the result.
(2a) Overall structure of the motion estimation network: the motion estimation network ME-CNN comprises three convolution modules connected in parallel, and a connection layer, a full connection layer and an output layer are sequentially stacked;
(2b) Structure of the three parallel convolution modules: the parallel convolution modules are convolution module I, convolution module II and convolution module III, wherein
convolution module I comprises a locally connected LocallyConnected1D convolution layer with a stride of 2, which extracts the coordinate position information of the target;
convolution module II comprises a dilated (atrous) convolution with a stride of 1;
convolution module III comprises a one-dimensional convolution with a stride of 2;
convolution modules I, II and III obtain position features of the target at different scales, yielding three outputs; the outputs of the three convolution modules are then concatenated to give the fused convolution result, which is fed to the fully connected layer and the output layer to obtain the final prediction.
(3) Constructing a loss function of the ME-CNN of the minimum target motion estimation network: calculating to obtain the motion trend of the target according to the motion rule of the target, taking the motion trend as a training label corresponding to the target, and calculating the Euclidean spatial distance between the motion trend and the prediction result of the ME-CNN network as a loss function of the ME-CNN network;
(3a) Acquiring the target displacement of training set D: if the training set is the initial training set, take the data of rows F, F-2 and F-4 of training set D and subtract the data of the first row of D from each, giving the target displacements between frame F, frame F-2 and frame F-4 and the first frame as S1, S2, S3 in sequence; S1 is the target displacement between frame F and the first frame, S2 between frame F-2 and the first frame, and S3 between frame F-4 and the first frame. If the training set is not the initial one but a training set D that has been updated i times, the frame number corresponding to each row changes accordingly to frame 1+i, frame 2+i, ..., frame F+i; taking the data of rows F, F-2 and F-4 of D and subtracting the first row of D from each then gives the displacements between frames F+i, F+i-2, F+i-4 and the first frame, again denoted S1, S2, S3 in sequence.
(3b) Obtaining the motion trend of the target:
according to the motion law of the target, the motion trend (Gx, Gy) of the target along the x and y directions of the image coordinate system is computed from the obtained training-data target displacements with the following formulas.
V1=(S1-S2)/2
V2=(S2-S3)/2
a=(V1-V2)/2
G=V1+a/2
(3c) Constructing a loss function of a motion estimation network ME-CNN:
the Euclidean distance between the computed target motion trend (Gx, Gy) and the predicted position (Px, Py) output by the estimation network is constructed as the loss function of the motion estimation network ME-CNN:
loss = sqrt((Gx - Px)^2 + (Gy - Py)^2)
(4) Updating training labels in the loss function: because the training set D is continuously updated in the subsequent step (7), the training labels in the loss function need to be continuously adjusted according to the updated training set D in the training process, and participate in the ME-CNN training of the motion estimation network.
(5) Obtaining an initial model M1 for predicting the movement position of the target: and inputting the training set D into the object motion estimation network ME-CNN, training the network according to the loss function, and obtaining an initial model M1 for predicting the motion position of the object.
(6) Position result of the corrected prediction model: and calculating the auxiliary position offset of the target, and correcting the position result predicted by the motion estimation network ME-CNN by using the offset.
(6a) Obtaining a target grayscale image block: obtain the target position (Px, Py) of the next frame from the initial model M1 for predicting the target motion position, take out the grayscale image block of the target from the next frame at the obtained position (Px, Py), and normalize it to obtain the normalized target grayscale image block. Because the target is extremely small and its contrast with the surroundings is extremely low, judging the offset on the image block with a neural network works poorly; it works better to first take a smaller target box and then judge the offset within that box.
(6b) Obtaining a target position offset: and carrying out brightness grading on the normalized target gray image block, determining the position of a target in the image block by using a vertical projection method, and calculating the distance between the center position of the target and the center position of the image block to obtain the target position offset.
(6c) Obtaining a corrected target position: and correcting the position of the predicted target by the motion estimation network ME-CNN by using the obtained target position offset to obtain all the corrected positions of the target, including the position of the upper left corner of the target.
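A hedged NumPy sketch of the offset estimation and correction of steps (6a)-(6c) follows; the number of brightness levels and the choice of the brightest level as the target are illustrative assumptions, and `block` stands for the normalized target grayscale image block already cropped around (Px, Py).

```python
# Sketch of the auxiliary position-offset correction (steps 6a-6c).
# The brightness-grading level count and the "brightest level = target"
# assumption are illustrative, not taken from the patent text.
import numpy as np

def position_offset(block, n_levels=4):
    """Estimate the target offset inside a normalized grayscale block."""
    graded = np.floor(block * n_levels) / n_levels     # brightness grading
    mask = graded >= graded.max()                      # keep the brightest level (assumed target)
    col_profile = mask.sum(axis=0)                     # vertical projection (per column)
    row_profile = mask.sum(axis=1)                     # horizontal projection (per row)
    cx = int(np.argmax(col_profile))                   # target center inside the block
    cy = int(np.argmax(row_profile))
    h, w = block.shape
    return cx - w // 2, cy - h // 2                    # offset from the block center

def correct_position(P, block):
    """Correct the position (Px, Py) predicted by ME-CNN with the offset."""
    dx, dy = position_offset(block)
    return P[0] + dx, P[1] + dy
```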
(7) And updating the training data set by using the corrected target position to complete target tracking of one frame: and adding the obtained position of the upper left corner of the target into the last line of the training set D, removing the first line of the training set D, performing one-time operation to obtain a corrected and updated training set, completing the training of one frame, and obtaining the target position result of one frame.
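As described in step (7), the update of the training set is a sliding-window operation: append the corrected top-left position as the new last row and drop the first row. A one-line sketch, consistent with the F×2 layout assumed earlier:

```python
import numpy as np

def update_training_set(D, corrected_pos):
    """Append the corrected top-left position and drop the first row of D."""
    return np.vstack([D[1:], np.asarray(corrected_pos, dtype=D.dtype)[None, :]])
```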
(8) Obtaining the remote sensing video target tracking result: repeat steps (4) to (7) in a loop, continually using the updated training set to recompute the training label by the method of step (3), update the network model and iterate, performing tracking optimization training of the target until all video frames have been traversed; the accumulated output is the remote sensing video target tracking result.
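Tying the steps together, the per-frame loop of steps (4)-(8) might look like the sketch below; it reuses the helpers sketched earlier (me_cnn, motion_trend, correct_position, update_training_set), while `frames`, `extract_block` and the optimizer settings are hypothetical stand-ins not specified in the description.

```python
# High-level sketch of the self-cycling tracking loop, under the assumptions
# stated above; `extract_block(frame, P)` is a hypothetical helper that crops
# and normalizes the grayscale patch around the predicted position.
import numpy as np
import tensorflow as tf

def track(frames, D, me_cnn, extract_block, steps_per_frame=50):
    opt = tf.keras.optimizers.Adam(1e-3)
    positions = []
    for frame in frames:
        G = tf.constant(motion_trend(D), dtype=tf.float32)   # training label from current D
        x = D[None, ...].astype(np.float32)                   # batch of one window
        for _ in range(steps_per_frame):                      # optimize on the current window
            with tf.GradientTape() as tape:
                P = me_cnn(x, training=True)[0]
                loss = tf.sqrt(tf.reduce_sum(tf.square(G - P)))
            grads = tape.gradient(loss, me_cnn.trainable_variables)
            opt.apply_gradients(zip(grads, me_cnn.trainable_variables))
        P = me_cnn(x, training=False)[0].numpy()              # predicted (Px, Py)
        block = extract_block(frame, P)                       # normalized grayscale patch at P
        P = correct_position(P, block)                        # auxiliary offset correction
        D = update_training_set(D, P)                         # slide the training window
        positions.append(P)
    return positions
```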
In this embodiment, the motion estimation model of the target may also extract road information from the target's motion in the first few frames, locate on a map the city containing the same longitude and latitude, and predict the target's motion by matching the corresponding road conditions, making full use of the three-dimensional information of the road so that the target can be tracked accurately even when the road height changes sharply and the video contains partial zooming. The auxiliary position offset of the target could also be obtained by training a neural network, but the target and its surroundings would first need to be processed into image blocks of higher contrast before such a network could be trained.
The technical effects of the invention are further explained by combining simulation tests as follows:
example 5
The method for tracking the remote sensing video of the large-scene tiny target based on the motion estimation ME-CNN network is the same as in embodiments 1-4.
simulation conditions and contents:
the simulation platform of the invention is as follows: intel Xeon CPU E5-2630v3CPU with a main frequency of 2.40GHz, 64GB running memory, Ubuntu16.04 operating system and Keras and Python software platforms. A display card: GeForce GTX TITAN X/PCIe/SSE2 × 2.
The invention uses a remote sensing video of the Derna area of Libya shot by the Jilin-1 video satellite; a vehicle in the first 10 frames is taken as the target, a box is marked on the target in each frame, and the top-left vertex positions of the boxes form the training set DateSet. Tracking of the target video is simulated with the method of the invention and with the existing KCF-based target tracking method, respectively.
Simulation content and results:
Experiments are carried out under the above simulation conditions with the method of the invention and with the comparison method, the existing KCF-based target tracking method: both are used to track the vehicle target in the remote sensing video of the Derna area of Libya. The comparison between the target trajectory predicted by the ME-CNN network (green curve) and the accurate target trajectory (red curve) is shown in figure 3, and the results in Table 1 are obtained.
TABLE 1 Remote sensing video target tracking results for the Derna area of Libya
Method      Precision    IOU
KCF         63.21%       58.72%
ME-CNN      85.63%       76.51%
And (3) simulation result analysis:
in table 1, Precision represents the area overlapping rate of the target position and the tag position predicted by the ME-CNN network, IOU represents the percentage of the average euclidean distance between the center position of the bounding box and the center position of the tag being smaller than a given threshold, in this example, the given threshold is selected to be 5, KCF represents the comparison method, and ME-CNN represents the method of the present invention.
Comparison of the data in table 1 shows that the invention greatly improves the tracking accuracy: Precision is raised from 63.21% to 85.63%, and the IOU, the percentage of frames in which the average Euclidean distance between the bounding-box center and the label center is below the given threshold, is raised from 58.72% for the comparison KCF-based target tracking method to 76.51%.
Referring to fig. 3, the red curve is the standard target trajectory and the green curve is the tracking prediction of the method for the same target; the extremely small target in the large scene is shown in the green box. Comparing the two curves shows that they are highly consistent and essentially coincide, demonstrating the high tracking accuracy of the method.
In short, the large-scene minimal target remote sensing video tracking method based on a motion estimation ME-CNN network proposed by the invention improves tracking accuracy under conditions in which the shooting satellite keeps moving, the video exhibits overall translation and partial scaling, the video resolution is extremely low and the target is extremely small; it solves the problem of tracking extremely small targets from motion parameters without registration. The implementation steps are: obtain the initial training set D of the minimal target motion estimation network ME-CNN; construct the network ME-CNN for estimating the motion of the minimal target; compute the loss function of the network ME-CNN from the minimal target motion parameters; judge whether the current set is the initial training set; update the training labels in the loss function; obtain the initial model M1 for predicting the target motion position; correct the position result of the prediction model; update the training data set with the corrected target position, completing one frame of tracking; judge whether the current frame number is less than the total number of video frames; and obtain the remote sensing video target tracking result. The invention uses the deep learning network ME-CNN to predict the target motion position, avoids the large-scene image registration and the difficulty of extracting super-blurred target features found in existing tracking methods, reduces dependence on target features, noticeably improves the accuracy of target tracking in super-blurred video, and is also applicable to tracking in various other remote sensing videos.

Claims (3)

1. A large-scene minimal target tracking method based on a motion estimation ME-CNN network, characterized by comprising the following steps:
(1) obtaining the initial training set D of the minimal target motion estimation network ME-CNN: taking the first F frames of the original remote sensing data video A, continuously marking a bounding box for the same target in each frame, and arranging the top-left vertex coordinates of the bounding boxes in frame order to form the training set D;
(2) constructing the network ME-CNN for estimating the motion of the minimal target: comprising three parallel convolution modules that extract different features from the training data, followed in sequence by a concatenation layer, a fully connected layer and an output layer;
(3) calculating the loss function of the network ME-CNN with the minimal target motion parameters: calculating the motion trend of the target according to its motion law, using it as the training label corresponding to the target, and taking the Euclidean distance between the training label and the prediction of the ME-CNN network as the loss function for the optimization training of the ME-CNN network;
(4) judging whether the current training set is the initial training set: if it is not, executing step (5) to update the training labels in the loss function; otherwise, if it is the initial training set, executing step (6) and entering the cyclic training of the network;
(5) updating the training labels in the loss function: when the current training set is not the initial training set, recalculating the training labels of the loss function from the data of the current training set, using the same minimal-target-motion-parameter calculation as in step (3); the recalculated training labels take part in training the motion estimation network ME-CNN; proceeding to step (6);
(6) obtaining the initial model M1 for predicting the target motion position: inputting the training set D into the target motion estimation network ME-CNN and training the network with the current loss function to obtain the initial model M1 for predicting the target motion position;
(7) correcting the position result of the prediction model: calculating the auxiliary position offset of the target and correcting the position result predicted by the motion estimation network ME-CNN with the offset;
(7a) obtaining the target grayscale image block: obtaining the target position (Px, Py) of the next frame from the initial model M1 for predicting the target motion position, taking out the grayscale image block of the target from the next frame at the obtained position (Px, Py), and normalizing it to obtain the normalized target grayscale image block;
(7b) obtaining the target position offset: performing brightness grading on the normalized target grayscale image block, determining the position of the target in the image block with a vertical projection method, and computing the distance between the target center and the image-block center to obtain the target position offset;
(7c) obtaining the corrected target position: correcting the position of the target predicted by the motion estimation network ME-CNN with the obtained target position offset to obtain all corrected positions of the target;
(8) updating the training data set with the corrected target position to complete one frame of target tracking: appending the obtained top-left position of the target to the last row of the training set D and removing the first row of D in a single operation, obtaining a corrected and updated training set D, completing the training of one frame and obtaining the target position result of one frame;
(9) judging whether the current video frame number is less than the total number of video frames: if so, repeating steps (4) to (9) in a loop and carrying out tracking optimization training of the target until all video frames have been traversed; if the frame number equals the total number of video frames, ending the training and executing step (10);
(10) obtaining the remote sensing video target tracking result: the accumulated output is the remote sensing video target tracking result.
2. The large-scene minimal target tracking method based on a motion estimation ME-CNN network according to claim 1, characterized in that constructing the network ME-CNN for estimating the motion of the minimal target in step (2) comprises the following steps:
(2a) overall structure of the motion estimation network: the motion estimation network ME-CNN comprises three parallel convolution modules, followed in sequence by a concatenation layer, a fully connected layer and an output layer;
(2b) structure of the three parallel convolution modules: the parallel convolution modules are convolution module I, convolution module II and convolution module III, wherein
convolution module I comprises a locally connected LocallyConnected1D convolution layer with a stride of 2, which extracts the coordinate position information of the target;
convolution module II comprises a dilated (atrous) convolution with a stride of 1;
convolution module III comprises a one-dimensional convolution with a stride of 2;
convolution modules I, II and III obtain position features of the target at different scales, yielding three outputs; the outputs of the three convolution modules are then concatenated to give the fused convolution result, which is fed to the fully connected layer and the output layer to obtain the final prediction.
3. The large-scene minimal target tracking method based on a motion estimation ME-CNN network according to claim 1, characterized in that calculating the loss function of the network ME-CNN with the minimal target motion parameters in step 3 comprises the following steps:
(3a) acquiring the target displacement of training set D: taking the data of rows F, F-2 and F-4 of the training set D and subtracting the data of the first row of D from each, giving the target displacements between frame F, frame F-2 and frame F-4 and the first frame as S1, S2, S3 in sequence;
(3b) obtaining the motion trend of the target:
according to the motion law of the target, the motion trend (Gx, Gy) of the target along the x and y directions of the image coordinate system is computed from the obtained target displacements with the following formulas:
V1=(S1-S2)/2
V2=(S2-S3)/2
a=(V1-V2)/2
G=V1+a/2
where V1 is the target velocity between displacements S1 and S2, V2 is the target velocity between displacements S2 and S3, a is the motion acceleration, and G is the motion trend of the target;
(3c) constructing the loss function of the motion estimation network ME-CNN:
the motion trend of the target is computed according to its motion law and used as the training label corresponding to the target; the Euclidean distance between the computed target motion trend (Gx, Gy) and the predicted position (Px, Py) output by the estimation network ME-CNN is constructed as the loss function of the motion estimation network ME-CNN:
loss = sqrt((Gx - Px)^2 + (Gy - Py)^2)
where Gx is the target motion trend along the x direction in the image coordinate system, Gy is the target motion trend along the y direction in the image coordinate system, Px is the prediction of the motion estimation network along the x direction in the image coordinate system, and Py is the prediction of the motion estimation network along the y direction in the image coordinate system.
CN201910718847.6A 2019-08-05 2019-08-05 Large-scene minimum target tracking based on motion estimation ME-CNN network Active CN110517285B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910718847.6A CN110517285B (en) 2019-08-05 2019-08-05 Large-scene minimum target tracking based on motion estimation ME-CNN network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910718847.6A CN110517285B (en) 2019-08-05 2019-08-05 Large-scene minimum target tracking based on motion estimation ME-CNN network

Publications (2)

Publication Number Publication Date
CN110517285A CN110517285A (en) 2019-11-29
CN110517285B true CN110517285B (en) 2021-09-10

Family

ID=68624473

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910718847.6A Active CN110517285B (en) 2019-08-05 2019-08-05 Large-scene minimum target tracking based on motion estimation ME-CNN network

Country Status (1)

Country Link
CN (1) CN110517285B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111986233B (en) * 2020-08-20 2023-02-10 西安电子科技大学 Remote sensing video tracking method for extremely small targets in large scenes based on feature self-learning
CN114066937B (en) * 2021-11-06 2022-09-02 中国电子科技集团公司第五十四研究所 Multi-target tracking method for large-scale remote sensing image
CN115086718A (en) * 2022-07-19 2022-09-20 广州万协通信息技术有限公司 Video stream encryption method and device
CN118823685B (en) * 2024-09-18 2024-12-17 厦门众联世纪股份有限公司 Crowd positioning method, system and storage medium based on hybrid expert network

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10176388B1 (en) * 2016-11-14 2019-01-08 Zoox, Inc. Spatial and temporal information for semantic segmentation
CN108154522A (en) * 2016-12-05 2018-06-12 北京深鉴科技有限公司 Target tracking system
CN107886120A (en) * 2017-11-03 2018-04-06 北京清瑞维航技术发展有限公司 Method and apparatus for target detection tracking
CN109242884A (en) * 2018-08-14 2019-01-18 西安电子科技大学 Remote sensing video target tracking method based on JCFNet network
CN109376736A (en) * 2018-09-03 2019-02-22 浙江工商大学 A video small object detection method based on deep convolutional neural network
CN109636829A (en) * 2018-11-24 2019-04-16 华中科技大学 A kind of multi-object tracking method based on semantic information and scene information

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A target tracking method based on CNN-AE feature extraction; 殷鹤楠, 佟国香; 《软件导刊》 (Software Guide); 2018-06-30; Vol. 17, No. 6; pp. 22-26 *

Also Published As

Publication number Publication date
CN110517285A (en) 2019-11-29

Similar Documents

Publication Publication Date Title
CN110517285B (en) Large-scene minimum target tracking based on motion estimation ME-CNN network
Lu et al. A CNN-transformer hybrid model based on CSWin transformer for UAV image object detection
CN113516664B (en) Visual SLAM method based on semantic segmentation dynamic points
CN112215128B (en) FCOS-fused R-CNN urban road environment recognition method and device
CN109800689B (en) Target tracking method based on space-time feature fusion learning
CN108256431B (en) Hand position identification method and device
JP2022526513A (en) Video frame information labeling methods, appliances, equipment and computer programs
JP6650657B2 (en) Method and system for tracking moving objects in video using fingerprints
CN104463903B (en) A kind of pedestrian image real-time detection method based on goal behavior analysis
CN112464912B (en) Robot end face detection method based on YOLO-RGGNet
CN111881790A (en) Automatic extraction method and device for road crosswalk in high-precision map making
EP3070676A1 (en) A system and a method for estimation of motion
CN108198201A (en) A kind of multi-object tracking method, terminal device and storage medium
CN117949942A (en) Target tracking method and system based on fusion of radar data and video data
CN111199556A (en) Indoor pedestrian detection and tracking method based on camera
CN111161313A (en) Method and device for multi-target tracking in video stream
CN106780564A (en) A kind of anti-interference contour tracing method based on Model Prior
CN111414938B (en) A target detection method for air bubbles in plate heat exchangers
Yu et al. Shallow detail and semantic segmentation combined bilateral network model for lane detection
US20240046495A1 (en) Method for training depth recognition model, electronic device, and storage medium
CN113920254B (en) Monocular RGB (Red Green blue) -based indoor three-dimensional reconstruction method and system thereof
CN111986233B (en) Remote sensing video tracking method for extremely small targets in large scenes based on feature self-learning
CN113095164A (en) Lane line detection and positioning method based on reinforcement learning and mark point characterization
Nejadasl et al. Optical flow based vehicle tracking strengthened by statistical decisions
CN103559722B (en) Based on the sequence image amount of jitter computing method of gray scale linear modelling

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant