CN110473227B - Target tracking method, device, equipment and storage medium
- Publication number: CN110473227B (application CN201910776111.4A)
- Authority: CN (China)
- Prior art keywords: image, area image, candidate, frame image, boundary
- Legal status: Active
Classifications
- G06T7/20 — Physics; Computing; Image data processing or generation, in general; Image analysis; Analysis of motion
- G06T2207/10016 — Indexing scheme for image analysis or image enhancement; Image acquisition modality; Video; Image sequence
Abstract
The application provides a target tracking method, a target tracking device, target tracking equipment and a storage medium. The target tracking method comprises the following steps: acquiring a first boundary area image in a previous frame image in a video stream; acquiring a second boundary area image with the same size as the first boundary area image on the current frame image in the video stream; performing correlation filtering processing on the first boundary area image and the second boundary area image to obtain candidate coordinates; respectively determining, according to the candidate coordinates, a first candidate area image and a second candidate area image equal in size to the bounding box image in the first boundary area image and the second boundary area image; and determining, through the first candidate region image and the second candidate region image, whether the tracking target in the current frame image is the same as the tracking target in the previous frame image. The method and the device can realize real-time, accurate tracking of a target in a low frame rate video stream on the premise of lower hardware requirements.
Description
Technical Field
The present application relates to the field of target tracking technologies, and in particular, to a target tracking method, apparatus, device, and storage medium.
Background
At present, many service scenes need to track a specific target in a surveillance video or a mobile phone video. For example, face recognition in a surveillance scene sometimes needs to draw on a video sequence of the target rather than a single frame, and tracking a target in a video stream is also the basis of some higher-order visual problems, such as behavior analysis and behavior prediction.
Existing deep-learning-based tracking techniques fall mainly into two categories. One is Detection-Based Tracking (DBT), which relies mainly on a target detection technique and then performs target matching through a twin (Siamese) network. The other is Detection-Free Tracking (DFT), which is initialized in a specific manner and then locates the target in subsequent video frames; this method has the advantages of running in real time at low hardware cost, but it places requirements on the frame rate of the video and often performs poorly on low-frame-rate video, and it has an obvious shortcoming: it cannot directly determine whether the target has been lost, nor whether a new target has appeared.
However, real application scenes, such as surveillance scenes and mobile phone video, have two characteristics: the video frame rate is low (for example, the frame rate of a surveillance camera is generally about 15 fps, so directly using a DFT method performs poorly), and hardware resources are scarce (using a DBT method takes too long to meet real-time requirements). Existing DBT and DFT algorithms therefore cannot realize tracking detection of the target in such scenes.
Therefore, how to realize target tracking in low-frame-rate video under low hardware requirements has become a problem to be solved urgently.
Disclosure of Invention
In view of this, the target tracking method, apparatus, device and storage medium provided in the embodiments of the present application can implement real-time, accurate tracking of a target in a low frame rate video stream on the premise of a lower hardware requirement.
In a first aspect, an embodiment of the present application provides a target tracking method, where the method includes: acquiring a first boundary area image in a previous frame image in a video stream, wherein the first boundary area image comprises a bounding box image of the previous frame image, and the bounding box image of the previous frame image comprises an image of a tracking target; acquiring a second boundary area image with the same size as the first boundary area image on a current frame image in the video stream, wherein the position of the second boundary area image in the current frame image is the same as the position of the first boundary area image in the previous frame image; performing correlation filtering processing on the first boundary area image and the second boundary area image to obtain candidate coordinates, wherein the candidate coordinates are the position with the highest correlation between the first boundary area image and the second boundary area image; respectively determining, according to the candidate coordinates, a first candidate area image and a second candidate area image equal in size to the bounding box image in the first boundary area image and the second boundary area image; and determining, through the first candidate region image and the second candidate region image, whether the tracking target in the current frame image is the same as the tracking target in the previous frame image.
In the implementation process, the present application acquires a first boundary area image in a previous frame image in a video stream, and acquires, on the current frame image in the video stream, a second boundary area image equal in size to the first boundary area image. Only one second boundary area image needs to be cropped from the current frame image, rather than multiple images, which reduces hardware resource overhead. Further, after the candidate coordinates are determined, a first candidate area image and a second candidate area image equal in size to the bounding box image are determined in the first boundary area image and the second boundary area image according to the candidate coordinates, and whether the tracking target in the current frame image is the same target as the tracking target in the previous frame image is determined through the first candidate area image and the second candidate area image. Therefore, the target can be accurately tracked in real time in a low frame rate video stream, and whether the tracked target has disappeared can be accurately determined.
With reference to the first aspect, an embodiment of the present application provides a first possible implementation manner of the first aspect, where the method further includes: performing modification processing on the second candidate area image to determine a bounding box image of the current frame image, wherein the bounding box image of the current frame image is used for determining target tracking of a subsequent frame image of the current frame image.
In the implementation process, the second candidate area image is modified to determine the bounding box image of the current frame image, so that the position of the tracking target in the current frame image and the size of the bounding box can be accurately obtained, application scenes in which the target changes shape in the video stream data can be better handled, and the tracking accuracy of the tracking target is improved. Furthermore, the bounding box image of the current frame image is used for target tracking in the subsequent frame image, so that the method and the device can better cope with shape changes of the target in the video stream data.
With reference to the first possible implementation manner of the first aspect, an embodiment of the present application provides a second possible implementation manner of the first aspect, where performing modification processing on the second candidate region image to determine a bounding box image of the current frame image includes: performing bounding box correction processing on the second candidate area image to obtain correction parameters of the bounding box, wherein the correction parameters comprise the offset of the bounding box; and determining the bounding box image of the current frame image according to the correction parameters.
In the implementation process, the correction parameters of the bounding box, including the offset of the bounding box, are obtained by performing bounding box correction processing on the second candidate region image; the bounding box image of the current frame image is then determined according to the correction parameters and used for target tracking in the subsequent frame image, so that application scenes in which the target changes shape in the video stream data can be better handled.
With reference to the first aspect, an embodiment of the present application provides a third possible implementation manner of the first aspect, where performing correlation filtering processing on the first boundary area image and the second boundary area image to obtain candidate coordinates includes: inputting the first boundary area image and the second boundary area image respectively into the same preset convolutional neural network to obtain a first feature map and a second feature map, wherein the preset convolutional neural network is a pre-trained convolutional neural network; performing correlation filtering processing on the first feature map and the second feature map to obtain the position with the maximum correlation; and taking the position with the maximum correlation as the candidate coordinates.
In the implementation process, the first boundary area image and the second boundary area image are respectively input into the same preset convolutional neural network to obtain a first feature map and a second feature map, where the preset convolutional neural network is a pre-trained convolutional neural network; correlation filtering processing is performed on the first feature map and the second feature map to obtain the position with the maximum correlation, and this position is taken as the candidate coordinates. Taking the position with the maximum correlation between the first feature map and the second feature map as the candidate coordinates allows the candidate coordinates with the highest correlation between the two frames to be obtained more accurately, so that the tracking target can be tracked accurately in real time.
With reference to the first aspect, an embodiment of the present application provides a fourth possible implementation manner of the first aspect, where determining, through the first candidate region image and the second candidate region image, whether the tracking target in the current frame image and the tracking target in the previous frame image are the same target includes: stitching the first candidate area image and the second candidate area image to obtain a stitched candidate feature map; determining the confidence corresponding to the candidate feature map; and determining, according to the confidence, whether the tracking target in the current frame image is the same as the tracking target in the previous frame image.
In the implementation process, the first candidate region image and the second candidate region image are stitched to obtain a stitched candidate feature map; the confidence corresponding to the candidate feature map is determined; and whether the tracking target in the current frame image is the same as the tracking target in the previous frame image is determined according to the confidence, so that whether the tracking target is the same target can be quickly judged, and whether the tracking target has disappeared can be further determined.
In a second aspect, an embodiment of the present application provides a target tracking apparatus, including: a first acquisition unit, configured to acquire a first boundary area image in a previous frame image in a video stream, the first boundary area image including a bounding box image of the previous frame image, the bounding box image of the previous frame image including an image of a tracking target; a second acquisition unit, configured to acquire a second boundary area image equal in size to the first boundary area image on a current frame image in the video stream, where the position of the second boundary area image in the current frame image is the same as the position of the first boundary area image in the previous frame image; a first processing unit, configured to perform correlation filtering processing on the first boundary area image and the second boundary area image to obtain candidate coordinates, where the candidate coordinates are the position with the highest correlation between the first boundary area image and the second boundary area image; a second processing unit, configured to respectively determine, according to the candidate coordinates, a first candidate area image and a second candidate area image equal in size to the bounding box image in the first boundary area image and the second boundary area image; and a target tracking unit, configured to determine, through the first candidate area image and the second candidate area image, whether the tracking target in the current frame image is the same as the tracking target in the previous frame image.
In combination with the second aspect, an embodiment of the present application provides a first possible implementation manner of the second aspect, and the apparatus further includes: a position processing unit, configured to perform modification processing on the second candidate area image to determine a bounding box image of the current frame image, where the bounding box image of the current frame image is used for determining target tracking of a subsequent frame image of the current frame image.
With reference to the first possible implementation manner of the second aspect, an embodiment of the present application provides a second possible implementation manner of the second aspect, where the position processing unit is further configured to: perform bounding box correction processing on the second candidate area image to obtain correction parameters of the bounding box, where the correction parameters comprise the offset of the bounding box; and determine the bounding box image of the current frame image according to the correction parameters.
With reference to the second aspect, an embodiment of the present application provides a third possible implementation manner of the second aspect, where the first processing unit is further configured to: input the first boundary area image and the second boundary area image respectively into the same preset convolutional neural network to obtain a first feature map and a second feature map, where the preset convolutional neural network is a pre-trained convolutional neural network; perform correlation filtering processing on the first feature map and the second feature map to obtain the position with the maximum correlation; and take the position with the maximum correlation as the candidate coordinates.
With reference to the second aspect, an embodiment of the present application provides a fourth possible implementation manner of the second aspect, where the target tracking unit is further configured to: stitch the first candidate area image and the second candidate area image to obtain a stitched candidate feature map; determine the confidence corresponding to the candidate feature map; and determine, according to the confidence, whether the tracking target in the current frame image is the same as the tracking target in the previous frame image.
In a third aspect, an electronic device provided in an embodiment of the present application includes: a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the object tracking method according to any one of the first aspect when executing the computer program.
In a fourth aspect, a storage medium is provided in an embodiment of the present application, where the storage medium has instructions stored thereon, and when the instructions are executed on a computer, the instructions cause the computer to perform the target tracking method according to any one of the first aspect.
In a fifth aspect, an embodiment of the present application provides a computer program product, which when run on a computer, causes the computer to execute the target tracking method according to any one of the first aspect.
Additional features and advantages of the disclosure will be set forth in the description which follows, or in part may be learned by the practice of the above-described techniques of the disclosure, or may be learned by practice of the disclosure.
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required in the embodiments are briefly described below. It should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope; for those skilled in the art, other related drawings can be obtained from these drawings without inventive effort.
Fig. 1 is a flowchart of a target tracking method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a network implementing the target tracking method shown in FIG. 1;
FIG. 3 is a flow chart of another target tracking method provided by an embodiment of the present application;
fig. 4 is a schematic structural diagram of a target tracking apparatus according to an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The above-mentioned drawbacks in the prior art were discovered by the applicant only after practical and careful study. Therefore, the discovery process of the above-mentioned problems and the solutions proposed below by the embodiments of the present application should both be regarded as contributions made by the applicant to the present application.
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
Some embodiments of the present application will be described in detail below with reference to the accompanying drawings. The embodiments described below and the features of the embodiments can be combined with each other without conflict.
Referring to fig. 1, which is a flowchart of a target tracking method provided in an embodiment of the present application, it should be understood that the method shown in fig. 1 may be executed by a target tracking apparatus. The target tracking apparatus may correspond to the electronic device shown in fig. 5, and the electronic device may be any device capable of executing the method, such as a personal computer, a smart phone, or a server, but the embodiment of the present application is not limited thereto. The method specifically includes the following steps:
in step S101, a first boundary area image in a previous frame image in a video stream is acquired.
Optionally, the video stream is a low frame rate video stream; for example, a video stream with a frame rate lower than 48 frames per second, such as 15 or 5 frames per second, may be regarded as a low frame rate video stream.
Alternatively, the previous frame image may be the frame temporally immediately before the current frame image, or it may be an earlier frame. For example, if the current frame is the t-th frame in the video stream, the previous frame may be any one of the 1st to (t-1)-th frames in the video stream, for example the (t-1)-th frame or the (t-2)-th frame, where t is a positive integer.
Optionally, the first boundary area image comprises a bounding box image of the previous frame image, the bounding box image of the previous frame image comprising an image of a tracking target.
Optionally, the shape of the bounding box image is rectangular.
Alternatively, the tracking target may be a human face, or may be another object, for example a controlled item such as a knife or a steel pipe. This is not particularly limited herein.
Of course, in actual use, the tracking target may also be an animal, such as a dog or cat.
Optionally, the tracking target is pre-specified; for example, it may be input by a back-end administrator, or may be defined in real time. This is not particularly limited herein.
Alternatively, the area of the first border area image may be 9 times that of the border frame image, i.e., the width and height of the first border area image are 3 times that of the border frame image, respectively.
Of course, in actual use, the area of the first boundary area image may also be 1, 2 or 4 times that of the bounding box image, and so on. This is not particularly limited herein.
In the implementation process, by cropping a region 3 times the size of the bounding box image from the previous frame image, the situation in which the target is cropped incompletely due to its relatively large movement amplitude between frames in low frame rate video can be avoided, so that the tracking target in the obtained first boundary area image is complete and the target tracking accuracy is improved.
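As an illustration of this cropping step (not from the patent itself; the function name, argument layout and zero-padding behavior are our own assumptions), a minimal Python sketch of the 3x crop might look like this:

```python
import numpy as np

def crop_boundary_region(frame: np.ndarray, box: tuple, scale: int = 3) -> np.ndarray:
    """Crop a region centered on the bounding box whose width and height are
    `scale` times those of the box, zero-padding where the region extends past
    the frame edges. Assumes the box intersects the frame."""
    x1, y1, x2, y2 = box
    bw, bh = x2 - x1, y2 - y1
    cx, cy = x1 + bw // 2, y1 + bh // 2
    rw, rh = bw * scale, bh * scale
    # Region corners, possibly lying outside the frame.
    rx1, ry1 = cx - rw // 2, cy - rh // 2
    rx2, ry2 = rx1 + rw, ry1 + rh
    h, w = frame.shape[:2]
    region = np.zeros((rh, rw) + frame.shape[2:], dtype=frame.dtype)
    # Copy the part of the region that actually lies inside the frame.
    fx1, fy1 = max(rx1, 0), max(ry1, 0)
    fx2, fy2 = min(rx2, w), min(ry2, h)
    region[fy1 - ry1:fy2 - ry1, fx1 - rx1:fx2 - rx1] = frame[fy1:fy2, fx1:fx2]
    return region
```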
Step S102, obtaining a second boundary area image on the current frame image in the video stream, where the size of the second boundary area image is equal to that of the first boundary area image.
Optionally, the position of the second boundary area image in the current frame image is the same as the position of the first boundary area image in the previous frame image. That is, the region of the current frame image at the same position as the first boundary area image in the previous frame image is cropped out to obtain the second boundary area image.
In the implementation process, only this one region of the current frame image needs to be cropped, i.e. a single image, so hardware resource overhead can be reduced.
Step S103, performing correlation filtering processing on the first boundary area image and the second boundary area image to obtain candidate coordinates.
Optionally, the candidate coordinates are the position with the highest correlation between the first boundary area image and the second boundary area image.
As an embodiment, step S103 includes: inputting the first boundary area image and the second boundary area image respectively into the same preset convolutional neural network to obtain a first feature map and a second feature map, where the preset convolutional neural network is a pre-trained convolutional neural network; performing correlation filtering processing on the first feature map and the second feature map to obtain the position with the maximum correlation; and taking the position with the maximum correlation as the candidate coordinates. Specifically, the first boundary region image is input into a preset Convolutional Neural Network (CNN) to obtain a first feature map (feature map 1), the second boundary region image is input into the same CNN to obtain a second feature map (feature map 2), and the first feature map and the second feature map are subjected to correlation filtering to obtain a score map; the position with the maximum score value in the score map is taken as the position of the candidate region, that is, the candidate coordinates.
Optionally, in data format the first feature map and the second feature map are represented as multi-dimensional vectors/matrices.
For example, the first and second feature maps may be characterized by a 5 x 5 vector/matrix.
It is to be understood that the above examples are illustrative only and not limiting.
Optionally, the first feature map and the second feature map may be subjected to correlation filtering processing by a correlation filtering algorithm. For example, a Minimum Output Sum of Squared Error (MOSSE) filter or an Average of Synthetic Exact Filters (ASEF) correlation filter may be used to perform correlation filtering on the first feature map and the second feature map, so as to obtain the position where the correlation between the first feature map and the second feature map is greatest.
In the implementation process, the first boundary area image and the second boundary area image are respectively input into the same preset convolutional neural network to obtain a first feature map and a second feature map, where the preset convolutional neural network is a pre-trained convolutional neural network; correlation filtering processing is performed on the first feature map and the second feature map, and the position with the maximum correlation is taken as the candidate coordinates. In this way the candidate coordinates with the highest correlation between the two frames can be obtained more accurately, so that the tracking target can be tracked accurately in real time.
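For illustration only, the correlation filtering of the two feature maps can be sketched as a frequency-domain cross-correlation in the style of MOSSE; the function names are hypothetical, and this omits the learned filter template that a real MOSSE/ASEF tracker maintains:

```python
import numpy as np

def correlation_score_map(feat1: np.ndarray, feat2: np.ndarray) -> np.ndarray:
    """Circular cross-correlation of two equally sized (H x W x C) feature
    maps, computed per channel in the frequency domain and summed."""
    f1 = np.fft.fft2(feat1, axes=(0, 1))
    f2 = np.fft.fft2(feat2, axes=(0, 1))
    corr = np.fft.ifft2(np.conj(f1) * f2, axes=(0, 1)).real
    return corr.sum(axis=-1) if corr.ndim == 3 else corr

def candidate_coordinates(score_map: np.ndarray) -> tuple:
    """The position with the maximum score value, i.e. the candidate coordinates."""
    y, x = np.unravel_index(np.argmax(score_map), score_map.shape)
    return int(x), int(y)
```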
Step S104, respectively determining, according to the candidate coordinates, a first candidate area image and a second candidate area image equal in size to the bounding box image in the first boundary area image and the second boundary area image.
As an embodiment, the candidate area is determined with the candidate coordinates as its center and the size of the bounding box image as its size; the first feature map is cropped with the candidate region to obtain a first candidate region image (feature1); and the second feature map is cropped with the candidate region to obtain a second candidate region image (feature2).
Alternatively, the candidate region may also be referred to as a Region of Interest (ROI).
Optionally, cropping the first feature map with the candidate region to obtain a first candidate region image includes: cutting out, from the first feature map, a first image of the same size as the bounding box image centered on the candidate coordinates, and adjusting the size of the first image to a preset size to obtain the first candidate area image.
The preset size refers to a preset image size.
Alternatively, the preset size may be set according to user requirements, and is not particularly limited herein.
For example, assuming that the candidate coordinates are (x, y) and the size of the bounding box image is 3 × 3, a point with coordinates (x, y) is found on the first feature map, and then a region with a size of 3 × 3 is cut out with the point as the center, and the region is used as the first candidate region image.
The specific implementation process of the second candidate region image may refer to the specific implementation process of obtaining the first candidate region image, and is not described herein again.
Optionally, in data format the first candidate region image and the second candidate region image are represented as one-dimensional vectors/matrices.
For example, the first candidate region image and the second candidate region image may each be characterized as a 1 x 1024 vector/matrix.
It is to be understood that the above examples are illustrative only and not limiting.
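A minimal sketch of this step, assuming the candidate window lies inside the feature map (the nearest-neighbour resize stands in for the RoI pooling mentioned later; the function name is our own):

```python
import numpy as np

def crop_candidate_region(feature_map: np.ndarray, center: tuple,
                          size: tuple, out_size: tuple) -> np.ndarray:
    """Crop a window of the bounding-box size centered at the candidate
    coordinates, then resize it to the preset size by nearest-neighbour
    sampling."""
    cx, cy = center
    w, h = size
    x1, y1 = cx - w // 2, cy - h // 2
    patch = feature_map[y1:y1 + h, x1:x1 + w]
    # Nearest-neighbour resize to the preset (out_w, out_h) size.
    ys = (np.arange(out_size[1]) * patch.shape[0] / out_size[1]).astype(int)
    xs = (np.arange(out_size[0]) * patch.shape[1] / out_size[0]).astype(int)
    return patch[ys][:, xs]
```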
Step S105, determining, through the first candidate region image and the second candidate region image, whether the tracking target in the current frame image is the same target as the tracking target in the previous frame image.
As an embodiment, step S105 includes: stitching the first candidate area image and the second candidate area image to obtain a stitched candidate feature map; determining the confidence corresponding to the candidate feature map; and determining, according to the confidence, whether the tracking target in the current frame image is the same as the tracking target in the previous frame image.
Optionally, stitching refers to connecting the first candidate region image and the second candidate region image end to end, i.e. the second candidate area image is appended after the first candidate area image.
Optionally, determining the confidence corresponding to the candidate feature map includes: inputting the candidate feature map into a fully connected layer to obtain the confidence corresponding to the candidate feature map.
Optionally, the fully-connected layer is pre-trained, and the fully-connected layer is used for calculating the confidence of the input candidate feature map.
Of course, in practical use, the confidence corresponding to the candidate feature map may also be obtained based on global average pooling. This is not particularly limited herein.
Continuing with the above example, assuming that the first candidate region image and the second candidate region image are both 1 x 1024 vectors/matrices, stitching them means appending the first element of the second candidate region image after the last element of the first candidate region image to obtain a 1 x 2048 vector/matrix. For example, assuming that the first candidate region image is (a0, a1, a2 ... a1023) and the second candidate region image is (b0, b1, b2 ... b1023), the candidate feature map obtained after stitching is (a0, a1, a2 ... a1023, b0, b1, b2 ... b1023).
It is to be understood that the above description is intended to be illustrative, and not restrictive.
Optionally, determining whether the tracking target in the current frame image is the same as the tracking target in the previous frame image according to the confidence level includes: and comparing the confidence coefficient with a preset threshold value, and judging that the tracking target in the current frame image and the tracking target in the previous frame image are the same target when the confidence coefficient is greater than or equal to the preset threshold value. On the contrary, when the confidence is smaller than a preset threshold, it is determined that the tracking target in the current frame image and the tracking target in the previous frame image are not the same target.
Alternatively, the preset threshold may be set according to user requirements, and is not particularly limited herein.
For example, the preset threshold may be a decimal, such as 0.8 or 0.9, or may be a percentage, such as 90%.
It is to be understood that the above description is intended to be illustrative, and not restrictive.
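For illustration, the stitching and confidence check can be sketched as follows; reducing the fully connected layer to a single weight vector plus a sigmoid is our own simplification of the pre-trained layer described above, and the 0.8 default threshold is just the example value given earlier:

```python
import numpy as np

def same_target(feat_a: np.ndarray, feat_b: np.ndarray,
                fc_weight: np.ndarray, fc_bias: float,
                threshold: float = 0.8) -> bool:
    """Stitch the two 1 x 1024 candidate vectors end to end, score the
    1 x 2048 result with a fully connected layer plus sigmoid, and compare
    the confidence against the preset threshold."""
    stitched = np.concatenate([feat_a.ravel(), feat_b.ravel()])  # length 2048
    confidence = 1.0 / (1.0 + np.exp(-(stitched @ fc_weight + fc_bias)))
    return bool(confidence >= threshold)
```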
In the implementation process, the first candidate region image and the second candidate region image are stitched to obtain a stitched candidate feature map; the confidence corresponding to the candidate feature map is determined; and whether the tracking target in the current frame image is the same as the tracking target in the previous frame image is determined according to the confidence, so that whether the tracking target is the same target can be quickly judged, and whether the tracking target has disappeared can be further determined.
Of course, in actual use, the first candidate region image and the second candidate region image may also not be stitched; specifically, a first confidence and a second confidence corresponding to the first candidate region image and the second candidate region image, respectively, are determined, and whether the tracking target in the current frame image and the tracking target in the previous frame image are the same target is determined according to the first confidence and the second confidence.
In a possible embodiment, the method further comprises: and after judging that the tracking target in the current frame image is not the same as the tracking target in the previous frame image, sending feedback information.
Optionally, the feedback information includes descriptive information for characterizing that the tracking target has disappeared.
Of course, in practical use, the feedback information may also include descriptive information characterizing the specific reason for the disappearance of the tracking target. This is not particularly limited herein.
In the implementation process, by sending the feedback information, the corresponding feedback information can be given after the target disappears, so that the user can quickly know that the tracking target disappears.
In a possible embodiment, the method further comprises: performing modification processing on the second candidate area image to determine a bounding box image of the current frame image, where the bounding box image of the current frame image is used for determining target tracking of a subsequent frame image of the current frame image.
Alternatively, the size of the bounding box image of the current frame image determined after the modification may be equal to, or smaller than, the size of the bounding box image in the previous frame image.
Of course, in actual use, the size of the bounding box image of the current frame image may also be larger than that of the bounding box image in the previous frame image. This is not particularly limited herein.
Alternatively, the subsequent frame refers to a frame after the current frame; it may be the temporally next frame, or a later frame.
Continuing with the above example, assuming that the current frame is the t-th frame in the video stream, the subsequent frame may be any frame after the t-th frame in the video stream, for example the (t+1)-th frame or the (t+2)-th frame.
In the implementation process, the second candidate area image is modified to determine the bounding box image of the current frame image, so that the position of the tracking target in the current frame image and the size of the bounding box can be accurately obtained, application scenes in which the target changes shape in the video stream data can be better handled, and the tracking accuracy of the tracking target is improved. Further, the bounding box image of the current frame image is used for target tracking in the subsequent frame image.
Optionally, performing modification processing on the second candidate region image to determine a bounding box image of the current frame image includes: performing bounding box correction processing on the second candidate area image to obtain correction parameters of the bounding box, where the correction parameters comprise the offset of the bounding box; and determining the bounding box image of the current frame image according to the correction parameters.
Optionally, the second candidate region image may be subjected to bounding box correction processing based on a fully connected layer trained in advance, so as to obtain correction parameters of a bounding box. Specifically, the second candidate region image is input to a fully connected layer trained in advance, and the correction parameters of the bounding box are output.
For example, assuming that the bounding box is rectangular, the coordinates of the top-left corner of the bounding box in the previous frame image are (x1, y1), and the coordinates of the bottom-right corner are (x2, y2), the second candidate region image is input into the pre-trained fully connected layer and the output correction parameters of the bounding box are: d_x1, d_y1, d_x2, d_y2. The second candidate region image is corrected according to these offsets, that is, the position of the bounding box of the second candidate region image in the current frame is corrected, so that after correction the coordinates of the top-left corner of the bounding box image of the current frame image are (x1 + d_x1, y1 + d_y1) and the coordinates of the bottom-right corner are (x2 + d_x2, y2 + d_y2).
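The correction itself is a simple per-corner shift; a sketch under the corner-offset convention above (the offsets themselves would come from the pre-trained fully connected layer):

```python
def correct_bounding_box(box: tuple, offsets: tuple) -> tuple:
    """Apply the predicted corner offsets (d_x1, d_y1, d_x2, d_y2) to the
    previous-frame bounding box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    d_x1, d_y1, d_x2, d_y2 = offsets
    return (x1 + d_x1, y1 + d_y1, x2 + d_x2, y2 + d_y2)
```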
In the implementation process, the correction parameters of the bounding box, including the offset of the bounding box, are obtained by performing bounding box correction processing on the second candidate region image; the bounding box image of the current frame image determined from these parameters is then used for target tracking in the subsequent frame image, so that application scenes in which the target changes shape in the video stream data can be better handled.
For example, at the current time, the current frame is an analysis frame of the tracking target, and at the time next to the current time, the current frame is regarded as a previous frame, and the subsequent frame is regarded as a new current frame, so that it is determined which frame is the previous frame, which frame is the current frame, and which frame is the subsequent frame according to the time node. The bounding box in each frame may change over time.
In the implementation process, by obtaining the corrected bounding box image, the bounding box image in the current frame can serve as the previous frame's bounding box in the target tracking of the next frame during continuous tracking, from which an accurate first boundary area image can be obtained. Target tracking of all frames in the video stream is thus completed by cyclically executing steps S101 to S105. In the tracking process, target tracking is completed using the bounding box image of the previous frame, and when tracking is completed, the position of the bounding box image in the current frame is corrected, so that target tracking is more accurate and tracking real-time performance is improved.
In one embodiment, a plurality of fixed-size anchor boxes (anchors) are generated around the candidate region, the confidence corresponding to each anchor is determined, and the anchor with the highest confidence is taken as the position of the bounding box in the current frame image.
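A sketch of this anchor-based variant, with a hypothetical score_fn standing in for the confidence computation described above and a simple shift grid as our own assumption about how the anchors are laid out:

```python
import numpy as np

def best_anchor(center: tuple, anchor_size: tuple, shifts, score_fn) -> tuple:
    """Generate fixed-size anchors shifted around the candidate coordinates,
    score each one, and return the highest-confidence box (x1, y1, x2, y2)."""
    cx, cy = center
    w, h = anchor_size
    anchors = [(cx + dx - w // 2, cy + dy - h // 2,
                cx + dx + w // 2, cy + dy + h // 2)
               for dx in shifts for dy in shifts]
    scores = [score_fn(a) for a in anchors]
    return anchors[int(np.argmax(scores))]
```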
Alternatively, the size of the anchors may be set according to user requirements or the size of the tracking target, and is not particularly limited herein.
The target tracking method in the embodiment of the present application has been described above with reference to fig. 1. Below, taking a human face as the tracking target by way of example and not limitation, the target tracking method in the embodiment of the present application is described in detail with reference to fig. 2 and 3. The method shown in fig. 3 comprises:
step S201, periodically detecting a human face.
Alternatively, the tracking target is determined by performing face detection on the previous frame image.
For example, a DPM (Deformable Parts Model) based target detection algorithm or a convolutional neural network based target detection algorithm is used to perform face detection and determine a tracking target.
It is to be understood that the above description is intended to be illustrative, and not restrictive.
Optionally, after the tracking target is determined, the first boundary area image including the tracking target is cropped out. The specific implementation process may refer to step S101, and is not limited herein.
Step S202, extracting features and updating the correlation filter template.
Optionally, the first feature map and the second feature map are extracted from the detected face by using CNN.
Optionally, if the first feature map and the second feature map are obtained during the first face detection of the low frame rate video stream, a template of a Correlation Filter (CF) is initialized by using the first feature map and the second feature map.
Optionally, when the tracking target in the (t+1)-th frame image is judged to be the same target as the tracking target in the t-th frame image, the returned first feature map and second feature map are used to update the correlation filter template.
Optionally, the first feature map and the second feature map of the input first boundary region image and the second boundary region image are extracted by CNN, respectively.
For example, as shown in fig. 2, the t-th frame target area (i.e., the above first boundary area image) and the (t+1)-th frame target area (i.e., the above second boundary area image) are respectively input into the same CNN to obtain the first feature map and the second feature map.
In step S203, a candidate region is obtained.
For example, as shown in fig. 2, the first feature map obtained from the target area of the t-th frame and the second feature map obtained from the target area of the (t+1)-th frame are input into the CF for correlation filtering processing to obtain a score map (Scores Map); the position where the score value in the score map is largest is taken as the position of the candidate area, that is, the candidate coordinates, and the candidate area is determined by taking the candidate coordinates as its center and the size of the bounding box image in the previous frame image as its size.
That is, in subsequent video frames, features extracted by the CNN are combined with the correlation filtering algorithm to relocate the tracking target, thereby obtaining the candidate area of the tracking target.
In step S204, the bounding box image is corrected.
Optionally, the CNN is used to correct the bounding box image according to the correction parameter obtained by the correlation filtering algorithm. The specific implementation process of the correction processing on the second candidate region image may refer to the above specific implementation process, and is not specifically limited herein.
Continuing with the above example, as shown in fig. 2, a first image of the same size as the bounding box image is cut out from the first feature map centered on the candidate coordinates, and its size is adjusted to the preset size to obtain the first candidate area image; a second image of the same size as the bounding box image is cut out from the second feature map centered on the candidate coordinates, and its size is adjusted (RoIPooling, pooling of the region of interest) to the preset size to obtain the second candidate area image. The first candidate area image is followed by a fully connected layer (FC) for bounding box correction, to determine the bounding box image of the (t+1)-th frame image. The first candidate area image and the second candidate area image are stitched and passed through another FC to obtain a confidence, and whether the tracking target in the (t+1)-th frame image and the tracking target in the t-th frame image are the same target is judged according to the confidence.
Step S205, judging whether the target is the tracking target.
Optionally, the confidence level is compared with a preset threshold, and when the confidence level is greater than or equal to the preset threshold, it is determined that the tracking target in the current frame image and the tracking target in the previous frame image are the same target. On the contrary, when the confidence is smaller than a preset threshold, it is determined that the tracking target in the current frame image and the tracking target in the previous frame image are not the same target.
For the specific implementation process, reference may be made to the above description, which is not repeated herein.
Optionally, if the judgment result is that it is the tracking target, that is, the tracking target in the current frame image and the tracking target in the previous frame image are the same target, step S202 is executed again: the features of the target in the second candidate area image are extracted using the corrected bounding box to update the correlation filter template; updating the template enables the correlation filtering algorithm to better cope with morphological changes of the target in the video stream data. If the judgment result is that it is not the tracking target, the target has disappeared.
Steps S203 to S205 are executed repeatedly; when the tracking target disappears from the video, the process jumps back to step S202.
Step S206, judging whether the video is finished.
Alternatively, when the video is not ended, steps S201 to S205 are repeatedly performed until the video is ended.
Alternatively, the end of the video may mean that there is no more monitoring, or that the monitoring is complete. This is not particularly limited herein.
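Putting the steps of fig. 3 together, the overall loop can be sketched as follows; detect_face and process_pair are hypothetical stand-ins for the detection step (S201) and the tracking steps (S202 to S205), and falling back to detection on loss is a simplification of the jump described above:

```python
def track(video_frames, detect_face, process_pair):
    """Skeleton of the loop in fig. 3: detect periodically, then track frame
    to frame until the target is lost or the video ends (S206)."""
    box = None
    for prev, cur in zip(video_frames, video_frames[1:]):
        if box is None:
            box = detect_face(prev)  # S201: periodic face detection
            if box is None:
                continue             # nothing to track yet
        # S202-S204: extract features, locate the candidate, correct the box.
        box, is_same = process_pair(prev, cur, box)
        if not is_same:              # S205: the target has disappeared
            box = None               # fall back to detection
```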
In the implementation process, the network model used by the target tracking method provided in the embodiment of the present application is small and its hardware requirements are low, so the method is easily ported to electronic devices with limited resources.
According to the target tracking method provided by the embodiment of the present application, a first boundary area image in a previous frame image in a video stream is acquired, where the first boundary area image comprises a bounding box image of the previous frame image, and the bounding box image of the previous frame image comprises an image of a tracking target; a second boundary area image with the same size as the first boundary area image is acquired on the current frame image in the video stream, where the position of the second boundary area image in the current frame image is the same as the position of the first boundary area image in the previous frame image; correlation filtering processing is performed on the first boundary area image and the second boundary area image to obtain candidate coordinates, where the candidate coordinates are the position with the highest correlation between the first boundary area image and the second boundary area image; a first candidate area image and a second candidate area image equal in size to the bounding box image are respectively determined in the first boundary area image and the second boundary area image according to the candidate coordinates; and whether the tracking target in the current frame image is the same as the tracking target in the previous frame image is determined through the first candidate region image and the second candidate region image. Therefore, the method and the device can achieve the technical effect of accurately tracking the target in real time in a low frame rate video stream while reducing hardware resource overhead.
Referring to fig. 4, fig. 4 shows a target tracking apparatus using the target tracking method shown in fig. 1, and it should be understood that the apparatus 300 corresponds to the method embodiments shown in fig. 1 to 3, and can perform the steps related to the method embodiments, and the specific functions of the apparatus 300 can be referred to the description above, and detailed descriptions are omitted here as appropriate to avoid repetition. The device 300 includes at least one software functional module that can be stored in a memory in the form of software or firmware (firmware) or solidified in an Operating System (OS) of the device 300. Specifically, the apparatus 300 includes:
a first acquiring unit 310, configured to acquire a first boundary area image in a previous frame image in a video stream, where the first boundary area image comprises a bounding box image of the previous frame image, and the bounding box image of the previous frame image comprises an image of a tracking target;
a second acquiring unit 320, configured to acquire a second boundary area image equal in size to the first boundary area image on a current frame image in the video stream, where the position of the second boundary area image in the current frame image is the same as the position of the first boundary area image in the previous frame image;
a first processing unit 330, configured to perform correlation filtering processing on the first boundary area image and the second boundary area image to obtain candidate coordinates, where the candidate coordinates are the position with the highest correlation between the first boundary area image and the second boundary area image;
a second processing unit 340, configured to respectively determine, according to the candidate coordinates, a first candidate region image and a second candidate region image equal in size to the bounding box image in the first boundary region image and the second boundary region image;
a target tracking unit 350, configured to determine, through the first candidate region image and the second candidate region image, whether the tracking target in the current frame image is the same target as the tracking target in the previous frame image.
In a possible embodiment, the apparatus 300 further comprises: a position processing unit, configured to perform modification processing on the second candidate area image to determine a bounding box image of the current frame image, where the bounding box image of the current frame image is used for determining target tracking of a subsequent frame image of the current frame image.
Optionally, the position processing unit is further configured to: perform bounding box correction processing on the second candidate area image to obtain correction parameters of the bounding box, where the correction parameters comprise the offset of the bounding box; and determine the bounding box image of the current frame image according to the correction parameters.
Optionally, the first processing unit 330 is further configured to: input the first boundary area image and the second boundary area image respectively into the same preset convolutional neural network to obtain a first feature map and a second feature map, where the preset convolutional neural network is a pre-trained convolutional neural network; perform correlation filtering processing on the first feature map and the second feature map to obtain the position with the maximum correlation; and take the position with the maximum correlation as the candidate coordinates.
Optionally, the target tracking unit 350 is further configured to: stitch the first candidate area image and the second candidate area image to obtain a stitched candidate feature map; determine the confidence corresponding to the candidate feature map; and determine, according to the confidence, whether the tracking target in the current frame image is the same as the tracking target in the previous frame image.
As shown in fig. 5, fig. 5 is a block diagram of an electronic device 400 in an embodiment of the present application. The electronic device 400 may include a processor 410, a communication interface 420, a memory 430, and at least one communication bus 440, where the communication bus 440 is used to enable direct connection communication between these components. In this embodiment, the communication interface 420 of the device is used for signaling or data communication with other node devices. The processor 410 may be an integrated circuit chip having signal processing capabilities.
The processor 410 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. The various methods, steps, and logic blocks disclosed in the embodiments of the present application may be implemented or performed. A general-purpose processor may be a microprocessor, or the processor 410 may be any conventional processor or the like.
The memory 430 may be, but is not limited to, a Random Access Memory (RAM), a Read Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like. The memory 430 stores computer readable instructions, and when the computer readable instructions are executed by the processor 410, the electronic device 400 may perform the steps involved in the method embodiments of fig. 1 to 3.
The elements of the memory 430 and the processor 410 are electrically connected to each other, directly or indirectly, to enable data transfer or interaction; for example, these components may be electrically coupled to each other via one or more communication buses 440. The processor 410 is used to execute executable modules stored in the memory 430, such as the software functional modules or computer programs included in the apparatus 300. Moreover, the apparatus 300 is configured to perform the following method: acquiring a first boundary area image in a previous frame image in a video stream, where the first boundary area image comprises a bounding box image of the previous frame image, and the bounding box image of the previous frame image comprises an image of a tracking target; acquiring a second boundary area image with the same size as the first boundary area image on the current frame image in the video stream, where the position of the second boundary area image in the current frame image is the same as the position of the first boundary area image in the previous frame image; performing correlation filtering processing on the first boundary area image and the second boundary area image to obtain candidate coordinates, where the candidate coordinates are the position with the highest correlation between the first boundary area image and the second boundary area image; respectively determining, according to the candidate coordinates, a first candidate area image and a second candidate area image equal in size to the bounding box image in the first boundary area image and the second boundary area image; and determining, through the first candidate region image and the second candidate region image, whether the tracking target in the current frame image is the same as the tracking target in the previous frame image.
Alternatively, the electronic device 400 may be a personal computer, a smartphone, a server, or the like.
It will be appreciated that the configuration shown in fig. 5 is merely illustrative, and the electronic device 400 may include more or fewer components than those shown in fig. 5, or have a different configuration. The components shown in fig. 5 may be implemented in hardware, software, or a combination thereof.
An embodiment of the present application further provides a storage medium storing instructions which, when executed by a processor, implement the method in the method embodiments; to avoid repetition, details are not repeated here.
The present application also provides a computer program product which, when run on a computer, causes the computer to perform the method of the method embodiments.
From the above description of the embodiments, it will be clear to those skilled in the art that the present application can be implemented by hardware, or by software plus a necessary general hardware platform. Based on such understanding, the technical solution of the present application can be embodied in the form of a software product stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, etc.) and including several instructions that enable a computer device (which may be a personal computer, a server, a network device, etc.) to execute the method of the various implementation scenarios of the present application.
The above description covers only preferred embodiments of the present application and is not intended to limit it; those skilled in the art may make various modifications and changes, and any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall fall within its protection scope. It should be noted that like reference numbers and letters refer to like items in the figures; once an item is defined in one figure, it need not be further defined or explained in subsequent figures.
Claims (10)
1. A method of target tracking, the method comprising:
acquiring a first boundary area image in a previous frame image in a video stream, wherein the first boundary area image comprises a boundary frame image of the previous frame image, and the boundary frame image of the previous frame image comprises an image of a tracking target;
acquiring a second boundary area image with the same size as the first boundary area image on a current frame image in the video stream, wherein the position of the second boundary area image in the current frame image is the same as the position of the first boundary area image in the previous frame image;
performing correlation filtering on the first boundary area image and the second boundary area image to obtain candidate coordinates, wherein the candidate coordinates are the position with the highest correlation between the first boundary area image and the second boundary area image;
respectively determining, in the first boundary area image and the second boundary area image according to the candidate coordinates, a first candidate area image and a second candidate area image equal in size to the boundary frame image;
and determining, through the first candidate area image and the second candidate area image, whether the tracking target in the current frame image is the same as the tracking target in the previous frame image.
2. The method of claim 1, further comprising:
and performing modification processing on the second candidate area image to determine a boundary frame image of the current frame image, wherein the boundary frame image of the current frame image is used for target tracking in a subsequent frame image of the current frame image.
3. The method according to claim 2, wherein said performing modification processing on the second candidate area image to determine the boundary frame image of the current frame image comprises:
performing boundary frame correction processing on the second candidate area image to obtain correction parameters of the boundary frame, wherein the correction parameters comprise an offset of the boundary frame;
and determining the boundary frame image of the current frame image according to the correction parameters.
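As a hedged sketch of the correction in claim 3, suppose the correction parameters are the familiar (dx, dy, dw, dh) box-regression offsets; this is an assumption, since the claim only states that the parameters include an offset of the boundary frame:

```python
import math

def refine_box(box, offsets):
    # box = (x, y, w, h); offsets = (dx, dy, dw, dh), assumed to be predicted
    # by a small regression head run on the second candidate area image.
    x, y, w, h = box
    dx, dy, dw, dh = offsets
    # Standard box-regression parameterisation: shift relative to box size,
    # scale width and height exponentially.
    return (x + dx * w, y + dy * h, w * math.exp(dw), h * math.exp(dh))
```

The exponential terms keep width and height positive; a purely additive offset would serve equally well if that is what the regression head predicts.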
4. The method according to claim 1, wherein the performing correlation filtering on the first boundary area image and the second boundary area image to obtain candidate coordinates comprises:
inputting the first boundary area image and the second boundary area image respectively into the same preset convolutional neural network to obtain a first feature map and a second feature map, wherein the preset convolutional neural network is a pre-trained convolutional neural network;
performing correlation filtering on the first feature map and the second feature map to obtain the position with the maximum correlation;
and taking the position with the maximum correlation as the candidate coordinates.
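Isolating the filtering step of claim 4 on already-extracted feature maps, a simplified single-channel NumPy/SciPy sketch (the patent's feature maps come from the shared pre-trained CNN, and multi-channel maps would be correlated per channel and summed):

```python
import numpy as np
from scipy.signal import correlate2d

def candidate_coordinates(feat1, feat2):
    # feat1: feature map of the first boundary area image (the template);
    # feat2: feature map of the second boundary area image (the search area).
    response = correlate2d(feat2, feat1, mode="valid")
    # The position with the maximum correlation is the candidate coordinate.
    return np.unravel_index(response.argmax(), response.shape)
```

Here `candidate_coordinates(f1, f2)` returns the row and column of the strongest response, which claim 4 takes as the candidate coordinates.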
5. The method according to claim 1, wherein the determining whether the tracking target in the current frame image is the same target as the tracking target in the previous frame image through the first candidate area image and the second candidate area image comprises:
concatenating the first candidate area image and the second candidate area image to obtain a concatenated candidate feature map;
determining a confidence corresponding to the candidate feature map;
and determining, according to the confidence, whether the tracking target in the current frame image is the same as the tracking target in the previous frame image.
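One plausible shape for the verification step of claim 5 is to concatenate the two candidate area images channel-wise and map them to a single confidence with a small CNN; the layer sizes below are illustrative assumptions, not taken from the patent:

```python
import torch
import torch.nn as nn

class Verifier(nn.Module):
    # Illustrative verification head: concatenates two 3-channel candidate
    # area images channel-wise and maps them to a confidence in [0, 1].
    def __init__(self, in_channels=6):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, 1), nn.Sigmoid(),
        )

    def forward(self, first_cand, second_cand):
        stitched = torch.cat([first_cand, second_cand], dim=1)  # concatenated map
        return self.net(stitched)
```

Thresholding the returned confidence (for example at 0.5) then yields the same-target decision of claim 5.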
6. A target tracking device, comprising:
a first acquisition unit configured to acquire a first boundary area image in a previous frame image in a video stream, the first boundary area image comprising a boundary frame image of the previous frame image, and the boundary frame image of the previous frame image comprising an image of a tracking target;
a second acquisition unit configured to acquire a second boundary area image equal in size to the first boundary area image on a current frame image in the video stream, wherein the position of the second boundary area image in the current frame image is the same as the position of the first boundary area image in the previous frame image;
a first processing unit configured to perform correlation filtering on the first boundary area image and the second boundary area image to obtain candidate coordinates, wherein the candidate coordinates are the position with the highest correlation between the first boundary area image and the second boundary area image;
a second processing unit configured to respectively determine, in the first boundary area image and the second boundary area image according to the candidate coordinates, a first candidate area image and a second candidate area image equal in size to the boundary frame image;
and a target tracking unit configured to determine, through the first candidate area image and the second candidate area image, whether the tracking target in the current frame image is the same as the tracking target in the previous frame image.
7. The apparatus of claim 6, further comprising:
and a position processing unit configured to perform modification processing on the second candidate area image to determine a boundary frame image of the current frame image, wherein the boundary frame image of the current frame image is used for target tracking in a subsequent frame image of the current frame image.
8. The apparatus of claim 6, wherein the first processing unit is further configured to:
inputting the first boundary area image and the second boundary area image respectively into the same preset convolutional neural network to obtain a first feature map and a second feature map, wherein the preset convolutional neural network is a pre-trained convolutional neural network;
performing correlation filtering on the first feature map and the second feature map to obtain the position with the maximum correlation;
and taking the position with the maximum correlation as the candidate coordinates.
9. An electronic device, comprising: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the target tracking method according to any one of claims 1 to 5 when executing the computer program.
10. A storage medium storing instructions that, when executed on a computer, cause the computer to perform the target tracking method of any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910776111.4A CN110473227B (en) | 2019-08-21 | 2019-08-21 | Target tracking method, device, equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110473227A CN110473227A (en) | 2019-11-19 |
CN110473227B true CN110473227B (en) | 2022-03-04 |
Family
ID=68512684
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910776111.4A Active CN110473227B (en) | 2019-08-21 | 2019-08-21 | Target tracking method, device, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110473227B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111652541B (en) * | 2020-05-07 | 2022-11-01 | 美的集团股份有限公司 | Industrial production monitoring method, system and computer readable storage medium |
CN112215205B (en) * | 2020-11-06 | 2022-10-18 | 腾讯科技(深圳)有限公司 | Target identification method and device, computer equipment and storage medium |
CN112529943B (en) * | 2020-12-22 | 2024-01-16 | 深圳市优必选科技股份有限公司 | Object detection method, object detection device and intelligent equipment |
CN112819694B (en) * | 2021-01-18 | 2024-06-21 | 中国工商银行股份有限公司 | Video image stitching method and device |
CN113808162B (en) * | 2021-08-26 | 2024-01-23 | 中国人民解放军军事科学院军事医学研究院 | Target tracking method, device, electronic equipment and storage medium |
2019-08-21: CN application CN201910776111.4A granted as patent CN110473227B (en), status Active
Patent Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120134541A1 (en) * | 2010-11-29 | 2012-05-31 | Canon Kabushiki Kaisha | Object tracking device capable of detecting intruding object, method of tracking object, and storage medium |
CN103325126A (en) * | 2013-07-09 | 2013-09-25 | 中国石油大学(华东) | Video target tracking method under circumstance of scale change and shielding |
CN103793926A (en) * | 2014-02-27 | 2014-05-14 | 西安电子科技大学 | Target tracking method based on sample reselecting |
CN106991396A (en) * | 2017-04-01 | 2017-07-28 | 南京云创大数据科技股份有限公司 | A kind of target relay track algorithm based on wisdom street lamp companion |
CN107748873A (en) * | 2017-10-31 | 2018-03-02 | 河北工业大学 | A kind of multimodal method for tracking target for merging background information |
CN108280845A (en) * | 2017-12-26 | 2018-07-13 | 浙江工业大学 | A kind of dimension self-adaption method for tracking target for complex background |
CN108596946A (en) * | 2018-03-21 | 2018-09-28 | 中国航空工业集团公司洛阳电光设备研究所 | A kind of moving target real-time detection method and system |
CN109255304A (en) * | 2018-08-17 | 2019-01-22 | 西安电子科技大学 | Method for tracking target based on distribution field feature |
CN109344789A (en) * | 2018-10-16 | 2019-02-15 | 北京旷视科技有限公司 | Face tracking method and device |
CN110009663A (en) * | 2019-04-10 | 2019-07-12 | 苏州大学 | A target tracking method, apparatus, device and computer-readable storage medium |
Non-Patent Citations (2)
Title |
---|
An online learning target tracking method based on extreme learning machine; Liyan Xie et al.; 2016 12th World Congress on Intelligent Control and Automation (WCICA); 2016-09-29; pp. 2080-2085 *
Kernel correlation target tracking algorithm based on candidate region detection; Hao Shaohua; Video Engineering (电视技术); 2018-12-31; pp. 13-24 *
Also Published As
Publication number | Publication date |
---|---|
CN110473227A (en) | 2019-11-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110473227B (en) | Target tracking method, device, equipment and storage medium | |
CN111160302B (en) | Obstacle information identification method and device based on automatic driving environment | |
US9990546B2 (en) | Method and apparatus for determining target region in video frame for target acquisition | |
CN111046752B (en) | Indoor positioning method, computer equipment and storage medium | |
CN109960742B (en) | Local information searching method and device | |
JP4348353B2 (en) | Pattern recognition apparatus, pattern recognition method, and recording medium storing program realizing the method | |
US8995714B2 (en) | Information creation device for estimating object position and information creation method and program for estimating object position | |
KR20180105876A (en) | Method for tracking image in real time considering both color and shape at the same time and apparatus therefor | |
CN111382637B (en) | Pedestrian detection tracking method, device, terminal equipment and medium | |
Bedruz et al. | Real-time vehicle detection and tracking using a mean-shift based blob analysis and tracking approach | |
CN111932545A (en) | Image processing method, target counting method and related device thereof | |
CN113191180A (en) | Target tracking method and device, electronic equipment and storage medium | |
CN110766077A (en) | Method, device and equipment for screening sketch in evidence chain image | |
US11354923B2 (en) | Human body recognition method and apparatus, and storage medium | |
CN115661131B (en) | Image identification method and device, electronic equipment and storage medium | |
CN111507287A (en) | Method and system for extracting road zebra crossing corner points in aerial image | |
CN112669351B (en) | A fast movement detection method and device | |
CN114267029A (en) | Lane line detection method, device, equipment and storage medium | |
CN114219830A (en) | A target tracking method, terminal and computer-readable storage medium | |
CN117830354A (en) | Track acquisition method, track acquisition device, computer equipment and storage medium | |
CN111445411A (en) | Image denoising method and device, computer equipment and storage medium | |
CN113673362B (en) | Method, device, computer equipment and storage medium for determining the motion state of an object | |
CN112819859B (en) | Multi-target tracking method and device applied to intelligent security | |
CN115294358A (en) | Feature point extraction method and device, computer equipment and readable storage medium | |
CN110781710B (en) | Target object clustering method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |