
CN114219828A - Target association method and device based on video and readable storage medium

Info

Publication number
CN114219828A
CN114219828A
Authority
CN
China
Prior art keywords
target
tracked
matched
information
key point
Prior art date
Legal status
Pending
Application number
CN202111296373.4A
Other languages
Chinese (zh)
Inventor
周经纬
潘华东
殷俊
李中振
巩海军
Current Assignee
Zhejiang Dahua Technology Co Ltd
Original Assignee
Zhejiang Dahua Technology Co Ltd
Priority date: 2021-11-03
Filing date: 2021-11-03
Publication date: 2022-03-22
Application filed by Zhejiang Dahua Technology Co Ltd
Priority to CN202111296373.4A
Publication of CN114219828A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a video-based target association method and device and a readable storage medium. The video-based target association method comprises the following steps: acquiring a current frame image of a monitoring video; performing target detection on a target to be tracked in the current frame image to obtain first target detection information of the target to be tracked; determining each target to be matched in a historical frame image based on position information of the target to be tracked in the current frame image, wherein the targets to be matched comprise the targets in a candidate matching area of the historical frame image, the candidate matching area is determined based on the position information of the target to be tracked, and the historical frame image is an image before the current frame image in the monitoring video; and determining the target to be matched associated with the target to be tracked based on the similarity between the second target detection information of each target to be matched and the first target detection information. In this manner, target association can be performed intelligently and in real time.

Description

Target association method and device based on video and readable storage medium
Technical Field
The application relates to the technical field of computer vision, in particular to a video-based target association method and device and a readable storage medium.
Background
Tracking systems are now applied in an increasingly wide range of scenarios: battlefield reconnaissance and target tracking in the military field, crime early warning and suspect positioning in the security field, and violation detection and automatic driving in the urban traffic field. They have therefore become a research hotspot. However, current tracking systems only perform tracking and positioning; they cannot intelligently analyze the target in respects such as target category analysis or behavior analysis.
Disclosure of Invention
The application provides a video-based target association method, a video-based target association device and a readable storage medium, which can intelligently associate targets in real time.
In order to solve the above technical problem, a technical solution adopted by the present application is to provide a video-based target association method comprising the following steps: acquiring a current frame image of a monitoring video; performing target detection on a target to be tracked in the current frame image to obtain first target detection information of the target to be tracked; determining each target to be matched in a historical frame image based on position information of the target to be tracked in the current frame image, wherein the targets to be matched comprise the targets in a candidate matching area of the historical frame image, the candidate matching area is determined based on the position information of the target to be tracked, and the historical frame image is an image before the current frame image in the monitoring video; and determining the target to be matched associated with the target to be tracked based on the similarity between the second target detection information of each target to be matched and the first target detection information.
In order to solve the above technical problem, another technical solution adopted by the present application is to provide a real-time target tracking device comprising a memory and a processor connected to each other, wherein the memory is used for storing a computer program which, when executed by the processor, implements the video-based target association method of the above technical solution.
In order to solve the above technical problem, yet another technical solution adopted by the present application is to provide a computer-readable storage medium for storing a computer program which, when executed by a processor, implements the video-based target association method of the above technical solution.
Through the above solutions, the beneficial effects of the application are as follows. A current frame image of the monitoring video is first acquired; target detection is performed on the target to be tracked in the current frame image to obtain its first target detection information; each target to be matched in the historical frame image is determined based on the position information of the target to be tracked in the current frame image; and the target to be matched associated with the target to be tracked is finally determined based on the similarity between the second target detection information of each target to be matched and the first target detection information. With this scheme, target detection information of the target to be tracked is obtained, realizing intelligent detection and analysis of the target; at the same time, the target detection information (including the first target detection information and the second target detection information) is used to match the target to be tracked against the historical frame image, realizing target association and thus completing target tracking. In addition, combining target detection analysis with target tracking further improves the tracking accuracy and makes the method more intelligent.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. It is apparent that the drawings in the following description are only some embodiments of the present application, and that those skilled in the art can obtain other drawings from these drawings without creative effort. In the drawings:
FIG. 1 is a schematic flow chart of an embodiment of the video-based target association method provided herein;
FIG. 2 is a schematic flow chart of another embodiment of the video-based target association method provided herein;
FIG. 3 is a schematic flow chart illustrating the narrowing of a target detection frame of a non-key frame image according to the present application;
FIG. 4 is a schematic flow chart of step 27 provided herein;
FIG. 5 is a schematic diagram of an embodiment of a real-time target tracking device provided herein;
FIG. 6 is a schematic structural diagram of an embodiment of a computer-readable storage medium provided in the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be noted that the following examples are only illustrative of the present application, and do not limit the scope of the present application. Likewise, the following examples are only some examples and not all examples of the present application, and all other examples obtained by a person of ordinary skill in the art without any inventive work are within the scope of the present application.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
It should be noted that the terms "first", "second" and "third" in the present application are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implying any number of indicated technical features. Thus, a feature defined as "first," "second," or "third" may explicitly or implicitly include at least one of the feature. In the description of the present application, "plurality" means at least two, e.g., two, three, etc., unless explicitly specifically limited otherwise. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
Referring to fig. 1, fig. 1 is a schematic flowchart illustrating an embodiment of a video-based target association method provided in the present application, where the method includes:
step 11: and acquiring a current frame image in the monitoring video.
A monitoring area is monitored by a camera device, such as a bullet camera, a dome camera, or a panoramic camera, to obtain the monitoring video. Specifically, the camera device may be mounted overhead or at an incline: an inclined mounting may be chosen when the installation height is greater than 2.5 meters, and an overhead mounting when the installation height is less than 2.5 meters, so that the monitoring picture acquired by the camera device is complete and the target in the current frame image can be tracked subsequently.
Step 12: performing target detection on the target to be tracked in the current frame image to obtain first target detection information of the target to be tracked.
The target to be tracked may be a human body object; that is, the current frame image obtained from the monitoring video may contain human body objects, and target detection is performed on the human body objects in the current frame image. Specifically, there may be one, two, three, or more targets to be tracked in the current frame image, and target detection is performed on all of them to obtain the first target detection information corresponding to each. It can be understood that the current frame image may be input into a target detection model, which performs target detection and outputs the target detection frames corresponding to all targets to be tracked; the target detection frames are then processed to obtain the first target detection information of the targets to be tracked. The target detection model may be a YOLOX model, a CenterNet model, or the like.
In a specific embodiment, a user may set a region to be detected in the current frame image to specify the target detection range; that is, during detection, only targets inside the preset region to be detected are detected. For example, a camera device monitoring a hotel lobby can acquire a monitoring picture containing the hotel front desk, the hotel entrance, and the guest waiting area. If the user only wants to track the pedestrians coming and going through the hotel entrance, the area around the entrance can be designated as the region to be detected; subsequent target detection and tracking are then performed only within this region and not in the other areas. Detection and tracking are thus carried out in a targeted way, unnecessary workload is avoided, computing resources are saved, and working efficiency is further improved.
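For illustration, a minimal Python sketch of this region-of-interest filtering is given below; the (x1, y1, x2, y2) box format, the function names, and the example coordinates are assumptions made for the sketch, not part of the original disclosure.

```python
# Minimal sketch: keep only detections whose box center lies inside the
# user-defined region to be detected. Box format (x1, y1, x2, y2) is assumed.

def in_region(box, region):
    """True if the center of `box` falls inside the rectangular `region`."""
    x1, y1, x2, y2 = box
    rx1, ry1, rx2, ry2 = region
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    return rx1 <= cx <= rx2 and ry1 <= cy <= ry2

def filter_detections(detections, region):
    """Discard detections outside the region; only these are tracked later."""
    return [d for d in detections if in_region(d["box"], region)]

# Hypothetical example: only pedestrians near the hotel entrance are kept.
entrance = (100, 200, 640, 720)
dets = [{"box": (120, 300, 180, 500)}, {"box": (900, 100, 960, 260)}]
print(filter_detections(dets, entrance))  # -> only the first detection
```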
Step 13: determining each target to be matched in the historical frame image based on the position information of the target to be tracked in the current frame image.
The historical frame image is an image before the current frame image in the monitoring video; in general it is the frame immediately preceding the current frame image, but when frame skipping or a similar situation occurs it may be several frames before the current frame image. Specifically, the targets to be matched comprise the targets in a candidate matching area of the historical frame image, and the candidate matching area is determined based on the position information of the target to be tracked.
In a specific embodiment, the candidate matching area may be determined based on the current position information of the target to be tracked in the current frame image, for example (but not limited to) a circular or polygonal region centered on the current position of the target to be tracked. The targets inside the candidate matching area of the historical frame image are taken as the targets to be matched, and target matching is performed between them and the target to be tracked in the current frame image. It is understood that the size (e.g., the diameter) of the candidate matching area can be set according to the actual situation.
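A minimal sketch of this candidate selection, assuming a circular matching area and targets represented by their box centers (both the representation and the radius value are assumptions):

```python
import math

def candidates_in_area(tracked_center, history_targets, radius):
    """Return the historical-frame targets whose centers fall inside a circular
    candidate matching area centered on the tracked target's current position."""
    cx, cy = tracked_center
    return [t for t in history_targets
            if math.hypot(t["center"][0] - cx, t["center"][1] - cy) <= radius]

# Hypothetical usage: the radius is tuned per deployment, as the text notes.
history = [{"id": 7, "center": (110, 240)}, {"id": 9, "center": (800, 90)}]
print(candidates_in_area((100, 250), history, radius=80))  # -> target 7 only
```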
Step 14: determining the target to be matched associated with the target to be tracked based on the similarity between the second target detection information of each target to be matched and the first target detection information.
The second target detection information of each target to be matched can be compared with the first target detection information to obtain their similarity, so that the target to be tracked is matched against the targets to be matched in the historical frame image. A successful match means that the target to be tracked and the matched target are the same target appearing in both the current frame image and the historical frame image, and the tracking result of the target to be tracked is obtained; the tracking result may comprise the second target detection information from the historical frame image and the first target detection information from the current frame image. Specifically, when there are several targets to be tracked in the current frame image, they can be matched at the same time to obtain their respective tracking results.
In a specific embodiment, the first target detection information may include category information and key point information of the target to be tracked. The category information describes attributes of the target to be tracked, such as age or gender; the age information may be a specific value or a range. The key point information describes the key points of the target to be tracked: for a human body object, a key point corresponds to a joint or body part such as the head, an elbow joint, a shoulder joint, or a knee, and key point detection on a human body object typically outputs 15-17 human body key points. It is to be understood that the second target detection information may likewise include category information and key point information of the target to be matched.
In this embodiment, a current frame image of the monitoring video is first acquired; target detection is performed on the target to be tracked in the current frame image to obtain its first target detection information; each target to be matched in the historical frame image is determined based on the position information of the target to be tracked in the current frame image; and the target to be matched associated with the target to be tracked is finally determined based on the similarity between the second target detection information of each target to be matched and the first target detection information. With this scheme, target detection information of the target to be tracked is obtained, realizing intelligent detection and analysis of the target, and the target detection information is used to match the target to be tracked against the historical frame image, realizing target association and completing target tracking. In addition, combining target detection analysis with target tracking further improves the tracking accuracy and makes the method more intelligent.
Referring to fig. 2, fig. 2 is a schematic flowchart illustrating another embodiment of a video-based object association method provided in the present application, the method including:
step 21: and acquiring a current frame image in the monitoring video.
This step 21 is the same as step 11 in the above embodiment, and will not be described again.
Step 22: judging whether the current frame image is a key frame image.
Key frame images can be designated in the monitoring video: some frames of the monitoring video are selected as key frame images and the remaining frames are non-key frame images. During target detection, only the key frame images are processed by the detection model and detection on the non-key frame images is skipped, which saves detection time and improves working efficiency. It is understood that key frame and non-key frame images are distinguished only in the target detection step; subsequent operations such as classification and key point detection are performed on both key frame and non-key frame images.
In a specific embodiment, whether the current frame image is a key frame image can be judged by checking whether its frame number is in a preset value set; if the frame number of the current frame image is in the preset value set, the current frame image is determined to be a key frame image. Further, the preset value set may be chosen by the user. For example, the user may specify that, starting from the first frame image, key frames recur at a fixed interval with four non-key frames between consecutive key frames, so the preset value set may include 1, 6, 11, 16, and so on. It can be understood that, to ensure that a non-key frame image can always derive its target detection frame from the target detection frame of a historical frame image, the first frame image in the monitoring video is a key frame image regardless of the key frame interval set by the user; that is, the preset value set must include 1.
Further, besides the judgment by the preset value set, the judgment can also be made from the real-time target tracking result. Denote the current frame as the Nth frame: if a new target appears in the (N-1)th frame image compared with the (N-2)th frame image, or an old target present in the (N-2)th frame image is lost in the (N-1)th frame image, the Nth frame image is judged to be a key frame image, which ensures that new targets can be detected accurately and in time. For example, if the (N-2)th frame image only contains the targets A, B, and C and a new target D appears in the (N-1)th frame image, then, in order to detect the target D accurately, the Nth frame image is marked as a key frame image so that target detection is performed on it.
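The two key-frame rules above can be sketched as follows; representing the targets of each frame as a set of target IDs is an assumption of the sketch, not something the text prescribes:

```python
def is_key_frame(frame_no, preset_frames, prev_ids, prev_prev_ids):
    """Frame N is a key frame if its frame number is in the preset value set,
    or if a target appeared or was lost between frames N-2 and N-1."""
    if frame_no in preset_frames:
        return True
    return prev_ids != prev_prev_ids  # new target appeared or old target lost

# Hypothetical example: key frames 1, 6, 11, 16, ...; target D appears in N-1.
preset = {1, 6, 11, 16}
print(is_key_frame(4, preset, prev_ids={"A", "B", "C", "D"},
                   prev_prev_ids={"A", "B", "C"}))  # -> True
```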
Step 23: if the current frame image is a key frame image, processing the current frame image with a target detection model to obtain the target detection frames.
When the current frame image is a key frame image, it is processed by a target detection model to obtain the target detection frames corresponding to all targets to be tracked in the current frame image, where the target detection model may be a YOLOX model, a CenterNet model, or the like.
Step 24: if the current frame image is a non-key frame image, determining the target detection frame based on the positions of the targets to be matched in the historical frame image.
When the current frame image is a non-key frame image, the target detection frame of each target to be tracked is determined from the positions of the targets to be matched in the historical frame image. Specifically, the target to be matched that is closest to the current target to be tracked is found in the historical frame image, and its target detection frame is enlarged by a preset multiple (for example, 1.2-2 times), i.e., the length and the width of the target detection frame are enlarged by the same factor; the enlarged frame is used as the target detection frame of the current target to be tracked.
It can be understood that, when several targets to be matched in the historical frame image are similarly close to the current target to be tracked and the closest one cannot be determined, the target detection frames of all those targets to be matched are enlarged by the same factor, the union region of all the enlarged frames is used as the target detection frame, and the target detection frame is then processed to obtain the first target detection information of the target to be tracked, which includes category information and key point information.
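The frame expansion and, when several candidates are equally close, the union of the expanded frames can be sketched as below (the box format and the concrete factor, chosen inside the stated 1.2-2x range, are assumptions):

```python
def expand_box(box, factor=1.5):
    """Scale an (x1, y1, x2, y2) box about its center, enlarging length and
    width by the same preset multiple (e.g. 1.2-2x)."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    hw, hh = (x2 - x1) * factor / 2.0, (y2 - y1) * factor / 2.0
    return (cx - hw, cy - hh, cx + hw, cy + hh)

def union_box(boxes):
    """Smallest box containing every expanded box: used when the closest
    target to be matched cannot be singled out."""
    xs1, ys1, xs2, ys2 = zip(*boxes)
    return (min(xs1), min(ys1), max(xs2), max(ys2))
```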
Step 25: processing the target detection frame with a classification model to obtain the category information of the target to be matched and the category information of the target to be tracked.
The target detection frame is processed by a classification model to obtain the category information of the target to be matched and the category information of the target to be tracked. Specifically, when there are several targets to be tracked, the target detection frames of all targets to be tracked may be input into the classification model at the same time to obtain the category information of all targets to be tracked, where the category information may include information such as age or gender.
Step 26: processing the target detection frame with a key point detection model to obtain the key point information of the target to be matched and the key point information of the target to be tracked.
Processing the target detection frame by adopting a key point detection model to obtain key point information of a target to be matched and key point information of the target to be tracked; specifically, when the number of the targets to be tracked is multiple, the target detection frames of all the targets to be tracked can be simultaneously input into the key point detection model to obtain the key point information of all the targets to be tracked; further, the key point information includes all key points of the target to be tracked, and the key points correspond to joints or parts of the human body, for example: head, elbow, shoulder or knee, etc.
In a specific embodiment, the target detection information further includes key point position information. When the current frame image is a non-key frame image, after the category information and the key point information of the target to be tracked are obtained, the target detection frame can be narrowed based on the key point position information to obtain a more accurate target detection frame. In this way, if the next frame image is also a non-key frame image, its target detection frame can be generated from the narrowed target detection frame of the current frame image, which also provides a basis for the subsequent similarity calculation.
Specifically, a position detection model may perform position detection on the key point information of the target to be tracked and the key point information of the target to be matched, respectively, to obtain the key point position information of the target to be tracked and the key point position information of the target to be matched. The key point position information may include the position coordinate of each key point, consisting of an abscissa value and an ordinate value; for example, the position coordinate (2, 5) of a key point indicates an abscissa value of 2 and an ordinate value of 5. Further, the position detection model may be a graph convolutional network (GCN); that is, the positions of the key points may be detected by the graph convolutional network to obtain the key point position information, and the target detection frame is then narrowed based on the key point position information. The specific steps are shown in fig. 3:
step 31: and acquiring the key point with the minimum ordinate value and the key point with the maximum ordinate value in all the key points based on the position information of the key points.
The key point with the minimum ordinate value and the key point with the maximum ordinate value are found from the position coordinates of the key points. For example, when the target to be tracked is standing normally, the key point with the minimum ordinate value may be a key point on the feet, and the key point with the maximum ordinate value may be a key point on the head.
Step 32: acquiring the key point with the minimum abscissa value and the key point with the maximum abscissa value among all key points based on the key point position information.
For example, when the target to be tracked is standing with hands in pockets, the key point with the minimum abscissa value is the key point on the elbow joint of one side, and the key point with the maximum abscissa value is the key point on the elbow joint of the other side.
Step 33: acquiring the circumscribed rectangle of the key point with the minimum ordinate value, the key point with the maximum ordinate value, the key point with the minimum abscissa value, and the key point with the maximum abscissa value to obtain the target detection frame of the current frame image.
The circumscribed rectangle of the key point with the minimum ordinate value, the key point with the maximum ordinate value, the key point with the minimum abscissa value, and the key point with the maximum abscissa value is the smallest rectangle containing all key points of the target to be tracked. This rectangle is used as the target detection frame of the target to be tracked, which completes the narrowing of the target detection frame and yields a more accurate frame.
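Steps 31-33 amount to taking the axis-aligned circumscribed rectangle of the key points; a minimal sketch follows (representing key points as (x, y) pairs is an assumption):

```python
def box_from_keypoints(keypoints):
    """Circumscribed rectangle spanning the minimum/maximum abscissa and
    ordinate values over all key points (steps 31-33)."""
    xs = [x for x, _ in keypoints]
    ys = [y for _, y in keypoints]
    return (min(xs), min(ys), max(xs), max(ys))

# Example: head (2, 5), elbows (1, 3) and (4, 3), feet (2, 0) and (3, 0).
print(box_from_keypoints([(2, 5), (1, 3), (4, 3), (2, 0), (3, 0)]))
# -> (1, 0, 4, 5): the narrowed target detection frame
```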
Step 27: comparing the similarity of the target to be tracked and each target to be matched based on the category information of the target to be matched, the key point information of the target to be matched, the category information of the target to be tracked, and the key point information of the target to be tracked to obtain the corresponding similarity.
After the classification model and the key point detection model are used to obtain the category information and the key point information of the target to be matched and of the target to be tracked, similarity comparison is performed between the target to be tracked and each target to be matched according to the category information and the key point information, yielding the similarity corresponding to the target to be tracked and each target to be matched. It can be understood that the similarity represents the matching degree between the target to be tracked and the target to be matched: a higher similarity indicates a higher matching degree.
The steps of comparing the similarity of the target to be tracked and each target to be matched based on the category information of the target to be matched, the key point information of the target to be matched, the category information of the target to be tracked, and the key point information of the target to be tracked to obtain the corresponding similarity are described in detail below, as shown in fig. 4:
step 41: and comparing the matching degree of the target to be tracked and each target to be tracked based on the category information of the target to be tracked and the category information of the target to be tracked to obtain category similarity.
The category information comprises the categories of the target to be tracked and the feature vector corresponding to each category. The deviation between the feature vector of the target to be tracked and the feature vector of the target to be matched is calculated to obtain a deviation value, and the deviation value is then matched against a preset deviation scoring table to obtain the category similarity. Specifically, the preset deviation scoring table may be set according to the actual situation; it may contain deviation values and the corresponding category similarities, so the calculated deviation value is looked up in the table to obtain the corresponding category similarity.
It can be understood that when the category information includes several kinds of information, such as age and gender, the deviation is computed between the feature vectors of the same category of the target to be tracked and of the target to be matched, yielding a sub-deviation value and a corresponding sub-category similarity for each category. In this case, to obtain a category similarity that represents the overall category matching degree between the target to be tracked and the target to be matched, all sub-category similarities are summed and averaged, and the average is used as the final category similarity. For example: the deviation between the age feature vectors of the target to be tracked and the target to be matched gives one deviation value, and the deviation between their gender feature vectors gives another deviation value; each deviation value is matched against the preset deviation scoring table to obtain the corresponding sub-category similarity, and all sub-category similarities are then summed and averaged to obtain the final category similarity.
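A sketch of this per-category comparison follows; the Euclidean deviation and the contents of the scoring table are assumptions (the text only says the table is set according to the actual situation):

```python
def score_from_table(value, table):
    """Map a deviation value to a similarity using a preset scoring table,
    given here as (upper_bound, score) rows sorted by bound. Table values
    are assumed for illustration."""
    for bound, score in table:
        if value <= bound:
            return score
    return 0.0

def category_similarity(feats_tracked, feats_matched, table):
    """Score each shared category (e.g. age, gender) by the deviation between
    its feature vectors, then average the sub-category similarities."""
    subs = []
    for cat, vec in feats_tracked.items():
        other = feats_matched[cat]
        deviation = sum((a - b) ** 2 for a, b in zip(vec, other)) ** 0.5
        subs.append(score_from_table(deviation, table))
    return sum(subs) / len(subs)

# Hypothetical table and feature vectors.
table = [(0.1, 1.0), (0.5, 0.8), (1.0, 0.5), (2.0, 0.2)]
print(category_similarity({"age": [0.30], "gender": [0.90]},
                          {"age": [0.35], "gender": [0.88]}, table))  # -> 1.0
```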
Step 42: comparing the positions of the target to be tracked and each target to be matched based on the target detection frame of the target to be matched and the target detection frame of the target to be tracked to obtain a spatial similarity.
The change in spatial position between the target to be tracked and a target to be matched is judged by computing the Intersection over Union (IoU) between their target detection frames: the IoU between the target detection frame of the target to be tracked and the target detection frame of the target to be matched is calculated to obtain the current intersection ratio, which is then matched against a preset space scoring table to obtain the spatial similarity. Specifically, the intersection area of the target detection frame in the historical frame image and the target detection frame in the current frame image is divided by their union area to obtain the current intersection ratio. It can be understood that when the current frame image is a non-key frame image, the narrowed target detection frame is used in the spatial similarity calculation.
Further, the preset space scoring table may be set according to the actual situation; it may contain intersection ratios and the corresponding spatial similarities, and the calculated current intersection ratio is matched against the table to obtain the corresponding spatial similarity.
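The intersection ratio itself is the standard IoU; a sketch is given below (the score-table lookup would reuse a helper like score_from_table above):

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes: the intersection
    area of the historical-frame and current-frame detection frames divided
    by their union area."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # -> 25 / 175 ≈ 0.143
```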
Step 43: comparing the postures of the target to be tracked and each target to be matched based on the key point information of the target to be matched and the key point information of the target to be tracked to obtain a posture similarity.
The target detection information also comprises posture information, which represents the posture of the human body object and may include postures such as standing, sitting, or lying. The posture comparison in this embodiment may be performed in either of the following two ways:
1) Posture comparison processing is performed on the key point position information of the target to be tracked and the key point position information of the target to be matched with a posture comparison model to obtain the posture information and the posture similarity.
The posture comparison model may be a Siamese neural network, whose specific structure and working principle are the same as in the related art and are not repeated here. The key point position information of the target to be tracked and the key point position information of the target to be matched can be input into the Siamese network to obtain the posture information of the target to be tracked and the posture similarity between the target to be tracked and the target to be matched.
2) A first reference key point is selected from the key points of the target to be tracked and a second reference key point is selected from the key points of the target to be matched; the first reference key point is aligned with the second reference key point, the distances between the remaining key points of the target to be tracked and the corresponding remaining key points of the target to be matched are calculated to obtain a posture deviation value, and the posture deviation value is matched against a preset posture scoring table to obtain the posture similarity.
The second reference key point is located at the same body part as the first reference key point; typically the key points on the head are chosen as the first and second reference key points. That is, the head key points of the target to be tracked and of the target to be matched are aligned, and the distances between the remaining key points at corresponding positions are calculated. For example, with the head key points as the reference key points, the distance between the hand key point of the target to be tracked and the hand key point of the target to be matched is calculated, the distance between the foot key point of the target to be tracked and the foot key point of the target to be matched is calculated, and so on for each part, yielding a sub-posture deviation value for the key points of each part. All sub-posture deviation values are summed and averaged to obtain the final posture deviation value, which is then matched against the preset posture scoring table to obtain the posture similarity.
Further, the distance between key points can be computed with methods such as the Euclidean distance or the cosine distance; the distance calculation method is not limited here. The preset posture scoring table may be set according to the actual situation and may contain posture deviation values and the corresponding posture similarities.
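A sketch of the second method, using head alignment and the Euclidean distance (representing key points as a dict of part name to (x, y) is an assumption):

```python
import math

def pose_deviation(kps_tracked, kps_matched, ref="head"):
    """Align both skeletons on the shared reference key point, then average
    the distances between the remaining corresponding key points."""
    dx = kps_tracked[ref][0] - kps_matched[ref][0]
    dy = kps_tracked[ref][1] - kps_matched[ref][1]
    dists = [math.hypot(x - (kps_matched[part][0] + dx),
                        y - (kps_matched[part][1] + dy))
             for part, (x, y) in kps_tracked.items() if part != ref]
    return sum(dists) / len(dists)  # matched against the posture scoring table

# Hypothetical skeletons with three parts each.
a = {"head": (0, 10), "hand": (2, 6), "foot": (1, 0)}
b = {"head": (5, 10), "hand": (7, 5), "foot": (6, 0)}
print(pose_deviation(a, b))  # -> 0.5 (only the hand differs after alignment)
```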
In other embodiments, the above two schemes may also be combined to determine the posture similarity; for example, the posture similarities obtained by the two schemes are weighted and summed to give the final posture similarity.
Step 44: generating the similarity based on the category similarity, the spatial similarity, and the posture similarity.
The category similarity, the spatial similarity, and the posture similarity can be weighted and summed to obtain the similarity; target matching is then performed based on the similarity between the target to be tracked and each target to be matched, thereby determining the tracking result.
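The fusion is a plain weighted sum; the weights below are assumptions to be tuned per deployment, not values from the disclosure:

```python
def fuse_similarity(cat_sim, spatial_sim, pose_sim, weights=(0.3, 0.4, 0.3)):
    """Weighted summation of the three similarity terms (step 44)."""
    wc, ws, wp = weights
    return wc * cat_sim + ws * spatial_sim + wp * pose_sim

print(fuse_similarity(0.9, 0.8, 0.7))  # -> 0.80
```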
Step 28: judging whether a target to be matched with a similarity greater than a preset score threshold exists among all targets to be matched.
The preset score threshold can be set according to the actual situation. For example, if the preset score threshold is set to 0.8, a target to be matched whose similarity is greater than 0.8 is considered to match the target to be tracked.
Step 29: if a target to be matched with a similarity greater than the preset score threshold exists among all targets to be matched, associating the target to be matched with the maximum similarity with the target to be tracked.
When several targets to be matched have a similarity greater than the preset score threshold, the one that best matches the target to be tracked is selected: the target to be matched with the maximum similarity is associated with the target to be tracked, i.e., it is determined to be successfully matched with the target to be tracked, and the tracking result of the target to be tracked is obtained. Specifically, the tracking result may include the second target detection information from the historical frame image and the first target detection information from the current frame image; that is, the tracking result for a given target to be tracked consists of its first target detection information and the second target detection information of the matched target in the historical frame image, and the target detection information may include category information, key point information, key point position information, posture information, and the like.
In a specific embodiment, the target detection information further includes identity information (ID). When the target to be tracked is successfully matched with a target to be matched, the identity information of the target to be matched is taken as the identity information of the target to be tracked, i.e., the target ID of the target to be matched is assigned to the target to be tracked. When the target to be tracked is not successfully matched, a new target ID is allocated to it. Specifically, when the target to be tracked fails to match all targets to be matched in the historical frame image, i.e., it is a newly appeared target, a new target ID can be allocated to it from a preset target ID database.
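Steps 28-29 together with the ID assignment can be sketched as follows; the running counter standing in for the preset target ID database, and the 0.8 threshold, are assumptions:

```python
import itertools

_new_ids = itertools.count(1)  # stand-in for the preset target ID database

def associate(tracked, candidates, similarity_fn, threshold=0.8):
    """Associate the tracked target with the candidate of maximum similarity
    above the preset score threshold; otherwise allocate a new target ID."""
    scored = [(similarity_fn(tracked, c), c) for c in candidates]
    above = [(s, c) for s, c in scored if s > threshold]
    if above:
        _, best = max(above, key=lambda sc: sc[0])
        tracked["id"] = best["id"]        # inherit the matched target's ID
    else:
        tracked["id"] = next(_new_ids)    # newly appeared target
    return tracked
```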
In this embodiment, the way the target detection frame is obtained depends on whether the current frame image is a key frame image: a target detection model performs target detection on key frame images, while for non-key frame images the target detection frame is derived from the target detection frames of the historical frame image, which preserves detection accuracy while saving detection time and improving working efficiency. The target detection frame is processed to obtain the corresponding category information and key point information; target matching between the historical frame image and the current frame image is performed with this information, and the similarity is evaluated from three aspects (category similarity, spatial similarity, and posture similarity), which makes the matching result, and hence the target tracking, more accurate. At the same time, rich tracking results are produced while the target is tracked accurately, including target detection information such as category information, key point information, the target ID, and posture information, so target detection analysis and target tracking are combined and target tracking becomes more intelligent and comprehensive.
Referring to fig. 5, fig. 5 is a schematic structural diagram of an embodiment of the real-time object tracking device provided in the present application, in which the real-time object tracking device 50 includes a memory 51 and a processor 52 connected to each other, the memory 51 is used for storing a computer program, and the computer program is used for implementing the video-based object association method in the foregoing embodiment when being executed by the processor 52.
Referring to fig. 6, fig. 6 is a schematic structural diagram of an embodiment of a computer-readable storage medium 60 provided by the present application. The computer-readable storage medium 60 is used for storing a computer program 61, and the computer program 61, when executed by a processor, is used for implementing the video-based target association method in the foregoing embodiment.
The computer-readable storage medium 60 may be a server, a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of modules or units is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The above description is only an example of the present application and is not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the specification and the drawings, or which are directly or indirectly applied to other related technical fields, are intended to be included within the scope of the present application.

Claims (16)

1. A video-based target association method is characterized by comprising the following steps:
acquiring a current frame image of a monitoring video;
performing target detection on the target to be tracked in the current frame image to obtain first target detection information of the target to be tracked;
determining each target to be matched in the historical frame image based on the position information of the target to be tracked in the current frame image; the target to be matched comprises a target in a candidate matching area in a historical frame image, and the candidate matching area is determined based on the position information of the target to be tracked; the historical frame image is an image before the current frame image in the monitoring video;
and determining the target to be matched associated with the target to be tracked based on the similarity between the second target detection information of each target to be matched and the first target detection information.
2. The video-based target association method according to claim 1, wherein the step of determining the target to be matched associated with the target to be tracked based on the similarity between the second target detection information of each target to be matched and the first target detection information comprises:
judging whether a target to be matched with a similarity greater than a preset score threshold exists among all the targets to be matched;
and if so, associating the target to be matched with the maximum similarity with the target to be tracked.
3. The video-based target association method according to claim 1, wherein the step of determining the target to be matched associated with the target to be tracked is followed by:
and determining the identity identification information of the target to be matched as the identity identification information of the target to be tracked.
4. The video-based target association method according to claim 1, wherein the first target detection information comprises category information and key point information of the target to be tracked; the second target detection information comprises category information and key point information of the target to be matched; and the step of determining the target to be matched associated with the target to be tracked based on the similarity between the second target detection information of each target to be matched and the first target detection information comprises:
comparing the similarity of the target to be tracked and each target to be matched based on the category information of the target to be matched, the key point information of the target to be matched, the category information of the target to be tracked and the key point information of the target to be tracked to obtain corresponding similarity;
and determining the target to be matched associated with the target to be tracked based on the similarity.
5. The video-based object association method of claim 4,
the step of comparing the similarity of the target to be tracked and each target to be matched based on the category information of the target to be matched, the key point information of the target to be matched, the category information of the target to be tracked and the key point information of the target to be tracked to obtain the corresponding similarity comprises:
based on the category information of the target to be matched and the category information of the target to be tracked, comparing the matching degree of the target to be tracked and each target to be matched to obtain a category similarity;
based on the target detection frame of the target to be matched and the target detection frame of the target to be tracked, performing position comparison on the target to be tracked and each target to be matched to obtain spatial similarity;
comparing the postures of the target to be tracked and each target to be matched based on the key point information of the target to be matched and the key point information of the target to be tracked to obtain a posture similarity;
generating the similarity based on the category similarity, the spatial similarity, and the pose similarity.
6. The video-based target association method according to claim 5, wherein the step of generating the similarity based on the category similarity, the spatial similarity and the pose similarity comprises:
and carrying out weighted summation processing on the category similarity, the space similarity and the posture similarity to obtain the similarity.
7. The video-based target association method according to claim 5, wherein the category information includes a category of the target to be tracked and a feature vector corresponding to the category, and the step of comparing the matching degree between the target to be tracked and each target to be matched based on the category information of the target to be matched and the category information of the target to be tracked to obtain the category similarity includes:
calculating the deviation between the feature vector of the target to be tracked and the feature vector of the target to be matched to obtain a deviation value;
and matching the deviation value with a preset deviation scoring table to obtain the category similarity.
8. The video-based target association method according to claim 5, wherein the step of comparing the positions of the target to be tracked and each target to be matched based on the target detection frame of the target to be matched and the target detection frame of the target to be tracked to obtain the spatial similarity comprises:
calculating the intersection ratio between the target detection frame of the target to be tracked and the target detection frame of the target to be matched to obtain the current intersection ratio;
and matching the current intersection ratio with a preset space scoring table to obtain the space similarity.
9. The video-based target association method according to claim 5, wherein the step of comparing the postures of the target to be tracked and each target to be matched based on the key point information of the target to be matched and the key point information of the target to be tracked to obtain the posture similarity comprises:
based on a position detection model, respectively carrying out position detection processing on the key point information of the target to be tracked and the key point information of the target to be matched to obtain the key point position information of the target to be tracked and the key point position information of the target to be matched;
and based on a posture comparison model, carrying out posture comparison processing on the key point position information of the target to be tracked and the key point position information of the target to be matched to obtain the posture information and the posture similarity.
10. The video-based target association method according to claim 5, wherein the step of comparing the postures of the target to be tracked and each target to be matched based on the key point information of the target to be matched and the key point information of the target to be tracked to obtain the posture similarity further comprises:
selecting a first reference key point from all key points of the target to be tracked, and selecting a second reference key point from the target to be matched, wherein the second reference key point and the first reference key point are positioned at the same part;
aligning the first reference key point and the second reference key point, and calculating the distance between the remaining key points in the target to be tracked and the corresponding remaining key points in the target to be matched to obtain a posture deviation value;
and matching the posture deviation value with a preset posture scoring table to obtain the posture similarity.
11. The video-based target association method according to claim 1, wherein before the step of performing target detection on the target to be tracked in the current frame image to obtain the first target detection information of the target to be tracked, the method includes:
judging whether the current frame image is a key frame image;
if so, processing the current frame image by adopting a target detection model to obtain a target detection frame;
if not, determining a target detection frame based on the position of the target to be matched in the historical frame image;
processing the target detection frame by adopting a classification model to obtain the category information of the target to be matched and the category information of the target to be tracked;
and processing the target detection frame by adopting a key point detection model to obtain the key point information of the target to be matched and the key point information of the target to be tracked.
12. The video-based target association method of claim 11, wherein the step of determining whether the current frame image is a key frame image comprises:
judging whether the frame number of the current frame image is in a preset value set or not;
and if so, determining the current frame image as a key frame image.
13. The video-based target association method of claim 11, further comprising:
when the current frame image is not a key frame image, narrowing the target detection frame based on the key point position information of the target to be tracked.
14. The video-based target association method according to claim 13, wherein the key point position information includes a position coordinate of each key point, the position coordinate includes an abscissa value and an ordinate value, and the step of narrowing the target detection frame based on the key point position information includes:
acquiring a key point with the minimum longitudinal coordinate value and a key point with the maximum longitudinal coordinate value in all the key points based on the position information of the key points;
acquiring a key point with the minimum abscissa value and a key point with the maximum abscissa value from all the key points based on the position information of the key points;
and acquiring the key point with the minimum vertical coordinate value, the key point with the maximum vertical coordinate value, the key point with the minimum horizontal coordinate value and the circumscribed rectangle of the key point with the maximum horizontal coordinate value to obtain the target detection frame of the current frame image.
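In Python, claim 14 amounts to taking the axis-aligned circumscribed rectangle of the extreme key points; the (K, 2) array layout of keypoints is an assumed convention:

```python
import numpy as np

def narrow_detection_frame(keypoints):
    kps = np.asarray(keypoints, dtype=float)  # assumed (K, 2) of (x, y)
    x_min, x_max = kps[:, 0].min(), kps[:, 0].max()  # abscissa extremes
    y_min, y_max = kps[:, 1].min(), kps[:, 1].max()  # ordinate extremes
    # Circumscribed rectangle = the narrowed target detection frame.
    return (x_min, y_min, x_max, y_max)
```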
15. A real-time target tracking apparatus, comprising a memory and a processor connected to each other, wherein the memory is configured to store a computer program which, when executed by the processor, implements the video-based target association method of any one of claims 1-14.
16. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the video-based target association method of any one of claims 1-14.
CN202111296373.4A 2021-11-03 2021-11-03 Target association method and device based on video and readable storage medium Pending CN114219828A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111296373.4A CN114219828A (en) 2021-11-03 2021-11-03 Target association method and device based on video and readable storage medium

Publications (1)

Publication Number Publication Date
CN114219828A (en) 2022-03-22

Family

ID=80695632

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111296373.4A Pending CN114219828A (en) 2021-11-03 2021-11-03 Target association method and device based on video and readable storage medium

Country Status (1)

Country Link
CN (1) CN114219828A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114913602A (en) * 2022-05-20 2022-08-16 深圳市商汤科技有限公司 Behavior recognition method, behavior recognition device, behavior recognition equipment and storage medium
CN114944015A (en) * 2022-06-20 2022-08-26 商汤国际私人有限公司 Image processing method and device, electronic equipment and storage medium
CN114979567A (en) * 2022-04-29 2022-08-30 北京容联易通信息技术有限公司 Object and region interaction method and system applied to video intelligent monitoring
CN116977672A (en) * 2022-04-19 2023-10-31 追觅创新科技(苏州)有限公司 Matching method and device, storage medium and electronic device
CN119094695A (en) * 2024-06-26 2024-12-06 本际(山东)电子科技有限公司 Video surveillance processing method and system
CN114944015B (en) * 2022-06-20 2025-04-18 商汤国际私人有限公司 Image processing method, device, electronic device and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106846362A (en) * 2016-12-26 2017-06-13 歌尔科技有限公司 A kind of target detection tracking method and device
CN108257158A (en) * 2018-03-27 2018-07-06 福州大学 A kind of target prediction and tracking based on Recognition with Recurrent Neural Network
CN110334635A (en) * 2019-06-28 2019-10-15 Oppo广东移动通信有限公司 Subject tracking method, apparatus, electronic device and computer-readable storage medium
CN110705478A (en) * 2019-09-30 2020-01-17 腾讯科技(深圳)有限公司 Face tracking method, device, equipment and storage medium
CN113077511A (en) * 2020-01-06 2021-07-06 初速度(苏州)科技有限公司 Multi-camera target matching and tracking method and device for automobile

Similar Documents

Publication Publication Date Title
CN114219828A (en) Target association method and device based on video and readable storage medium
US11393103B2 (en) Target tracking method, device, system and non-transitory computer readable medium
US6826316B2 (en) System and method for determining image similarity
CN109829435B (en) Video image processing method, device and computer readable medium
US8130285B2 (en) Automated searching for probable matches in a video surveillance system
US20100296701A1 (en) Person tracking method, person tracking apparatus, and person tracking program storage medium
Lee et al. Place recognition using straight lines for vision-based SLAM
CN115240130A (en) Pedestrian multi-target tracking method and device and computer readable storage medium
CN112836640A (en) A single-camera multi-target pedestrian tracking method
CN110674680B (en) Living body identification method, living body identification device and storage medium
CN110189373A (en) A kind of fast relocation method and device of view-based access control model semantic information
CN104615998B (en) A kind of vehicle retrieval method based on various visual angles
CN111160307A (en) Face recognition method and face recognition card punching system
JP6503079B2 (en) Specific person detection system, specific person detection method and detection device
KR20180009180A (en) System and Method for Multi Object Tracking based on Reliability Assessment of Learning in Mobile Environment
CN109508657B (en) Crowd gathering analysis method, system, computer readable storage medium and device
CN112001280B (en) Real-time and online optimized face recognition system and method
CN113705329A (en) Re-recognition method, training method of target re-recognition network and related equipment
JP4232388B2 (en) Video search apparatus, video search method used therefor, and program thereof
CN113657169B (en) Gait recognition method, device and system and computer readable storage medium
CN115527083B (en) Image annotation method and device and electronic equipment
CN116797632A (en) Running supervision system and method based on multi-target tracking
JPH0991432A (en) Method for extracting doubtful person
Niu et al. A spatio-temporal identity verification method for person-action instance search in movies
CN112257684B (en) A Cross-Camera Global Pedestrian Tracking Method Based on Real-time Video Streaming

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination