CN112613418A - Parking lot management and control method and device based on target activity prediction and electronic equipment - Google Patents
- Publication number
- CN112613418A (application number CN202011570369.8A)
- Authority
- CN
- China
- Prior art keywords
- target
- activity
- parking lot
- sub
- image
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/56—Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
- G06V20/58—Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
- G06V20/586—Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads of parking space
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/44—Event detection
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Multimedia (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- General Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Molecular Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Health & Medical Sciences (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a parking lot management and control method and apparatus based on target activity prediction, an electronic device, and a storage medium. The method comprises the following steps: acquiring a scene video of a parking lot; detecting and tracking targets in the scene video to generate a spatial AND-OR graph (S-AOG) model of the parking lot, where the S-AOG model represents the spatial position relations of the targets in the scene video; applying a sub-activity extraction algorithm to the S-AOG model to obtain a sub-activity label set representing the activity state of each target of interest; inputting the sub-activity label set into a pre-obtained temporal AND-OR graph (T-AOG) model to obtain a prediction result for the future activity of the target of interest in the parking lot, where the T-AOG model is obtained from a pre-established corpus of target activities in the parking lot; and sending control information to the corresponding equipment of the parking lot based on the predicted future activity of the target of interest. The invention enables accurate and rapid prediction of target activity in a parking lot, and thereby effective management and control of the parking lot.
Description
Technical Field
The invention belongs to the field of parking lot management, and particularly relates to a parking lot management and control method and apparatus based on target activity prediction, and an electronic device.
Background
Parking lots are places crowded with people and motor vehicles. To ensure safety and enable effective management, the condition of a parking lot is generally monitored with video surveillance equipment.
By detecting and analyzing the surveillance video, the current behavior of each target in the parking lot can be obtained. However, such detection is after-the-fact: it cannot predict a target's activity at a future time, so future activity cannot be responded to promptly, nor can safety events such as vehicle collisions be avoided in time. Effective management and control of the parking lot therefore cannot be achieved.
Disclosure of Invention
Embodiments of the invention aim to provide a parking lot management and control method and apparatus based on target activity prediction, an electronic device, and a storage medium, so that target activity in a parking lot can be predicted accurately and quickly and the parking lot can be managed and controlled effectively. The specific technical solutions are as follows:
in a first aspect, an embodiment of the present invention provides a parking lot management and control method based on target activity prediction, where the method includes:
acquiring a scene video of a parking lot;
generating a spatial AND-OR graph (S-AOG) model of the parking lot by detecting and tracking targets in the scene video, wherein the S-AOG model represents the spatial position relations of the targets in the scene video;
applying a sub-activity extraction algorithm to the S-AOG model to obtain a sub-activity label set representing the activity state of each target of interest;
inputting the sub-activity label set into a pre-obtained temporal AND-OR graph (T-AOG) model to obtain a prediction result for the future activity of the target of interest in the parking lot, wherein the T-AOG model is obtained from a pre-established corpus of target activities in the parking lot; and
sending control information to the corresponding equipment of the parking lot based on the prediction result for the future activity of the target of interest.
Optionally, generating the S-AOG model of the parking lot by detecting and tracking targets in the scene video includes:
detecting the targets in the scene video with a pre-trained target detection network to obtain the attribute information of each target in each frame of the scene video, wherein the attribute information includes the position information of a bounding box containing the target;
matching the same target across the frames of the scene video with a preset multi-target tracking algorithm, based on the attribute information of each target in each frame;
determining the actual spatial distance between different targets in each frame; and
generating the S-AOG model of the parking lot from the matched per-frame target attribute information and the actual spatial distances.
Optionally, the target detection network includes a YOLOv3 network, and the preset multi-target tracking algorithm includes the DeepSORT algorithm.
Optionally, determining the actual spatial distance between different targets in each frame includes:
determining the pixel coordinates of each target in each frame;
calculating the actual coordinates in a world coordinate system corresponding to the pixel coordinates of each target, using monocular visual positioning and ranging; and
for each frame, obtaining the actual spatial distance between every pair of targets in the frame from their actual coordinates.
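The distance step above can be sketched as follows. The 3×3 ground-plane homography H below is a hypothetical placeholder, as is the function naming; in practice the mapping from pixel to world coordinates would come from calibrating the fixed parking-lot camera, which is the essence of monocular positioning and ranging for a planar scene.

```python
import math

# Hypothetical homography mapping image pixels to ground-plane world
# coordinates in metres (obtained by camera calibration in practice).
H = [[0.02, 0.0, -5.0],
     [0.0, 0.03, -8.0],
     [0.0, 0.0, 1.0]]

def pixel_to_world(u, v):
    """Project pixel (u, v) through H to ground-plane coordinates (metres)."""
    x = H[0][0] * u + H[0][1] * v + H[0][2]
    y = H[1][0] * u + H[1][1] * v + H[1][2]
    w = H[2][0] * u + H[2][1] * v + H[2][2]
    return (x / w, y / w)

def actual_distance(p1, p2):
    """Actual spatial distance between two targets given their pixel coords."""
    x1, y1 = pixel_to_world(*p1)
    x2, y2 = pixel_to_world(*p2)
    return math.hypot(x2 - x1, y2 - y1)
```

Computing this for every pair of targets in a frame yields the pairwise distance matrix used later to select targets of interest.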
Optionally, applying the sub-activity extraction algorithm to the S-AOG model to obtain the sub-activity label set representing the activity state of each target of interest includes:
determining pairs of targets in the S-AOG model whose actual spatial distance is smaller than a preset distance threshold as targets of interest;
determining, for each frame, the actual spatial distance of each pair of targets of interest and the speed value of each target of interest;
comparing each frame with the previous frame in sequence to obtain distance change information describing how the actual spatial distance of each pair of targets of interest changes, and speed change information describing how the speed value of each target of interest changes; and
describing the successively obtained distance change information and speed change information of each target of interest with semantic tags, thereby generating a sub-activity label set representing the activity state of each target of interest.
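A minimal sketch of this labelling step, with illustrative tag names that are an assumption rather than the patent's actual semantic vocabulary: frame-to-frame changes in inter-target distance and per-target speed are thresholded into semantic tags.

```python
def label_sequence(distances, speeds, eps=0.1):
    """Map per-frame distance and speed sequences to semantic tag pairs.

    distances: inter-target distance per frame for one pair of interest.
    speeds: speed value per frame for one target of interest.
    eps: dead-band below which a change counts as 'no change'.
    """
    labels = []
    for i in range(1, len(distances)):
        dd = distances[i] - distances[i - 1]   # distance change info
        dv = speeds[i] - speeds[i - 1]         # speed change info
        dist_tag = ("approaching" if dd < -eps
                    else "receding" if dd > eps else "holding")
        speed_tag = ("accelerating" if dv > eps
                     else "decelerating" if dv < -eps else "steady")
        labels.append((dist_tag, speed_tag))
    return labels
```

The resulting tag sequence is the sub-activity label set fed to the temporal model in the next step.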
Optionally, inputting the sub-activity label set into the pre-obtained T-AOG model to obtain the prediction result for the future activity of the target of interest in the parking lot includes:
inputting the sub-activity label set into the T-AOG model and obtaining the prediction result with an online symbolic prediction algorithm based on an Earley parser, wherein the prediction result includes the future sub-activity label of the target of interest and its probability of occurrence.
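To illustrate the idea only (this is not the patent's Earley implementation, and the activity corpus below is hypothetical): if the T-AOG's terminal strings are enumerated with their grammar probabilities, the next sub-activity can be predicted from the observed prefix by summing and renormalizing. A real system would run an online Earley parse over the temporal grammar instead of enumerating strings.

```python
# Hypothetical activity corpus at a parking lot barrier: each tuple is a
# terminal string of the temporal grammar with its probability.
GRAMMAR_STRINGS = {
    ("approach", "slow_down", "stop", "pay", "pass"): 0.6,
    ("approach", "slow_down", "stop", "reverse"): 0.1,
    ("approach", "pass"): 0.3,
}

def predict_next(prefix):
    """Return {next_sub_activity: probability} given observed sub-activities."""
    scores = {}
    for string, p in GRAMMAR_STRINGS.items():
        if len(string) > len(prefix) and string[:len(prefix)] == tuple(prefix):
            nxt = string[len(prefix)]
            scores[nxt] = scores.get(nxt, 0.0) + p
    total = sum(scores.values())
    return {s: v / total for s, v in scores.items()} if scores else {}
```

For the prefix ("approach", "slow_down", "stop") this yields "pay" and "reverse" with renormalized probabilities, matching the form of the prediction result described above (future sub-activity label plus occurrence probability).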
Optionally, sending control information to the corresponding equipment of the parking lot based on the prediction result for the future activity of the target of interest includes:
when the prediction result indicates that the distance between a vehicle and the exit barrier of the parking lot is smaller than a preset distance, sending control information indicating fee collection to the fee collection device at the exit barrier.
In a second aspect, an embodiment of the present invention provides a parking lot management and control apparatus based on target activity prediction, the apparatus including:
a scene video acquisition module, configured to acquire a scene video of the parking lot;
an S-AOG model generation module, configured to detect and track targets in the scene video to generate a spatial AND-OR graph (S-AOG) model of the parking lot, wherein the S-AOG model represents the spatial position relations of the targets in the scene video;
a sub-activity extraction module, configured to apply a sub-activity extraction algorithm to the S-AOG model to obtain a sub-activity label set representing the activity state of each target of interest;
a target activity prediction module, configured to input the sub-activity label set into a pre-obtained temporal AND-OR graph (T-AOG) model to obtain a prediction result for the future activity of the target of interest in the parking lot, wherein the T-AOG model is obtained from a pre-established corpus of target activities in the parking lot; and
a control information sending module, configured to send control information to the corresponding equipment of the parking lot based on the prediction result.
In a third aspect, an embodiment of the present invention provides an electronic device, including a processor and a memory, wherein,
the memory is used for storing a computer program;
the processor is configured to implement the steps of the parking lot management and control method based on target activity prediction according to the embodiment of the present invention when executing the program stored in the memory.
In a fourth aspect, the present invention provides a computer-readable storage medium, in which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the parking lot management and control method based on target activity prediction provided by the embodiment of the present invention.
In the solutions provided by the embodiments of the invention, the spatio-temporal AND-OR graph is introduced into the field of target activity prediction for the first time. First, target detection and tracking are performed on a scene video of the parking lot to generate a spatial AND-OR graph (S-AOG) model of the parking lot, the S-AOG representing the spatial position relations between targets. Second, sub-activities are extracted from the S-AOG model to obtain the sub-activity label set of each target of interest, realizing high-level semantic extraction from the scene video. The sub-activity label set is then used as the input to a pre-obtained temporal AND-OR graph (T-AOG) model, and the next sub-activity is predicted through the temporal grammar of the T-AOG. Finally, the prediction result is used to send control information to the corresponding equipment of the parking lot, realizing management and control of the parking lot. By using the spatio-temporal AND-OR graph, the embodiments of the invention improve both the accuracy and the real-time performance of target activity prediction, so that target activity in the parking lot can be predicted accurately and rapidly and the parking lot can be managed and controlled effectively.
Drawings
Fig. 1 is a schematic flowchart of a parking lot management and control method based on target activity prediction according to an embodiment of the present invention;
FIG. 2 is an exemplary diagram of a prior-art AND-OR graph;
FIG. 3 is a parse graph of FIG. 2;
FIG. 4 is an exemplary spatial AND-OR graph (S-AOG) at a parking lot barrier according to an embodiment of the present invention;
FIG. 5 is an exemplary temporal grammar (T-AOG) at a parking lot barrier according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a predictive parse tree at a parking lot barrier according to an exemplary embodiment of the present invention;
FIG. 7 is a diagram of the actual positions of vehicles at a parking lot barrier in an actual video;
FIG. 8 is an exemplary confusion matrix between predicted sub-activities and actual sub-activities in the parking lot;
fig. 9 is a schematic structural diagram of a parking lot management and control apparatus based on target activity prediction according to an embodiment of the present invention;
fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to predict target activity in a parking lot accurately and quickly and to manage and control the parking lot effectively, an embodiment of the invention provides a parking lot management and control method based on target activity prediction.
It should be noted that the execution subject of the parking lot management and control method based on target activity prediction according to the embodiment of the present invention may be a parking lot management and control device based on target activity prediction, and the device may be operated in an electronic device. The electronic device may be a server or a terminal device, but is not limited thereto.
In a first aspect, a parking lot management and control method based on target activity prediction according to an embodiment of the present invention is described.
As shown in fig. 1, a parking lot management and control method based on target activity prediction according to an embodiment of the present invention may include the following steps:
and S1, acquiring a scene video of the parking lot.
In the embodiment of the invention, the scene video contains at least one moving target, and a target can be a person, a vehicle, an animal, or the like.
The scene video may be obtained by a video capture device disposed at the parking lot. The video capture device may include a camera, a video camera, a still camera, a mobile phone, and the like; for example, the scene video may be shot by a camera installed on the ceiling of the parking lot.
The embodiment of the invention can acquire the scene video of the parking lot from the video capture device by communication, including but not limited to wireless and wired communication.
It can be understood that the acquired scene video contains a plurality of frames of images.
And S2, generating a spatial AND-OR graph (S-AOG) model of the parking lot by detecting and tracking the targets in the scene video.
In the embodiment of the invention, the S-AOG model represents the spatial position relations of the targets in the scene video.
To facilitate understanding of the present solution, the concepts related to AND-OR graphs are described first. The And-Or Graph (AOG) is a hierarchical compositional model of a stochastic context-sensitive grammar (SCSG). It represents a hierarchical decomposition from the top level down to leaf nodes by a set of terminal and non-terminal nodes, and embodies the basic concepts of an image grammar. An And node represents the decomposition of a target into parts, and an Or node represents alternative sub-configurations. Referring to fig. 2, fig. 2 is an exemplary diagram of a prior-art AND-OR graph. An AND-OR graph includes three types of nodes: And nodes (solid circles in fig. 2), Or nodes (dotted circles in fig. 2), and terminal nodes (rectangles in fig. 2). An And node represents the decomposition of an entity into parts; it corresponds to grammar rules such as B → ab and C → cd in fig. 2. Horizontal links among the children of a node represent spatial position relations and constraints. Or nodes act as "switches" between alternative sub-structures, and represent category labels at various levels, such as scene, object, and part categories; an Or node corresponds to a rule such as A → B | C in fig. 2. Owing to this recursive definition, the AND-OR graphs of many object or scene classes can be merged into one larger AND-OR graph; in theory, all scene and object classes can be represented by one large AND-OR graph. A terminal node, which may also be called a leaf node, represents a pixel-based high-level semantic visual dictionary. Owing to the scaling property, terminal nodes may appear at all levels of an AND-OR graph. Each terminal node takes its instances from a particular set called a dictionary, which contains various complex image patches. The elements in the set may be indexed by variables such as type, geometric transformation, deformation, and appearance change. As shown in fig. 2, the leaf nodes that compose rectangle A form a visual dictionary of the four elements a, b, c, d. The AND-OR graph thus defines a context-dependent image representation grammar in which the terminal nodes are its visual vocabulary and the And nodes and Or nodes are the production rules.
An AND-OR graph contains all possible parse graphs (pg); each parse graph is one possible configuration generated by the AND-OR graph, and is interpreted as an image. A parse graph pg consists of a hierarchical parse tree pt and a number of relations E (defined as "horizontal edges"):
pg = (pt, E) (1)
The parse tree pt is an And tree whose non-terminal nodes are all And nodes. The production rule that decomposes each And node into its parts now no longer generates a string but a configuration; see fig. 3, which is a parse graph of fig. 2 and yields the configuration rule r: B → C = ⟨a, b⟩, where C denotes the configuration. As for the probabilistic model of the AND-OR graph, probabilities are mainly learned at the Or nodes, so that a generated configuration accounts for the probability that such a configuration occurs. Of course, fig. 2 has another parse graph, comprising c and d, which is not shown here.
For an AND-OR graph, a small dictionary of parts is used to represent the objects in an image through the hierarchy of And nodes and Or nodes. Such a model captures the spatial compositional structure of the objects in the image and may be referred to as a Spatial And-Or Graph (S-AOG) model. Based on the spatial position relations of a target, the S-AOG model represents the target by hierarchically combining its components in different spatial configurations. It can therefore be used in image analysis to analyze the position relations among targets, with specific applications such as target positioning and tracking, for example target recognition and tracking in complex scenes such as traffic intersections and squares.
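The And/Or/terminal node structure described above can be sketched as a small data structure. The example reproduces the grammar of fig. 2 (A → B | C, B → ab, C → cd) and derives one parse tree by taking the most probable Or branch; the class layout is an illustrative assumption, not the patent's implementation.

```python
class Node:
    """And/Or/terminal node of an AND-OR graph (illustrative layout)."""
    def __init__(self, kind, label, children=None, probs=None):
        assert kind in ("and", "or", "terminal")
        self.kind, self.label = kind, label
        self.children = children or []
        self.probs = probs or []  # branching probabilities for an Or node

def best_parse(node):
    """Derive one parse tree by always taking the most probable Or branch."""
    if node.kind == "terminal":
        return node.label
    if node.kind == "and":
        # And node: decompose the entity into all of its parts.
        return (node.label, [best_parse(c) for c in node.children])
    # Or node: a "switch" that selects one alternative sub-structure.
    i = max(range(len(node.children)), key=lambda k: node.probs[k])
    return best_parse(node.children[i])

# The grammar of fig. 2: A -> B | C, B -> ab, C -> cd.
B = Node("and", "B", [Node("terminal", "a"), Node("terminal", "b")])
C = Node("and", "C", [Node("terminal", "c"), Node("terminal", "d")])
A = Node("or", "A", [B, C], probs=[0.7, 0.3])
```

Selecting the other Or branch would yield the second parse graph (the one comprising c and d) mentioned in the text.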
Specifically, for S2, the method may include the following steps:
First, the targets in the scene video are detected, and the category and position of each target in each frame are determined. The categories, such as person, vehicle, and animal, distinguish the types of the targets; the position is, for example, the region extent and coordinates of the target in the image.
Any target detection method can be used, such as traditional foreground-background segmentation and target clustering algorithms, or deep-learning-based target detection methods.
Secondly, the same target in different frame images is determined by utilizing a target tracking technology.
The purpose of target tracking is to locate the position of a target in each frame of the video and to generate the target's motion trajectory: given the size and position of a target in an initial frame of a video sequence, the task is to determine its size and position in the subsequent frames.
Any existing target tracking technique, such as tracking based on correlation filtering or on a convolutional neural network (CNN), may be used in embodiments of the present invention.
Third, the position relations between the targets in each frame are determined, such as distances and front-rear orientation relations.
Finally, the targets in each frame are decomposed according to their spatial relations to obtain the spatial AND-OR graph of the frame, and the per-frame spatial AND-OR graphs of the scene video are integrated to obtain the S-AOG model of the parking lot.
In an alternative embodiment, S2 may include S21-S24:
and S21, detecting the targets in the scene video by using the target detection network obtained by pre-training to obtain the attribute information corresponding to each target in each frame of image of the scene video.
The target detection network in embodiments of the invention may include R-CNN, SPP-Net, Fast R-CNN, Faster R-CNN, YOLO (You Only Look Once), SSD (Single Shot MultiBox Detector), and the like.
In an alternative embodiment, the target detection network comprises a YOLOv3 network.
The YOLOv3 network comprises a backbone network and three prediction branches. The backbone is the Darknet-53 network; YOLOv3 is a fully convolutional network that makes extensive use of residual skip connections. To reduce the negative effect of pooling on gradients, pooling is abandoned and downsampling is performed by strided convolutions: in this network structure, convolutions with a stride of 2 are used for downsampling. Meanwhile, to improve the accuracy of the algorithm on small targets, YOLOv3 adopts upsampling and feature fusion similar to FPN (Feature Pyramid Network) and performs detection on feature maps of multiple scales. The three prediction branches adopt a fully convolutional structure. Compared with traditional target detection algorithms, performing target detection with a pre-trained YOLOv3 network improves both precision and efficiency, serving the goals of prediction accuracy and real-time performance.
For the structure and specific detection process of the YOLOv3 network, refer to the related descriptions in the prior art, which are not repeated here.
Through the pre-trained YOLOv3 network, the attribute information of each target in each frame of the scene video can be obtained, the attribute information including the position information of a bounding box containing the target. The position information of the bounding box is represented as (x, y, w, h), where (x, y) are the coordinates of the centre of the bounding box and w and h are its width and height. As will be appreciated by those skilled in the art, the attribute information also includes the confidence of the bounding box, which reflects both the degree of confidence that the bounding box contains a target and the accuracy with which the box predicts the target. The confidence is defined as
confidence = Pr(Object) × IOU(truth, pred)
where, if the box contains no object, Pr(Object) = 0 and the confidence is 0; if it contains an object, Pr(Object) = 1 and the confidence equals the intersection-over-union (IOU) of the ground-truth bounding box and the predicted bounding box.
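The IOU term in the confidence definition can be computed directly from the centre-format (x, y, w, h) boxes described above; a minimal sketch:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x, y, w, h) boxes,
    where (x, y) is the box centre and w, h its width and height."""
    # Convert centre format to corner format.
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    # Overlap extents, clamped at zero when the boxes are disjoint.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union else 0.0
```

A perfect prediction gives IOU = 1 and thus confidence = 1; disjoint boxes give confidence = 0.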
As will be understood by those skilled in the art, the attribute information also includes category information of the target, indicating its category such as person, vehicle, or animal. For vehicles, the category information may specifically include car, van, electric vehicle, and the like.
It should be noted that a frame of video often contains many targets; some are far away or too small, or do not belong to the "targets of interest" in the parking lot, and there is no purpose in detecting them. For a parking lot, for example, moving vehicles and people are of interest, while a fire hydrant by the road is not. Thus, in a preferred embodiment, the YOLOv3 network can be configured in the pre-training stage to detect a preset number of targets per frame, for example 30 or 40. Meanwhile, the YOLOv3 network is trained with labelled training samples that reflect the detection purpose, so that the trained network, applied to a scene video of unknown targets as a test sample, obtains the attribute information of the preset number of purposeful targets in each frame, improving both detection efficiency and purposefulness.
Accordingly, before S21, the YOLOv3 network needs to be pre-trained for the parking lot. As those skilled in the art will understand, the sample data used in pre-training are sample scene videos of the parking lot scene and sample attribute information, where the sample attribute information includes the category information of the targets in each frame of the sample scene videos and the position information of the bounding boxes containing the targets.
The pre-training process can be briefly described as the following steps:
1) Take the attribute information of the targets in each frame of the sample scene video as the ground truth for that frame, and train the YOLOv3 network on each frame and its ground truth to obtain a training result for each frame.
2) Compare the training result of each frame with the ground truth of that frame to obtain an output result for the frame.
3) Calculate the loss value of the network from the output results of the frames.
4) Adjust the network parameters according to the loss value, and repeat steps 1)-3) until the loss value satisfies a convergence condition, i.e. reaches a minimum, which means that the training result of each frame is consistent with the ground truth for that frame. The training is then complete, yielding the pre-trained YOLOv3 network.
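The train/compare/loss/adjust cycle above can be illustrated with a toy one-parameter model fitted by gradient descent; the model and loss below are stand-ins for the YOLOv3 network and its detection loss, purely for illustration.

```python
def train_toy(samples, lr=0.01, tol=1e-9, max_epochs=10000):
    """Fit y = w * x by gradient descent, stopping when the loss converges."""
    w = 0.0
    prev = float("inf")
    for _ in range(max_epochs):
        # Steps 1)-3): predict, compare with ground truth, compute the loss.
        loss = sum((w * x - y) ** 2 for x, y in samples)
        if abs(prev - loss) < tol:   # convergence condition of step 4)
            break
        # Step 4): adjust the parameter from the loss gradient, then repeat.
        grad = sum(2 * x * (w * x - y) for x, y in samples)
        w -= lr * grad
        prev = loss
    return w
```

On data generated by y = 2x, the loop converges to w ≈ 2 within a few dozen epochs.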
For a parking lot, a large number of sample scene videos need to be obtained in advance and labelled manually or by machine, yielding the category information of the targets in each frame of each sample scene video and the position information of the bounding boxes containing them; through the pre-training process, the YOLOv3 network acquires target detection capability for this scene.
In an alternative embodiment, the pre-trained YOLOv3 network is trained on the MARS dataset and the Vehicle Re-ID Datasets Collection, a vehicle re-identification dataset collection. As those skilled in the art will appreciate, both are open-source datasets: the MARS dataset (Motion Analysis and Re-identification Set) targets pedestrians, and the Vehicle Re-ID Datasets Collection targets vehicles.
And S22, matching the same target across the frames of the scene video with a preset multi-target tracking algorithm, based on the attribute information of each target in each frame.
Early target detection and tracking focused mainly on pedestrian detection; the detection idea was mainly to realize detection with traditional feature-point detection methods and then realize tracking by filtering and matching the feature points. For example, pedestrian detection based on histogram of oriented gradients (HOG) features suffered from missed detections, false alarms, repeated detections and similar problems. With the development of deep convolutional neural networks in recent years, various methods have appeared that perform target detection and tracking using high-precision detection results.
Since multiple targets exist in the parking lot targeted by the embodiment of the present invention, target tracking needs to be implemented with a Multiple Object Tracking (MOT) algorithm. The multi-target tracking problem can be regarded as a data association problem, aiming at associating cross-frame detection results in a video frame sequence. By tracking and detecting the targets in the scene video with a preset multi-target tracking algorithm, the bounding boxes of the same target across different frame images and the target's ID (identity) can be obtained, i.e. matching of the same target in each frame image is realized.
In an optional implementation manner, the preset multi-target tracking algorithm may include: the SORT (Simple Online and Realtime Tracking) algorithm.
The SORT algorithm uses a tracking-by-detection (TBD) approach: Kalman filtering realizes target motion-state estimation, and the Hungarian assignment algorithm performs position matching. The SORT algorithm uses no appearance features of the targets in the tracking process; only the position and size of the bounding box are used for motion estimation and data association. The complexity of the SORT algorithm is therefore low, the tracker can reach a rate of 260 Hz, and the target tracking and detection speed is high, which can meet the real-time requirement of the scene video in the embodiment of the present invention.
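The data-association step at the heart of SORT can be sketched as follows. This is a simplified illustration: real SORT predicts each track's box with a Kalman filter and matches with the Hungarian algorithm, whereas this sketch matches raw boxes greedily by IoU; all box values are illustrative.

```python
def iou(a, b):
    """Intersection-over-union of boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def associate(tracks, detections, iou_threshold=0.3):
    """Return {track_id: detection_index}, keeping each target's ID across frames."""
    pairs = sorted(((iou(t, d), tid, di)
                    for tid, t in tracks.items()
                    for di, d in enumerate(detections)), reverse=True)
    matches, used_t, used_d = {}, set(), set()
    for score, tid, di in pairs:
        if score < iou_threshold:
            break  # remaining pairs overlap too little to be the same target
        if tid not in used_t and di not in used_d:
            matches[tid] = di
            used_t.add(tid)
            used_d.add(di)
    return matches

# Tracks from the previous frame (with IDs) and detections in the current frame.
tracks = {7: (0, 0, 10, 10), 8: (50, 50, 60, 60)}
detections = [(49, 51, 59, 61), (1, 1, 11, 11)]
print(associate(tracks, detections))  # {8: 0, 7: 1}
```

Only box position and size are used, consistent with SORT's appearance-free design described above.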
Because the SORT algorithm neither considers occlusion nor performs target re-identification through appearance features, it is more suitable for parking lots in which targets are not occluded, for example where crowd density is low and no occlusion occurs.
In another optional embodiment, the preset multi-target tracking algorithm may include: the DeepSORT (Simple Online and Realtime Tracking with a Deep Association Metric) algorithm.
DeepSORT is an improvement on SORT target tracking. A Kalman filtering algorithm is used for track preprocessing and state estimation, associated with the Hungarian algorithm. On top of the SORT algorithm, a deep learning model trained on a pedestrian re-identification dataset is introduced, and nearest-neighbor matching is performed by extracting deep appearance features of the targets, in order to handle occlusion of targets in the video and the problem of frequent target-ID switching when tracking on real-time video. The core idea of DeepSORT is to use recursive Kalman filtering and frame-to-frame data association for tracking. A Deep Association Metric is added to DeepSORT on the basis of SORT in order to distinguish different pedestrians, and Appearance Information is added to realize target tracking under longer occlusions. The algorithm performs more accurately than SORT in real-time multi-target tracking.
For the specific tracking procedure of the SORT algorithm and the DeepSort algorithm, please refer to the related prior art for understanding, and the detailed description thereof is omitted here.
And S23, determining the actual spatial distance between different objects in each frame of image.
Through the target detection and tracking in the previous steps, the position information of each target of each frame of image in the scene video can be obtained, but the position information of each target is not enough to represent the relation of each target in the parking lot. Therefore, this step requires determining the actual spatial distance between different objects in each frame of image, and defining the spatial composition relationship of the objects by using the actual spatial distance between the two objects. Therefore, accurate results can be obtained when the constructed space and/or graph model is used for prediction in the follow-up process.
In an alternative embodiment, the principle of equal-proportion scaling may be used to determine the actual spatial distance between two targets in the image. Specifically, the actual spatial distance between two test targets may be measured in the scene (in this scheme, a parking lot scene), a frame image containing the two test targets is captured, and the pixel distance between the two test targets in the image is then calculated, so as to obtain the number of pixels corresponding to a unit length in reality, for example the number of pixels corresponding to 1 meter. Then, for two new targets whose actual spatial distance is to be detected, the pixel distance between the two targets in a frame image shot in the scene is divided by the number of pixels per unit length, yielding the actual spatial distance of the two targets.
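The equal-proportion scaling described above can be sketched as follows; the pixel coordinates and the 5-metre calibration measurement are illustrative.

```python
import math

def pixel_distance(p, q):
    """Euclidean distance between two pixel coordinates."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def calibrate(test_pixel_a, test_pixel_b, measured_metres):
    """Pixels corresponding to one metre in the real scene."""
    return pixel_distance(test_pixel_a, test_pixel_b) / measured_metres

def actual_distance(pixel_a, pixel_b, pixels_per_metre):
    """Scale a pixel distance to an actual spatial distance in metres."""
    return pixel_distance(pixel_a, pixel_b) / pixels_per_metre

# Two test targets measured 5 m apart appear 400 px apart in the calibration frame.
ppm = calibrate((100, 200), (500, 200), 5.0)         # 80 px per metre
print(actual_distance((120, 300), (360, 300), ppm))  # 240 px -> 3.0 m
```

As noted below, this only holds for an undistorted image; with lens distortion the frame must be rectified before the pixel distance is measured.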
It will be appreciated that this approach is simple to implement, but it is only suitable when the image is not distorted. When an image is distorted, pixel coordinates and physical coordinates no longer correspond one to one, and the distortion needs to be corrected first, for example by rectifying the picture with cvInitUndistortMap and cvRemap. The implementation of such scaling and the specific process of image distortion correction can be understood with reference to the related art, and are not described herein again.
Alternatively, a monocular distance measurement may be used to determine the actual spatial distance between two targets in the image.
The monocular camera model may be considered approximately as a pinhole model. Namely, the distance measurement is realized by using the pinhole imaging principle. Optionally, a similar triangle may be constructed through a spatial position relationship between the camera and the actual object and a position relationship of the target in the image, and then an actual spatial distance between the targets is calculated.
Optionally, a correlation algorithm of a monocular distance measurement mode in the prior art may be used: the horizontal distance d_x and the vertical distance d_y between the actual position of a pixel point and the video shooting device (video camera/camera) are calculated using the pixel coordinates of the target's pixel point, thus realizing monocular distance measurement. Then, from the known actual coordinates of the video shooting device together with d_x and d_y, the actual coordinates of the pixel point are derived. For two targets in the image, the actual spatial distance between them can then be calculated using their actual coordinates.
In an optional implementation manner, the actual spatial distance between two targets in the image may be determined by calculating the actual coordinate points corresponding to the targets' pixel points, i.e. the actual coordinates of the pixel points are calculated first.
Optionally, a monocular visual positioning and ranging technique may be employed to obtain the actual coordinates of the pixels.
The monocular vision positioning distance measuring technology has the advantages of low cost and fast calculation. Specifically, two modes can be included:
1) and obtaining the actual coordinates of each pixel by utilizing positioning measurement interpolation.
Taking advantage of the equal-proportion enlargement of the pinhole imaging model, the measurement can be performed by directly printing paper covered with equidistant array dots. Equidistant array points (such as a calibration plate) are measured at a given distance and interpolated, and equal-proportion enlargement then gives the actual ground coordinates corresponding to each pixel point. This eliminates the need to manually measure graphical marks on the ground. After the dot pitch on the paper is measured, enlargement by the height ratio H/h gives the actual ground coordinates corresponding to each pixel. To prevent the keystone distortion at the upper edge of the image from becoming so severe that the mark points on the printing paper cannot be identified, this method requires equidistant dot maps prepared for different distances.
2) And calculating the actual coordinates of the pixel points according to the similar triangular proportion.
The main idea of this approach is still the pinhole imaging model, but it places higher requirements on the calibration of the video shooting equipment (video camera/still camera/camera) and requires the distortion caused by the lens to be small; in exchange, the method has stronger portability and practicability. The camera may be calibrated, for example, with MATLAB or OpenCV, after which the conversion calculation of the pixel coordinates in the image is performed.
An alternative implementation of this mode is described below; that is, S23 may include S231 to S233:
s231, determining the pixel coordinates of each target in each frame of image;
for example, the pixel coordinates of all pixel points in the bounding box containing the target may be determined as the pixel coordinates of the target; or a pixel point on or inside the bounding box may be selected as the pixel coordinate of the target, i.e. the pixel coordinate is used to represent the target. For example, the coordinate of the center position of the bounding box may be selected as the pixel coordinate of the target, and so on.
S232, aiming at each target, calculating the corresponding actual coordinate of the pixel coordinate of the target in a world coordinate system by using a monocular vision positioning and ranging technology;
the pixel coordinates of any pixel point in the image are known. The imaging process of a camera involves four coordinate systems, namely the world coordinate system, the camera coordinate system, the image physical coordinate system (also called the imaging plane coordinate system) and the pixel coordinate system, together with the transformations between them. The transformation relationships between these four coordinate systems are known and derivable in the prior art. The actual coordinates, in the world coordinate system, of a pixel point in the image can therefore be calculated using the coordinate-system transformation formulas; for example, many public algorithm programs in OpenCV obtain the actual coordinates in the world coordinate system from the pixel coordinates. Specifically, in some OpenCV programs the corresponding world coordinates are obtained by inputting the camera parameters, rotation vector, translation vector, pixel coordinates and the like into a correlation function.
Assume that the actual coordinates, in the world coordinate system, of the center position of the bounding box representing target A are (X_A, Y_A), and that the actual coordinates corresponding to the center position of the bounding box representing target B are (X_B, Y_B). Further, if target A has an actual height, its actual coordinates are corrected by the similar-triangle ratio (H - h)/H, where h is the actual height of target A and H is the height of the video capture device.
And S233, aiming at each frame of image, obtaining the actual space distance between every two targets in the frame of image by using the actual coordinates of every two targets in the frame of image.
The method of calculating the distance between two points from their actual coordinates belongs to the prior art. For the above example, without considering the actual heights of the targets, the actual spatial distance D between targets A and B is: D = √((X_A - X_B)² + (Y_A - Y_B)²). The case in which the actual heights of the targets are considered is similar.
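Assuming the pixel-to-world mapping of S232 has been reduced to a ground-plane homography (in practice obtained from camera calibration, e.g. with OpenCV), S232-S233 can be sketched as follows; the matrix H below and the pixel coordinates are hypothetical.

```python
import math

def pixel_to_world(h, u, v):
    """Map pixel (u, v) to ground-plane world coordinates via 3x3 homography h."""
    x = h[0][0] * u + h[0][1] * v + h[0][2]
    y = h[1][0] * u + h[1][1] * v + h[1][2]
    w = h[2][0] * u + h[2][1] * v + h[2][2]
    return (x / w, y / w)  # homogeneous divide

# A hypothetical homography: 100 px per metre, world origin at pixel (320, 240).
H = [[0.01, 0.0, -3.2],
     [0.0, 0.01, -2.4],
     [0.0, 0.0, 1.0]]

xa, ya = pixel_to_world(H, 320, 240)   # centre of target A's bounding box
xb, yb = pixel_to_world(H, 620, 640)   # centre of target B's bounding box
d = math.hypot(xa - xb, ya - yb)       # S233: actual spatial distance
print(round(d, 6))  # 5.0
```

Here target A maps to roughly (0, 0) and target B to roughly (3, 4) metres, so the distance formula of S233 gives D = √(3² + 4²) = 5 m.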
Optionally, if multiple pixel coordinates of targets A and B are obtained in S231, it is also reasonable to calculate multiple actual distances between targets A and B using the multiple pixel coordinates, and then select one of them as the actual spatial distance between A and B according to a certain selection criterion, for example selecting the minimum actual distance as the actual spatial distance between A and B.
Details of the above solutions can be found in the related concepts of computer vision, camera calibration, the world coordinate system, the camera coordinate system, the image physical coordinate system (also called the imaging plane coordinate system), the pixel coordinate system, LabVIEW vision development and examples, and OpenCV correlation algorithms and calibration examples, which are not described herein again.
In an optional implementation, determining the actual spatial distance between different targets in each frame of image may also be implemented by using a binocular camera optical image ranging method.
A binocular camera works like the two human eyes: the images of the same object shot by the two cameras differ because of their different angles and positions. This difference is called parallax, and its magnitude is related to the distance between the object and the cameras, so the target can be positioned according to this principle. Binocular-camera optical image ranging is realized by calculating the parallax between the two images shot by the left and right cameras. The specific method is similar to monocular-camera optical image ranging, but gives more accurate ranging and positioning information than a monocular camera. For the specific ranging process of the binocular-camera optical image ranging method, reference is made to the related prior art, and details are not repeated here.
In an alternative embodiment, determining the actual spatial distance between different objects in each frame of image may also include:
and aiming at each frame of image, obtaining the actual spatial distance between the two targets in the frame of image by using a depth camera ranging method.
The depth camera ranging method can directly obtain the depth information of the target from the image, the actual space distance between the target and the video shooting equipment can be accurately and quickly obtained without coordinate calculation, and therefore the actual space distance between the two targets is determined. For a specific distance measurement process of the depth camera distance measurement method, please refer to the related prior art, which is not described herein.
And S24, generating a space and OR model of the parking lot by using the attribute information and the actual space distance of the target corresponding to each matched frame image.
In this step, for each frame image, the detected objects and the attribute information of the objects are used as leaf nodes of the space and or graph, and the actual space distance between different objects is used as the space constraint of the space and or graph, so as to generate the space and or graph of the frame image. And forming a space and OR diagram model of the parking lot by the space and OR diagrams of all the frame images.
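The per-frame space and or graph described in this step can be represented as a plain data structure, as sketched below; the field names and the example targets are illustrative, not the patent's actual encoding.

```python
def build_spatial_aog(frame_id, targets, distances):
    """targets: {id: attribute dict}; distances: {(id_a, id_b): metres}.

    The frame is the root node, each detected target with its attribute
    information is a leaf node, and the actual spatial distances act as
    the spatial constraints of the graph.
    """
    return {
        "root": frame_id,
        "leaf_nodes": [
            {"id": tid, **attrs} for tid, attrs in targets.items()
        ],
        "spatial_constraints": [
            {"pair": pair, "distance_m": d} for pair, d in distances.items()
        ],
    }

# One frame at the barrier: a fence and a car, as in the fig. 4 example.
aog = build_spatial_aog(
    "frame_001",
    {
        "fence_1": {"class": "fence", "bbox": (10, 40, 60, 200)},
        "car_2": {"class": "car", "bbox": (150, 80, 380, 220)},
    },
    {("fence_1", "car_2"): 4.2},
)
print(len(aog["leaf_nodes"]), aog["spatial_constraints"][0]["distance_m"])  # 2 4.2
```

The space and or graph model of the parking lot would then simply be the sequence of such per-frame structures.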
Referring to fig. 4, fig. 4 is a space and/or diagram of a parking lot barrier according to an exemplary embodiment of the present invention.
The top view in fig. 4 represents a frame image at a parking lot barrier, which is the root node of the space and or graph. Two targets are detected by the above method; they are shown in the lower-left and lower-right images of fig. 4. The left image is fence 1; it is marked with the category information 'nonce' representing the fence and with the fence's bounding box. The right image is vehicle 2; it is labeled with the category information 'car' indicating a car and with the vehicle's bounding box. The above category information and the position information of the bounding boxes constitute the attribute information of the targets. The same target in different frame images, such as vehicle 2, is also labeled with its ID, which distinguishes the car from other targets across frame images; the ID can be represented by a number or a symbol.
These two objects, and the corresponding attribute information, are leaf nodes of the space and or graph. Where the actual spatial distance between the two objects serves as a spatial and/or spatial constraint of the map (not shown in fig. 4).
For the generation process of a space and/or diagram, reference may be made to the description of related prior art, which is not described herein again.
Further, after the space and or graph model of the parking lot is generated, new scenes and new spatial position relationships between targets can be generated using it. For example, the space and or graph models of several parking lots can be integrated to obtain a new space and or graph model covering multiple parking lots, thereby realizing scene expansion.
And S3, obtaining a sub-activity label set representing the activity state of the attention target by using a sub-activity extraction algorithm on the space and or graph model.
S1-S2 realize the detection of the leaf nodes of the space and or graph. In this step, the sub-activities are extracted to obtain an event sequence composed of sub-activities, so as to express the whole event represented by the scene video. It should be noted that the sub-activities extracted in this step are the actual target activities, described in terms of the leaf nodes of the and or graph.
In an alternative embodiment, S3 may include S31-S34:
before S31, the sub-activity tag set, a string array for storing sub-activity tags, may be initialized to empty. Then, S31 to S34 are executed.
S31, determining pairs of targets in the space and or graph model whose actual spatial distance is smaller than a preset distance threshold as attention targets;
optionally, in the space and or graph model, a pair of targets whose actual spatial distance in the space and or graph corresponding to the first frame image is smaller than the preset distance threshold is determined as attention targets.
If the actual spatial distance between two targets is small, the two targets may have more activity contacts, such as approaching or colliding; therefore, they need to be continuously observed as attention targets so that their future activity can be predicted. Conversely, if the actual spatial distance between two targets is large, the two targets are less likely to have activity intersections, and corresponding activity prediction is unnecessary.
Therefore, in the first frame image, the actual spatial distance d between different objects is calculated, and a pair of objects whose actual spatial distance d is smaller than the preset distance threshold minDis is determined as the attention object. Different sizes of preset distance threshold minDis may be set for different parking lots, such as in a parking lot where a safe distance between objects (vehicles or people) is of interest, minDis may be 10 meters, etc.
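The determination of attention targets in S31 can be sketched as follows; the coordinates and the 10 m threshold minDis are illustrative.

```python
import math
from itertools import combinations

def attention_targets(positions, min_dis=10.0):
    """positions: {target_id: (x, y)} in metres; returns pairs closer than min_dis."""
    pairs = []
    for (ida, pa), (idb, pb) in combinations(positions.items(), 2):
        d = math.hypot(pa[0] - pb[0], pa[1] - pb[1])
        if d < min_dis:                       # d < minDis: continue observing this pair
            pairs.append((ida, idb, round(d, 2)))
    return pairs

# World coordinates of the targets detected in the first frame image.
frame = {"car_1": (0.0, 0.0), "person_2": (3.0, 4.0), "car_3": (40.0, 0.0)}
print(attention_targets(frame))  # [('car_1', 'person_2', 5.0)]
```

Here car_1 and person_2 are 5 m apart and become attention targets, while car_3 is too far from both to warrant activity prediction.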
Optionally, for S31, it may be:
for each frame image except the last frame image in the space and or graph model, the pairs of targets whose actual spatial distance in the corresponding space and or graph is smaller than the preset distance threshold are determined as attention targets.
Namely, the operation of determining the attention target is carried out in each frame of image except the last frame of image, so as to find more attention targets in time.
S32, for each frame image, the actual spatial distance of each pair of the objects of interest and the velocity value of each object of interest are determined.
At this step, starting from the first frame image, the actual spatial distance d of attention targets smaller than the preset distance threshold minDis may be saved in Distance[x]; Distance[x] is a multi-dimensional array that holds the actual spatial distance d between different targets, where x denotes the sequence number of the image; x = 1 denotes the first frame image, for example.
Meanwhile, a speed value of the same attention target in each frame image can be calculated, and the speed value refers to the speed of the attention target in the current frame of the scene video. The calculation method of the velocity value of the target is briefly described below:
To calculate the speed value of a target, the moving distance s and the moving time t of the target between the previous and next frame images are required. The frame rate FPS of the camera is calculated first. Specifically, in the development software OpenCV, the number of frames per second FPS of the video can be read with the built-in get(CAP_PROP_FPS) or get(CV_CAP_PROP_FPS) method.
Sampling once every k frames, there is:
t = k/FPS (s) (3)
Thus, the velocity value v of the target can be calculated by:
v = √((X_2 - X_1)² + (Y_2 - Y_1)²)/t (4)
wherein (X_1, Y_1) and (X_2, Y_2) respectively represent the actual coordinates of the target in the previous frame image and the next frame image, which can be obtained through step S232. Since calculating the velocity value of the target in the current frame image requires both the previous frame image and the current frame image, the velocity value of the target can be obtained starting from the second frame image.
The speed of the attention target in the video can be calculated through the method, and the attention target is marked in the image. For example, in each frame image, a corresponding velocity value, such as 9.45m/s, is identified beside the bounding box of each object of interest.
For the same object of interest, the velocity value in the first frame image may be denoted by v1, the velocity value in the second frame image may be denoted by v2, …, and so on.
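The velocity computation of equations (3) and (4) can be sketched as follows; the coordinates, frame rate and sampling interval are illustrative.

```python
import math

def target_speed(prev_xy, curr_xy, fps, k=1):
    """Speed in m/s between two sampled frames k frames apart."""
    t = k / fps                                        # equation (3)
    s = math.hypot(curr_xy[0] - prev_xy[0],
                   curr_xy[1] - prev_xy[1])            # displacement in metres
    return s / t                                       # equation (4)

# Target moved from (2.0, 3.0) to (2.9, 3.3) metres over 3 frames of a 30 FPS video.
v = target_speed((2.0, 3.0), (2.9, 3.3), fps=30.0, k=3)
print(round(v, 2))  # 9.49
```

The rounded value would be the per-frame velocity annotation shown beside the attention target's bounding box.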
And S33, sequentially comparing the next frame image with the previous frame image to obtain distance change information representing the actual space distance change condition of each pair of attention targets and speed change information representing the speed value change condition of each attention target.
For example, for two attention targets E and F: if their actual spatial distance in the previous frame image is 30 and their actual spatial distance in the next frame image is 20, it is known that the actual spatial distance between them has decreased, which is their distance change information. Similarly, if the velocity value of E in the previous frame image is 8 m/s and its velocity value in the next frame image is 10 m/s, it is known that E has become faster, which is its velocity change information.
And obtaining the distance change information and the speed change information of each concerned target corresponding to each frame image, which are generated in sequence until the images of all the frames are traversed.
And S34, describing the distance change information and the speed change information sequentially obtained by each concerned target by using the semantic tags, and generating a sub-activity tag set representing the activity state of each concerned target.
This step describes the distance change information and the speed change information semantically, in textual form such as accelerating, decelerating, approaching and moving away, to obtain sub-activity labels representing the activity state of the attention targets; the sequentially occurring sub-activity labels corresponding to each frame image finally form the sub-activity label set. The sub-activity label set embodies the sequence of sub-events of the scene video. The embodiment of the invention thus describes the scene video with the sub-activity label set, i.e. the semantic description of the whole video is obtained by combining the different sub-activities of each target in the video, realizing semantic extraction of the scene video.
The sub-activity definitions in embodiments of the present invention may refer to the manner in which sub-activity label definitions in the CAD-120 dataset are defined, and the shorter label schema helps to generalize nodes of the AND-OR graph.
Through the above steps, the complete sub-activity tag set can be obtained.
According to the embodiment of the invention, aiming at the scene of the parking lot, when target activities (events) are analyzed, sub-activities (namely sub-events) in the scene can be defined, and each sub-activity can obtain a sub-activity label through the methods of target detection, tracking and speed calculation. The following sub-activity tags may be specifically defined:
parking (car_stopping), person stationary (person_stopping), person and vehicle moving apart (away), vehicle accelerating (accelerate), vehicle decelerating (decelerate), vehicle at uniform speed (moving-uniform), person and vehicle approaching (closing), no person or vehicle (None), vehicle approaching the barrier (closing), vehicle moving away from the barrier (away), vehicle passing (passing), collision (shock), and the like.
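The mapping from distance and speed changes (S33) to the semantic sub-activity labels above (S34) can be sketched as follows; the thresholds and the chosen label subset are illustrative simplifications.

```python
def distance_label(prev_d, curr_d, eps=0.1):
    """Describe the distance change between a pair of attention targets."""
    if curr_d < prev_d - eps:
        return "closing"       # targets approaching
    if curr_d > prev_d + eps:
        return "away"          # targets moving apart
    return None

def speed_label(prev_v, curr_v, eps=0.1):
    """Describe the speed change of one attention target."""
    if curr_v < eps and prev_v < eps:
        return "car_stopping"
    if curr_v > prev_v + eps:
        return "accelerate"
    if curr_v < prev_v - eps:
        return "decelerate"
    return "moving-uniform"

# Distances/speeds of one pair of attention targets over four sampled frames.
distances = [30.0, 20.0, 12.0, 14.0]
speeds = [8.0, 10.0, 10.0, 6.0]
sub_activities = []
for i in range(1, len(distances)):
    sub_activities.append(distance_label(distances[i - 1], distances[i]))
    sub_activities.append(speed_label(speeds[i - 1], speeds[i]))
print(sub_activities)
# ['closing', 'accelerate', 'closing', 'moving-uniform', 'away', 'decelerate']
```

The resulting label sequence is the sub-activity tag set that S4 feeds into the time and or graph model.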
It is understood that if the determination of the attention object is performed for each frame image except the last frame image in S31, the sub-activity tab set obtained using S32 to S34 includes a greater number of attention objects, for example, some attention objects are determined based on the second frame image, and so on.
And S4, inputting the sub-activity label set into a time and OR model obtained in advance, and obtaining a prediction result of the future activity of the attention target in the parking lot.
The time and or graph model is obtained by using a pre-established activity corpus of the targets of the parking lot.
The scene of the parking lot needs to be modeled in advance to represent target activities (events). To construct the time and or graph (T-AOG), the activity corpus of the targets of the parking lot needs to be obtained. The corpus can be regarded as prior knowledge of the videos of the parking lot; the more comprehensive the target activities (events) contained in the corpus, the more accurate the constructed T-AOG model.
The time and OR graph model construction process comprises the following steps:
firstly, observing a sample scene video of the parking lot, extracting corpora of various events related to the target in the sample scene video, and establishing a movable corpus of the target of the parking lot.
The activity corpus of the target in the parking lot represents the activity state of the target by a sub-activity label, and the event is composed of a set of sub-activities.
By analyzing different sample scene videos of the parking lot, a corpus of events is obtained; the corpora are possible combinations of leaf nodes appearing in time order. For example, the following corpus may represent a video: "closing person_stopping moving_uniform walking away", which can be expressed as: the person and vehicle approach each other, the person stays still, the vehicle passes at a constant speed, the vehicle parks, the person and vehicle pass, and the person and vehicle move apart.
The embodiment of the invention requires that the obtained scene corpus contains the events in the scene as much as possible, so that the target activity prediction can be more accurate.
And secondly, learning the symbol grammar structure of each event by using an ADIOS-based grammar induction algorithm for the target activity corpus of the parking lot, and taking the sub-activities as terminal nodes of the time and OR graph to obtain a time and OR graph model.
Specifically, the ADIOS-based grammar induction algorithm learns And nodes and Or nodes by generating significant patterns and equivalence classes. The algorithm first loads the activity corpus onto a graph whose vertices are sub-activities, extended by two special symbols (start and end). Each event sample is represented by a separate path on the graph. Candidate patterns are then generated by traversing the different search paths. At each iteration, each sub-path is tested for statistical significance according to a context-sensitivity criterion; the significant patterns are identified as And nodes. The algorithm then finds equivalence classes by looking for units that are interchangeable in a given context; an equivalence class is identified as an Or node. At the end of the iteration, the significant pattern is added as a new node to the graph, replacing the sub-paths it contains. Raw sequence data of symbolic sub-activities can be obtained from the activity corpus of the targets of the parking lot, and the symbolic grammar structure of each event can be learned from this raw sequence data using the ADIOS-based grammar induction algorithm. Shorter significant patterns tend to be used in embodiments of the present invention so that basic grammar elements can be captured. As an example, the T-AOG generated using the parking lot corpus is shown in fig. 5; fig. 5 is a diagram of the time grammar (T-AOG) at a barrier of a parking lot according to an embodiment of the present invention. The double-line circle and single-line circle nodes are And nodes and Or nodes, respectively. The numbers on the branch edges of an Or node (fractions less than 1) represent branch probabilities. The numbers on the edges of an And node represent the temporal expansion order.
After obtaining the time and or map model, for S4, the following steps may be included:
and inputting the sub-activity label set into the time and or graph model, and obtaining the prediction result of the future activity of the attention targets in the parking lot by using an online symbolic prediction algorithm based on an Earley parser, wherein the prediction result includes the future sub-activity labels of the attention targets and their occurrence probability values.
The sub-activity labels represent the position relation or motion state of the paired attention targets at the future moment. For S4, it may be that the sub-activity label set containing each pair of objects of interest is input into a time and or graph model, and then the prediction result may include the future sub-activity labels and probability values of occurrence for each pair of objects of interest. It is of course reasonable to input a sub-activity label set containing a certain pair of objects of interest into the time and or map model to obtain the future sub-activity labels and the probability values of occurrence of the pair of objects of interest.
The embodiment of the invention constructs the T-AOG from the activity corpus of the targets of the parking lot, uses the sub-activity label set obtained with the S-AOG as the input of the T-AOG, and then predicts the next possible sub-activity on the T-AOG with an online symbolic prediction algorithm based on an Earley parser. The Earley parser is an algorithm for parsing sentences of a given context-free language, designed on the idea of dynamic programming.
The symbolic prediction algorithm of the Earley parser is described below. The Earley parser reads the terminal symbols in order, creating a set of all pending derivations (states) that is consistent with the input up to the current terminal symbol. Given the next input symbol, the parser iteratively performs one of three basic operations (predict, scan, and complete) on each state in the current state set.
In the following description, α, β, and γ denote arbitrary strings of terminal and/or non-terminal symbols (including the empty string), A1 and B1 denote single non-terminal symbols, and T denotes a terminal symbol.
Earley's "dotted rule" notation is used to analyze a string: for a production A1 → α β, the dotted rule A1 → α · β indicates that α has already been parsed and β is still to be predicted.
The input position n is defined as the position after the nth symbol has been accepted; input position 0 is the position before any input. At each input position m, the parser generates a state set S(m). Each state is a tuple (A1 → α · β, i) consisting of:
(1) the production currently being matched (A1 → α β);
(2) the dot "·", which indicates the current parsing position: α has been parsed and β is still to be predicted;
(3) i, the origin position at which matching of this production started. For the span [i, j] of a parsed substring, the integer i is the state's start point (the start of the parsed substring), the integer j is the state's end point (the end of the parsed substring), and i ≤ j.
The parser will repeatedly perform three operations: predict, scan, and complete:
prediction (Predictor): for each state in S(m) of the form (A1 → α · B1 β, i), where the dot is followed by the non-terminal B1, add (B1 → · γ, m) to S(m) for every production B1 → γ in the grammar;
scanning (Scanner): for each state in S(m) of the form (A1 → α · T β, i), if T is the next symbol in the input stream, the dot is moved one symbol to the right, since T is a terminal symbol; that is, (A1 → α T · β, i) is added to S(m + 1);
completion (Completer): for each completed state in S(m) of the form (A1 → γ ·, j), find every state in S(j) of the form (B1 → α · A1 β, i), and add (B1 → α A1 · β, i) to S(m);
in this process, duplicate states are not added to the state set. These three operations are repeated until no new state can be added to the state set.
The steps performed with respect to the symbolic prediction algorithm of the Earley parser may include:
Let the input sentence contain n words, and record the inter-word positions as 0, 1, …, n, so that n + 1 charts are generated.
Step one: for each T-AOG rule of the form S → a, add the dotted state S → · a, [0, 0] to chart[0].
Step two: for each state in chart[i], if the current state is an "incomplete state" and the dot is not followed by a terminal symbol T, execute the Predictor; if the current state is an "incomplete state" and the dot is followed by a terminal symbol T, execute the Scanner; if the current state is a "complete state", execute the Completer.
Step three: if i is less than n, jump to step two; otherwise, the analysis ends.
Step four: if a state of the form S → a ·, [0, n] is finally obtained, the input string is accepted as a legal sentence; otherwise, the analysis fails.
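The chart-parsing steps above, together with the pending-state scan used for prediction, can be sketched as follows. The grammar here is a hand-written toy version of the barrier T-AOG of fig. 5 (without branch probabilities), not the learned model; states are stored as tuples (left-hand side, right-hand side, dot position, origin).

```python
def earley_chart(grammar, words, start='S'):
    """Build Earley charts chart[0..n] for a context-free grammar given as
    a dict mapping each non-terminal to a list of right-hand sides."""
    n = len(words)
    chart = [set() for _ in range(n + 1)]
    for rhs in grammar[start]:
        chart[0].add((start, tuple(rhs), 0, 0))
    for m in range(n + 1):
        agenda = list(chart[m])
        while agenda:
            lhs, rhs, dot, origin = agenda.pop()
            if dot < len(rhs) and rhs[dot] in grammar:        # Predictor
                for prod in grammar[rhs[dot]]:
                    new = (rhs[dot], tuple(prod), 0, m)
                    if new not in chart[m]:
                        chart[m].add(new); agenda.append(new)
            elif dot < len(rhs):                              # Scanner
                if m < n and words[m] == rhs[dot]:
                    chart[m + 1].add((lhs, rhs, dot + 1, origin))
            else:                                             # Completer
                for l2, r2, d2, o2 in list(chart[origin]):
                    if d2 < len(r2) and r2[d2] == lhs:
                        new = (l2, r2, d2 + 1, o2)
                        if new not in chart[m]:
                            chart[m].add(new); agenda.append(new)
    return chart

def predict_next(grammar, chart):
    """Scan the pending states of the last chart for the terminals that may
    come next, i.e. the next possible sub-activity labels."""
    return {rhs[dot] for _, rhs, dot, _ in chart[-1]
            if dot < len(rhs) and rhs[dot] not in grammar}

# Toy barrier grammar: an event is "closing", optionally a stop-and-go
# phase (AND12), then "passing" and "away".
grammar = {
    'S': [['closing', 'AND12', 'passing', 'away'],
          ['closing', 'passing', 'away']],
    'AND12': [['decelerate', 'car_stopping', 'accelerate']],
}
chart = earley_chart(grammar, ['closing', 'decelerate', 'car_stopping', 'accelerate'])
next_labels = predict_next(grammar, chart)
```

For the observed sentence "closing decelerate car_stopping accelerate", the pending state S → closing AND12 · passing away exposes "passing" as the next possible sub-activity, which agrees with the program output shown below for the barrier example.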
In an embodiment of the present invention, using the symbolic prediction algorithm of the Earley parser, the current sentence of sub-activities is used as the input to the Earley parser, and all pending states are scanned to find the next possible terminal node (sub-activity).
For details of the symbolic prediction algorithm of the Earley parser, refer to the description of the related art.
In summary, in the embodiment of the present invention, target activity is represented by a spatio-temporal AND-OR graph (ST-AOG). The ST-AOG is composed of a spatial AND-OR graph (S-AOG) and a temporal AND-OR graph (T-AOG), and may be understood as being constructed by using the root nodes of the S-AOG as the leaf nodes of the T-AOG. The S-AOG represents the state of a scene: it hierarchically represents the spatial relationships among targets through the targets and their attributes, and represents the minimum sub-events (such as the sub-activity labels "person still", "vehicle accelerating", "person and vehicle closing", and the like) through the spatial position relationships obtained by target detection. The root nodes of the S-AOG are sub-activity labels, and its terminal nodes are targets and the relationships between targets. The T-AOG is a stochastic temporal grammar that represents the hierarchical decomposition of an event into sub-events and thereby models target activity; its root node is an activity (event) and its terminal nodes are sub-activities (sub-events).
The learning of the ST-AOG can be decomposed into two main parts: the first part is learning the symbolic grammar structure (T-AOG) of each event/task; the second part is learning the parameters of the ST-AOG, including the branch probabilities of the OR nodes. Further details of the ST-AOG are not described herein.
S5, sending control information to the corresponding device of the parking lot based on the prediction result of the future activity of the attention target.
In an optional implementation manner, when the prediction result meets a preset alarm condition, control information for alarming may be sent to the alarm device in the parking lot to control the alarm device to send an alarm signal.
For example, when the prediction result is a collision, control information may be sent to the alarm device to control the alarm device to send an alarm signal. The alarm signal may comprise a sound and/or light signal or the like.
In an optional implementation manner, when the prediction result indicates that the distance between two targets is smaller than a preset distance value representing a safe distance, control information may be sent to a warning device, such as a broadcasting device, to control the warning device to send a warning signal reminding the targets to avoid a collision.
In an alternative embodiment, when the prediction result indicates that a vehicle is less than a preset distance from the parking lot exit barrier, control information indicating charging may be transmitted to the charging device of the parking lot exit barrier.
Specifically, in such an embodiment, the targets of interest are the vehicle and the parking lot exit barrier. When the prediction result indicates that the distance between the vehicle and the parking lot exit barrier is less than the preset distance, for example, the distance between the vehicle head and the barrier is less than 5 meters, this indicates that the vehicle head is approaching the parking lot exit barrier, that is, the vehicle is about to leave through the parking lot exit. A control message containing the prediction result and a control command of "prepare for charging" may then be generated using the prediction result, and the control message may be transmitted to the charging device of the parking lot exit barrier, so that the charging device, after receiving the control message, can stop the billing timer and perform charging-related work such as verifying license plate information, performing fee settlement, displaying the amount due, and completing charging confirmation.
In an alternative embodiment, when the prediction result indicates that a vehicle is less than a preset distance from the parking lot exit barrier, control information indicating clearance may be transmitted to the control device of the parking lot exit barrier.
Specifically, when the prediction result indicates that the distance between a vehicle and the parking lot exit barrier is less than the preset distance, the vehicle is about to exit the parking lot exit, and control information indicating release can be directly sent to the control device of the parking lot exit barrier, so that the control device directly opens the parking lot exit barrier after receiving the control information to facilitate the vehicle to pass through quickly.
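The control decisions of S5 described in the above embodiments might be sketched as follows; the device names, message fields, and the 5 m threshold are illustrative assumptions, not part of the claimed method.

```python
def barrier_control(prediction, distance_m, threshold_m=5.0):
    """Map a predicted sub-activity and the current head-to-barrier distance
    to control messages for the exit devices. Device names and message
    fields are illustrative only."""
    if distance_m < threshold_m and prediction in ('closing', 'passing'):
        # Vehicle is about to exit: prepare charging and release the gate.
        return [
            {'device': 'charging_unit', 'command': 'prepare_for_charging'},
            {'device': 'barrier_gate', 'command': 'open'},
        ]
    if prediction == 'collision':
        # Predicted collision: trigger the alarm device.
        return [{'device': 'alarm', 'command': 'sound_alarm'}]
    return []

msgs = barrier_control('closing', 3.2)
```

For example, a "closing" prediction at 3.2 m from the barrier yields both a "prepare_for_charging" and an "open" message, corresponding to the charging and release embodiments above.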
Of course, the cases of transmitting control information to the corresponding device of the parking lot based on the prediction result of the future activity of the target of interest are not limited to the above list.
To illustrate the target activity prediction results and effects of the embodiment of the invention, the parking lot barrier is taken as an example below. At a barrier, a speeding vehicle that fails to stop can easily cause an accident. Accurate and fast activity prediction can therefore reduce unnecessary danger and facilitate effective management and control of the parking lot.
According to the video corpus of the parking lot, a T-AOG model (as shown in FIG. 5) is constructed using the above method, and all events in the scene can be found in the T-AOG model. For this scenario, the parking lot barrier is stationary and can serve as a constant target of interest, while the other targets are mainly vehicles and pedestrians approaching the barrier. That is, it is determined through S1 and S2 that there are two targets of interest: the barrier and car2.
Through the sub-activity extraction algorithm of S3, a sub-activity label set, i.e., a sentence representing the sub-events, is obtained.
The sub-activity label set is input into the T-AOG model, that is, the event sentence composed of the combined sub-activities is input as follows:
sentence='closing decelerate car_stopping accelerate'
The next possible sub-activity is then predicted on the T-AOG model by adopting the online symbolic prediction algorithm of the Earley parser.
The program output result may be:
['closing','decelerate','car_stopping','accelerate']
(['passing'],[0.33])
(S closing(AND12 decelerate car_stopping accelerate)passing away)
Time elapsed:3.1240177154541
A prediction parse tree is thus obtained; fig. 6 is a schematic view of the prediction parse tree at the parking lot barrier according to an embodiment of the present invention. In the program output, the first row represents the previously observed event sentence, which is composed of sub-activities, i.e., the current sub-activity label set. The second row represents the predicted string (sub-activity label) and its probability. The last two rows represent the parse tree statement and the prediction time. In the parse tree, the lowermost character "accelerate" represents the observation at the current time, and the characters "passing" and "away" to its right represent the characters predicted from the T-AOG model. That is, the predicted next sub-activities are "passing" and then "away" (moving away after passing through the barrier). In combination with fig. 7, the change of the actual spatial position relationship between the vehicle and the barrier in the video can be seen; fig. 7 is a graph of the change of the actual position of the vehicle at the parking lot barrier in the actual video. Fig. 7 shows that the left vehicle car2 moves away after passing through the barrier in the actual video, so the inter-target sub-activity in the actual video coincides with the sub-activity predicted by the embodiment of the present invention. That is, the prediction result of the embodiment of the present invention is consistent with the changes between targets in the video, and the embodiment of the present invention has good prediction accuracy.
In addition, in the embodiment of the invention, during the sub-activity prediction experiment, the sub-activities of multiple targets in the parking lot are extracted and analyzed, and then compared with the sub-activities in the actual video. The accuracy of the sub-activity results predicted by the activity prediction method herein is evaluated using confusion matrix analysis.
Specifically, a confusion matrix may be used to compare the actual spatial position changes between targets with the detected position changes. As shown in Table 1, the accuracy of sub-activity extraction on the CAD-120 data set by conventional methods, such as an SVM model for target classification detection, a trained two-layer LSTM model, the VGG-16 network of R-CNN, the KGS Markov random field model, and ATCRF, is at most about 87 percent.
TABLE 1 Accuracy comparison of conventional target detection methods in sub-activity extraction

| Method | SVM | LSTM | VGG-16 | KGS | ATCRF |
|---|---|---|---|---|---|
| P/R (%) | 33.4 | 42.3 | - | 83.9 | 87 |
Referring to fig. 8, fig. 8 is a diagram illustrating a confusion matrix of predicted sub-activities and actual sub-activities of a parking lot according to an embodiment of the present invention.
Only a portion of the sub-activity labels are extracted here, including:
person and vehicle moving apart (away), person and vehicle approaching (closing), no person or vehicle (None), person stationary (person_stopping), constant speed (moving_uniform), and vehicle deceleration (decelerate).
As shown in fig. 8, the abscissa represents the true value of the sub-activity and the ordinate represents the predicted value of the sub-activity. It can be calculated from the figure that the predicted sub-activity labels substantially conform to the actual sub-activities. The prediction accuracy can reach about 90 percent, which is higher than that achieved by conventional target detection methods that first obtain the sub-activity labels and then predict. This result shows that the sub-activity prediction in the embodiment of the present invention is very accurate, so the parking lot can be effectively managed and controlled based on the prediction result.
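The accuracy read from a confusion matrix is the diagonal mass divided by the total. A sketch with invented counts (the actual matrix of fig. 8 is not reproduced here) is:

```python
labels = ['away', 'closing', 'None', 'person_stopping', 'moving_uniform', 'decelerate']

# Illustrative counts: rows are actual sub-activities, columns are predicted.
cm = [
    [18, 1, 0, 0, 1, 0],
    [1, 17, 0, 0, 2, 0],
    [0, 0, 20, 0, 0, 0],
    [0, 0, 0, 19, 1, 0],
    [1, 1, 0, 0, 18, 0],
    [0, 0, 0, 0, 2, 18],
]

total = sum(sum(row) for row in cm)                    # all samples
correct = sum(cm[i][i] for i in range(len(cm)))        # diagonal hits
accuracy = correct / total                             # overall accuracy
recall = {labels[i]: cm[i][i] / sum(cm[i]) for i in range(len(cm))}
```

With these invented counts, 110 of 120 samples lie on the diagonal, giving an overall accuracy near 92 percent, which is how a figure of "about 90%" would be read off a matrix like fig. 8.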
In the scheme provided by the embodiment of the present invention, the spatio-temporal AND-OR graph is introduced into the field of target activity prediction for the first time. First, target detection and tracking are performed on a scene video of a parking lot to generate a spatial AND-OR graph model of the parking lot, and the spatial AND-OR graph is used to represent the spatial position relationships between targets. Second, sub-activity extraction is performed on the spatial AND-OR graph model to obtain a sub-activity label set of the targets of interest, realizing high-level semantic extraction from the scene video. The sub-activity label set is then used as the input of a pre-obtained temporal AND-OR graph model, and the prediction of the next sub-activity is obtained through the temporal grammar of the temporal AND-OR graph. Finally, control information is sent to the corresponding device of the parking lot using the prediction result, realizing management and control of the parking lot. According to the embodiment of the present invention, the accuracy and real-time performance of target activity prediction can be improved by using the spatio-temporal AND-OR graph, so that target activity in the parking lot can be predicted accurately and rapidly, achieving effective management and control of the parking lot.
In a second aspect, corresponding to the foregoing method embodiment, an embodiment of the present invention further provides a target activity prediction apparatus based on a scene, as shown in fig. 9, where the apparatus includes:
a scene video acquiring module 901, configured to acquire a scene video for a parking lot;
a space and or graph model generation module 902, configured to generate a space and or graph model of the parking lot by detecting and tracking a target in the scene video; the space and OR graph model represents the space position relation of the target in the scene video;
a sub-activity extraction module 903, configured to obtain a sub-activity label set representing an activity state of the attention target by using a sub-activity extraction algorithm on the spatial and or graph model;
a target activity prediction module 904, configured to input the sub-activity label set into a pre-obtained time AND-OR graph model, so as to obtain a prediction result of a future activity of the concerned target in the parking lot; the time AND-OR graph model is obtained by utilizing a pre-established activity corpus of targets of the parking lot;
a control information sending module 905, configured to send control information to a corresponding device of the parking lot based on a prediction result of the future activity of the target of interest.
Optionally, the space and or graph model generating module 902 includes:
the target detection submodule is used for detecting the targets in the scene video by using a target detection network obtained by pre-training to obtain attribute information corresponding to each target in each frame of image of the scene video; wherein the attribute information includes position information of a bounding box containing the target;
the target tracking submodule is used for matching the same target in each frame of image of the scene video by utilizing a preset multi-target tracking algorithm based on the attribute information corresponding to each target in each frame of image;
the distance calculation submodule is used for determining the actual spatial distance between different targets in each frame of image;
and the model generation submodule is used for generating a space AND-OR graph model of the parking lot by utilizing the attribute information of the targets corresponding to each matched frame image and the actual spatial distances.
Optionally, the target detection network comprises a YOLO_v3 network; the preset multi-target tracking algorithm comprises the DeepSORT algorithm.
Optionally, the distance calculation submodule is specifically configured to:
in each frame image, determining the pixel coordinate of each target;
aiming at each target, calculating the corresponding actual coordinate of the pixel coordinate of the target in a world coordinate system by using a monocular vision positioning and ranging technology;
and aiming at each frame image, obtaining the actual space distance between every two targets in the frame image by using the actual coordinates of every two targets in the frame image.
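A minimal sketch of the pixel-to-world step performed by the distance calculation submodule is given below, assuming a 3x3 ground-plane homography H has been calibrated offline, which is one common way of realizing monocular positioning and ranging; the calibration procedure itself is not specified here, and the identity homography is used only for illustration.

```python
def pixel_to_world(H, u, v):
    """Map a pixel (u, v) to ground-plane world coordinates using a 3x3
    homography H calibrated offline (assumed; the embodiment does not
    fix the calibration method)."""
    x = H[0][0] * u + H[0][1] * v + H[0][2]
    y = H[1][0] * u + H[1][1] * v + H[1][2]
    w = H[2][0] * u + H[2][1] * v + H[2][2]
    return x / w, y / w

def actual_distance(H, p1, p2):
    """Actual spatial distance between two targets, taking e.g. the
    bottom-centre pixel of each bounding box as its ground contact point."""
    x1, y1 = pixel_to_world(H, *p1)
    x2, y2 = pixel_to_world(H, *p2)
    return ((x1 - x2) ** 2 + (y1 - y2) ** 2) ** 0.5

# Identity homography for illustration: pixel units map directly to metres.
H = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
d = actual_distance(H, (0, 0), (3, 4))
```

With the identity homography, the two illustrative points are 5 units apart; a calibrated H would instead yield metric coordinates on the parking lot ground plane.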
Optionally, the sub-activity extracting module 903 is specifically configured to:
determining paired targets, of which the actual spatial distance in the space AND-OR graph model is smaller than a preset distance threshold value, as attention targets;
determining the actual space distance of each pair of attention targets and the speed value of each attention target aiming at each frame image;
obtaining distance change information representing the actual space distance change condition of each pair of attention targets and speed change information representing the speed value change condition of each attention target by sequentially comparing the next frame image with the previous frame image;
and describing distance change information and speed change information which are sequentially obtained by each concerned target by utilizing the semantic tags, and generating a sub-activity tag set representing the activity state of each concerned target.
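The frame-to-frame labeling performed by the sub-activity extraction module can be sketched as follows; the thresholds and the "keeping" label are illustrative assumptions, while the other label names ("closing", "away", "car_stopping", "accelerate", "decelerate", "moving_uniform") follow the text.

```python
def relation_label(d_prev, d_curr, threshold=0.5):
    """Label the distance change between a pair of attention targets across
    consecutive frames; threshold (in metres, assumed) suppresses jitter."""
    if d_curr < d_prev - threshold:
        return 'closing'
    if d_curr > d_prev + threshold:
        return 'away'
    return 'keeping'   # illustrative label for "no significant change"

def motion_label(v_prev, v_curr, still=0.2, delta=0.5):
    """Label one target's speed change (m/s) across consecutive frames."""
    if v_curr < still:
        return 'car_stopping'
    if v_curr > v_prev + delta:
        return 'accelerate'
    if v_curr < v_prev - delta:
        return 'decelerate'
    return 'moving_uniform'

# Illustrative per-frame distances for one pair of attention targets.
distances = [12.0, 9.5, 7.0, 6.8]
frame_labels = [relation_label(a, b) for a, b in zip(distances, distances[1:])]
```

Comparing each frame with the previous one in this way turns the raw distance and speed sequences into the semantic sub-activity label set fed to the T-AOG.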
Optionally, the target activity prediction module 904 is specifically configured to:
and inputting the sub-activity label set into the time AND-OR graph model, and obtaining a prediction result of the future activity of the concerned target in the parking lot by using an online symbolic prediction algorithm of an Earley parser, wherein the prediction result comprises the future sub-activity label of the concerned target and its occurrence probability value.
Optionally, the control information sending module 905 is specifically configured to:
when the prediction result indicates that a vehicle is less than a preset distance from the parking lot exit barrier, control information indicating charging is transmitted to the charging device of the parking lot exit barrier.
For the specific execution process of each module, please refer to the method steps of the first aspect, which are not described herein again.
In the scheme provided by the embodiment of the present invention, the spatio-temporal AND-OR graph is introduced into the field of target activity prediction for the first time. First, target detection and tracking are performed on a scene video of a parking lot to generate a spatial AND-OR graph model of the parking lot, and the spatial AND-OR graph is used to represent the spatial position relationships between targets. Second, sub-activity extraction is performed on the spatial AND-OR graph model to obtain a sub-activity label set of the targets of interest, realizing high-level semantic extraction from the scene video. The sub-activity label set is then used as the input of a pre-obtained temporal AND-OR graph model, and the prediction of the next sub-activity is obtained through the temporal grammar of the temporal AND-OR graph. Finally, control information is sent to the corresponding device of the parking lot using the prediction result, realizing management and control of the parking lot. According to the embodiment of the present invention, the accuracy and real-time performance of target activity prediction can be improved by using the spatio-temporal AND-OR graph, so that target activity in the parking lot can be predicted accurately and rapidly, achieving effective management and control of the parking lot.
In a third aspect, an embodiment of the present invention further provides an electronic device, as shown in fig. 10, including a processor 1001, a communication interface 1002, a memory 1003 and a communication bus 1004, where the processor 1001, the communication interface 1002 and the memory 1003 complete communication with each other through the communication bus 1004,
a memory 1003 for storing a computer program;
the processor 1001 is configured to implement the steps of the parking lot management and control method based on the target activity prediction according to the first aspect when executing the program stored in the memory 1003.
The electronic device may be: desktop computers, laptop computers, intelligent mobile terminals, servers, and the like. Without limitation, any electronic device that can implement the present invention is within the scope of the present invention.
The communication bus mentioned in the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the electronic equipment and other equipment.
The Memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The Processor may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; but also Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components.
In a fourth aspect, corresponding to the parking lot management and control method based on target activity prediction provided in the first aspect, an embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the parking lot management and control method based on target activity prediction provided in the embodiment of the present invention are implemented.
The computer-readable storage medium stores an application program that executes the parking lot management and control method based on target activity prediction provided by the embodiment of the present invention when executed.
For the apparatus/electronic device/storage medium embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to part of the description of the method embodiment.
It should be noted that the apparatus, the electronic device, and the storage medium according to the embodiments of the present invention are respectively an apparatus, an electronic device, and a storage medium that apply the parking lot management and control method based on target activity prediction, and all embodiments of the parking lot management and control method based on target activity prediction are applicable to the apparatus, the electronic device, and the storage medium, and can achieve the same or similar beneficial effects.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.
Claims (10)
1. A parking lot management and control method based on target activity prediction is characterized by comprising the following steps:
acquiring a scene video aiming at a parking lot;
generating a space and or map model of the parking lot by detecting and tracking a target in the scene video; wherein the spatial and OR graph model represents the spatial position relation of the target in the scene video;
obtaining a sub-activity label set representing the activity state of the attention target by using a sub-activity extraction algorithm on the space and or graph model;
inputting the sub-activity label set into a time and OR graph model obtained in advance to obtain a prediction result of the future activity of the concerned target in the parking lot; wherein the time and OR graph model is obtained by utilizing a pre-established activity corpus of targets of the parking lot;
and sending control information to corresponding equipment of the parking lot based on the prediction result of the future activity of the attention target.
2. The method of claim 1, wherein generating a spatial and OR map model of the parking lot by detecting and tracking objects in the scene video comprises:
detecting the targets in the scene video by using a target detection network obtained by pre-training to obtain attribute information corresponding to each target in each frame of image of the scene video; wherein the attribute information includes position information of a bounding box containing the target;
matching the same target in each frame of image of the scene video by utilizing a preset multi-target tracking algorithm based on the attribute information corresponding to each target in each frame of image;
determining the actual space distance between different targets in each frame of image;
and generating a space and OR graph model of the parking lot by using the attribute information of the targets corresponding to each matched frame image and the actual spatial distances.
3. The method of claim 2,
the target detection network comprises a YOLO_v3 network; the preset multi-target tracking algorithm comprises a DeepSORT algorithm.
4. The method according to claim 2 or 3, wherein the determining the actual spatial distance between different objects in each frame of image comprises:
in each frame image, determining the pixel coordinate of each target;
aiming at each target, calculating the corresponding actual coordinate of the pixel coordinate of the target in a world coordinate system by using a monocular vision positioning and ranging technology;
and aiming at each frame image, obtaining the actual space distance between every two targets in the frame image by using the actual coordinates of every two targets in the frame image.
5. The method according to claim 4, wherein the obtaining of the sub-activity label set of the activity state characterizing the object of interest by using a sub-activity extraction algorithm on the spatial and OR graph model comprises:
determining the paired targets in the space and OR graph model, of which the actual spatial distance is smaller than a preset distance threshold value, as the attention targets;
determining the actual space distance of each pair of the attention targets and the speed value of each attention target aiming at each frame image;
obtaining distance change information representing the actual space distance change condition of each pair of the attention targets and speed change information representing the speed value change condition of each attention target by sequentially comparing the next frame image with the previous frame image;
and describing the distance change information and the speed change information which are sequentially obtained by each concerned target by utilizing semantic tags, and generating a sub-activity tag set representing the activity state of each concerned target.
6. The method of claim 5, wherein said entering said sub-activity tag set into a pre-derived time-and-or model to obtain a prediction of said target-of-interest future activity in said parking lot comprises:
and inputting the sub-activity label set into the time and OR graph model, and obtaining a prediction result of the future activity of the concerned target in the parking lot by using an online symbolic prediction algorithm of an Earley parser, wherein the prediction result comprises the future sub-activity label of the concerned target and the occurrence probability value.
7. The method of claim 1 or 6, wherein said sending control information to a corresponding device of the parking lot based on the predicted outcome of the future activity of the target of interest comprises:
and when the prediction result shows that the distance between a vehicle and the parking lot exit barrier is less than the preset distance, sending control information indicating charging to a charging device of the parking lot exit barrier.
8. A parking lot management and control apparatus based on target activity prediction, comprising:
a scene video acquisition module, configured to acquire a scene video of the parking lot;
a spatial And-Or graph model generation module, configured to detect and track targets in the scene video and generate a spatial And-Or graph model of the parking lot, wherein the spatial And-Or graph model represents the spatial position relationships of the targets in the scene video;
a sub-activity extraction module, configured to apply a sub-activity extraction algorithm to the spatial And-Or graph model to obtain a sub-activity label set representing the activity state of a target of interest;
a target activity prediction module, configured to input the sub-activity label set into a pre-derived temporal And-Or graph model to obtain a prediction result of the future activity of the target of interest in the parking lot, wherein the temporal And-Or graph model is derived from a pre-established corpus of target activities in the parking lot;
and a control information sending module, configured to send control information to a corresponding device of the parking lot based on the prediction result of the future activity of the target of interest.
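The module decomposition of claim 8 can be wired as a simple pipeline; the class and callable names here are placeholders for the claimed components, not an implementation from the patent:

```python
class ParkingLotController:
    """Minimal wiring of the claimed modules: scene video in, spatial
    And-Or graph generation, sub-activity extraction, temporal prediction,
    and control-information dispatch. Each stage is injected as a callable
    so the sketch stays independent of any concrete model."""

    def __init__(self, detect, extract, predict, dispatch):
        self.detect, self.extract = detect, extract
        self.predict, self.dispatch = predict, dispatch

    def step(self, frame_pair):
        graph = self.detect(frame_pair)    # spatial And-Or graph model
        labels = self.extract(graph)       # sub-activity label set
        prediction = self.predict(labels)  # future activity + probability
        return self.dispatch(prediction)   # control info for devices
```

Injecting the stages as callables keeps each claimed module independently replaceable and testable.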
9. An electronic device, comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other via the communication bus;
the memory is configured to store a computer program;
and the processor, when executing the program stored in the memory, implements the method steps of any one of claims 1-7.
10. A computer-readable storage medium, wherein the computer-readable storage medium has stored therein a computer program which, when executed by a processor, carries out the method steps of any one of claims 1 to 7.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202011570369.8A CN112613418A (en) | 2020-12-26 | 2020-12-26 | Parking lot management and control method and device based on target activity prediction and electronic equipment |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN112613418A true CN112613418A (en) | 2021-04-06 |
Family
ID=75247967
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202011570369.8A Withdrawn CN112613418A (en) | 2020-12-26 | 2020-12-26 | Parking lot management and control method and device based on target activity prediction and electronic equipment |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN112613418A (en) |
- 2020-12-26: CN202011570369.8A filed in China; published as CN112613418A; status: Withdrawn
Cited By (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN113639639A (en) * | 2021-08-31 | 2021-11-12 | 追觅创新科技(苏州)有限公司 | Data processing method and device for position data and storage medium |
| CN114782500A (en) * | 2022-04-22 | 2022-07-22 | 西安理工大学 | Kart race behavior analysis method based on multi-target tracking |
| CN116311351A (en) * | 2023-01-29 | 2023-06-23 | 山东大学 | Method and system for indoor human target recognition and monocular distance measurement |
| CN117437599A (en) * | 2023-12-18 | 2024-01-23 | 暗物智能科技(广州)有限公司 | Pedestrian abnormal event detection method and system for monitoring scenarios |
| CN117437599B (en) * | 2023-12-18 | 2024-04-19 | 暗物智能科技(广州)有限公司 | Pedestrian abnormal event detection method and system for monitoring scene |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US12266223B2 (en) | Parking lot management and control method based on object activity prediction, and electronic device | |
| CN112634329B (en) | Scene target activity prediction method and device based on spatio-temporal And-Or graph | |
| Chen et al. | Real time object detection, tracking, and distance and motion estimation based on deep learning: Application to smart mobility | |
| CN112613418A (en) | Parking lot management and control method and device based on target activity prediction and electronic equipment | |
| Gao et al. | Distributed mean-field-type filters for traffic networks | |
| CN112634369A (en) | Spatial And-Or graph model generation method and device, electronic device, and storage medium | |
| CN112613668A (en) | Scenic spot dangerous area management and control method based on artificial intelligence | |
| CN112634368A (en) | Method and device for generating a spatial And-Or graph model of a scene target, and electronic device | |
| CN114927236A (en) | A detection method and system for multiple target images | |
| Mijić et al. | Traffic sign detection using YOLOv3 | |
| Viraktamath et al. | Comparison of YOLOv3 and SSD algorithms | |
| Yang et al. | DPCIAN: A novel dual-channel pedestrian crossing intention anticipation network | |
| CN112635045A (en) | Intelligent monitoring system | |
| Radiuk et al. | Convolutional neural network for parking slots detection | |
| Ying et al. | Multi-object tracking via MHT with multiple information fusion in surveillance video | |
| Zhang et al. | A small target pedestrian detection model based on autonomous driving | |
| CN112734835A (en) | High-speed intersection control method and device based on vehicle activity prediction | |
| Micko et al. | Motion detection methods applied on rgb-d images for vehicle classification on the edge computing | |
| Zarei et al. | Real‐time vehicle detection using segmentation‐based detection network and trajectory prediction | |
| Kanaujia et al. | Complex events recognition under uncertainty in a sensor network | |
| Dixit et al. | Analysis of performance of YOLOv8 algorithm for pedestrian detection | |
| CN112613419A (en) | Wisdom education is with study monitor system | |
| Mate et al. | Animal Detection for Vehicle | |
| CN118097785B (en) | Human body posture analysis method and system | |
| US20250191364A1 (en) | Anomaly detection system for video surveillance |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | WW01 | Invention patent application withdrawn after publication | Application publication date: 20210406 |




