This application is a divisional of the Chinese patent application with application number 202311742651.3, filed on December 15, 2023, which relates to a human body fall detection method, apparatus, device and medium combining depth measurement.
Disclosure of Invention
In view of the above, the present invention provides a method, apparatus, device and medium for detecting a ground area in a video monitoring scene, which are used to solve the problem that the ground area cannot be detected accurately in the prior art.
The technical scheme adopted by the invention is as follows:
In a first aspect, the present invention provides a method for detecting a ground area in a video surveillance scene, which is characterized in that the method includes:
Inputting a real-time image obtained by decomposing a real-time video stream in a video monitoring scene into a pre-trained human body target detection model, and identifying whether a human body target appears in the current monitoring scene;
monitoring rotation of the camera by using a visual odometry method, and identifying whether the current monitoring view angle has rotated by comparing pose changes between adjacent frames;
If no human body target is identified in the current monitoring scene and the current monitoring view angle has not rotated, performing filtering processing and color space conversion on the real-time image to obtain a color space image;
Extracting features of the color space image, and fusing the extracted local features and global features to obtain multi-scale features;
According to the multi-scale features, obtaining the probability that each pixel point belongs to the ground area to form a probability map; acquiring a preset probability threshold, and converting the probability map into a binary segmentation map;
And carrying out morphological processing on the binary segmentation map, and taking the pixel point position information in the ground area as the ground area position information.
Preferably, if the human body target is not identified in the current monitoring scene and the current monitoring view angle is not rotated, performing image segmentation on the real-time image to obtain the ground area position information, and further including:
inputting each real-time image into a pre-trained human body detection model, and outputting human body region position information;
performing depth analysis on the human body region position information and the ground region position information to obtain a ground region average depth sequence and a human body region average depth sequence corresponding to a plurality of frames of time sequence images;
And carrying out depth difference calculation on the human body region average depth sequence and the ground region average depth sequence, and detecting human body falling according to the depth difference sequence.
Preferably, the performing depth analysis on the human body region position information and the ground region position information, and obtaining a ground region average depth sequence and a human body region average depth sequence corresponding to the multi-frame time sequence image includes:
Acquiring a training image with a depth label in a video monitoring scene, and inputting the training image into a deep learning network to obtain a training model;
Inputting the real-time image into the training model for forward propagation to obtain depth values corresponding to all pixel points in the real-time image;
According to the human body region position information and the ground region position information, combining depth values corresponding to all pixel points in the real-time image to obtain a human body region depth value sequence and a ground region depth value sequence corresponding to a multi-frame time sequence image;
And obtaining the ground area average depth sequence and the human area average depth sequence according to the human area depth value sequence and the ground area depth value sequence.
Preferably, the step of obtaining the sequence of depth values of the human body region and the sequence of depth values of the ground region corresponding to the multi-frame time sequence image according to the position information of the human body region and the position information of the ground region by combining the depth values corresponding to all the pixel points in the real-time image includes:
according to the human body region position information, outputting depth values corresponding to pixel points in a human body region to the human body region depth value sequence;
and outputting depth values corresponding to the pixel points in the ground area to the ground area depth value sequence according to the ground area position information.
Preferably, the obtaining the ground area average depth sequence and the human area average depth sequence according to the human area depth value sequence and the ground area depth value sequence includes:
acquiring a first pixel point sequence in a human body area and a second pixel point sequence in a ground area according to the human body area position information and the ground area position information;
obtaining a first sliding window and a second sliding window according to the first pixel point sequence and the second pixel point sequence;
obtaining a first filtering depth value sequence and a second filtering depth value sequence according to the first sliding window and the second sliding window;
And carrying out sliding window processing on the human body region depth value sequence and the ground region depth value sequence according to the first filtering depth value sequence and the second filtering depth value sequence to obtain the ground region average depth and the human body region average depth.
Preferably, the calculating the depth difference between the average depth sequence of the human body region and the average depth sequence of the ground region, and detecting the human body fall according to the depth difference sequence includes:
Acquiring a preset depth difference threshold;
Sequentially comparing each depth difference in the depth difference sequence with the depth difference threshold value to obtain a comparison result;
when the comparison results corresponding to consecutive multi-frame images all show that the depth difference is smaller than the depth difference threshold, the human body fall result is that the human body has fallen to the ground.
Preferably, if the human body target is not identified in the current monitoring scene and the current monitoring view angle is not rotated, performing filtering processing and color space conversion on the real-time image, and before obtaining the color space image, including:
Acquiring a pre-trained human body target detection model based on the YOLOv5s architecture;
Inputting the real-time image into the human body target detection model, and identifying whether a human body target appears in the current monitoring scene;
And monitoring rotation of the camera by using a visual odometry method, and identifying whether the current monitoring view angle has rotated.
In a second aspect, the present invention provides a ground area detection device in a video surveillance scene, which is characterized in that the device includes:
the human body target detection module is used for inputting a real-time image obtained by decomposing a real-time video stream in a video monitoring scene into a pre-trained human body target detection model and identifying whether a human body target appears in the current monitoring scene;
the view angle rotation monitoring module is used for monitoring rotation of the camera by using a visual odometry method, and identifying whether the current monitoring view angle has rotated by comparing pose changes between adjacent frames;
The color space image acquisition module is used for carrying out filtering processing and color space conversion on the real-time image to obtain a color space image if a human body target is not identified in the current monitoring scene and the current monitoring visual angle is not rotated;
The feature extraction module is used for extracting features of the color space image, and fusing the extracted local features and global features to obtain multi-scale features;
the binary segmentation map acquisition module is used for obtaining, according to the multi-scale features, the probability that each pixel point belongs to the ground area to form a probability map, acquiring a preset probability threshold, and converting the probability map into a binary segmentation map;
The ground area position information acquisition module is used for carrying out morphological processing on the binary segmentation map and taking the pixel point position information in the ground area as the ground area position information.
In a third aspect, the present embodiment also provides an electronic device comprising at least one processor, at least one memory and computer program instructions stored in the memory, which when executed by the processor, implement the method of the first aspect as in the above embodiments.
In a fourth aspect, embodiments of the present invention also provide a storage medium having stored thereon computer program instructions which, when executed by a processor, implement a method as in the first aspect of the embodiments described above.
In summary, the beneficial effects of the invention are as follows:
The method comprises: inputting a real-time image, obtained by decomposing a real-time video stream in a video monitoring scene, into a pre-trained human body target detection model, and identifying whether a human body target appears in the current monitoring scene; monitoring rotation of the camera by a visual odometry method, and identifying whether the current monitoring view angle has rotated by comparing pose changes between adjacent frames; if no human body target is identified in the current monitoring scene and the current monitoring view angle has not rotated, performing filtering processing and color space conversion on the real-time image to obtain a color space image; performing feature extraction on the color space image, and fusing the extracted local features with global features to obtain multi-scale features; obtaining, according to the multi-scale features, the probability that each pixel point belongs to the ground area, acquiring a preset probability threshold, and converting the probability map into a binary segmentation map; and performing morphological processing on the binary segmentation map, taking the pixel point position information in the ground area as the ground area position information.
According to the invention, the accuracy of ground area detection is effectively improved by combining human body target detection with visual odometry monitoring. When no human body target is detected and the view angle of the monitoring camera remains stable, filtering and color space conversion are performed on the real-time image, which reduces noise interference and enhances the usable information of the image. Multi-scale features are then extracted by fusing local and global features, capturing the complex details of the ground area; these multi-scale features make the probability estimation of ground-area pixels more accurate and, combined with morphological processing, allow the ground area to be segmented clearly and misjudgments to be reduced. The method is particularly suitable for complex monitoring scenes, provides stable and efficient ground area identification, and ensures accurate positioning of key areas in the monitoring scene.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application. It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. In the description of the present application, it should be understood that the terms "center," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like indicate orientations or positional relationships based on the orientation or positional relationships shown in the drawings, merely to facilitate describing the present application and simplify the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present application. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising" does not exclude the presence of additional identical elements in a process, method, article, or apparatus that comprises the element. 
Where no conflict arises, the embodiments of the present application and the features of the embodiments may be combined with each other, and all such combinations fall within the protection scope of the present application.
Example 1
Referring to fig. 1, embodiment 1 of the invention discloses a ground area detection method in a video monitoring scene, which comprises the following steps:
s1, acquiring a real-time video stream in a video monitoring scene, and decomposing the real-time video stream into multi-frame real-time images;
Specifically, a real-time video stream captured by an obliquely installed monocular camera in the video monitoring scene is obtained, and the obtained video stream is decoded and converted into an image sequence. This step is performed with a common video codec (such as H.264 or H.265): the decoded video stream is decomposed into multiple frames of images, each frame representing one moment of the video stream, and the image sequence is composed of these frames. A monocular camera is a camera with only one optical lens, through which image information of a single planar view can be obtained; common monocular cameras include ordinary digital cameras, network cameras, industrial cameras and the like. On the one hand, a vertically installed monocular camera has a limited field of view, and installing it obliquely enlarges the field of view and improves the monitoring effect; on the other hand, a vertically installed camera is easily noticed, whereas an obliquely installed camera is more concealed, which improves security.
In one embodiment, please refer to fig. 2, the step S2 includes:
S201, acquiring a pre-trained human body target detection model based on the YOLOv5s architecture;
S202, inputting the real-time image into the human body target detection model, and identifying whether a human body target appears in a current monitoring scene;
And S203, monitoring rotation of the camera by using a visual odometry method, and identifying whether the current monitoring view angle has rotated.
Specifically, a machine learning or deep learning algorithm, such as the YOLOv5s algorithm, is used to build a human body detection model that learns from a large amount of labeled training data to detect whether a human body exists in the current video monitoring scene, and a visual odometry method is used to judge whether the camera has rotated by comparing pose changes between adjacent frames. When it is judged that no human body is detected in the current video monitoring scene and the camera is not rotating, the preset starting condition is considered to be satisfied, and segmentation of the ground area in the image begins.
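As an illustrative sketch (not part of the original disclosure), the gating logic of steps S202–S203 can be expressed as below. Real visual odometry estimates inter-frame pose from matched features; here the pose-change check is replaced with a simplified mean frame-difference proxy, and the function names and `diff_thresh` value are assumptions:

```python
import numpy as np

def view_is_stable(prev_frame: np.ndarray, curr_frame: np.ndarray,
                   diff_thresh: float = 10.0) -> bool:
    """Crude stand-in for the visual odometry check: if the mean absolute
    grayscale difference between adjacent frames is small, treat the
    monitoring view angle as not having rotated. A real implementation
    would estimate the inter-frame pose change from matched features."""
    diff = np.abs(prev_frame.astype(np.float32) - curr_frame.astype(np.float32))
    return float(diff.mean()) < diff_thresh

def segmentation_enabled(human_detected: bool,
                         prev_frame: np.ndarray,
                         curr_frame: np.ndarray) -> bool:
    # Ground segmentation starts only when no human target is detected
    # AND the camera view is stable (the preset starting condition).
    return (not human_detected) and view_is_stable(prev_frame, curr_frame)
```

The gate short-circuits: if a human target is present, the (possibly costly) view-stability check is skipped entirely.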
S2, if no human body target is identified in the current monitoring scene and the current monitoring visual angle is not rotated, image segmentation is carried out on the real-time image to obtain ground area position information;
Specifically, a starting condition is preset: no human body is detected in the current video monitoring scene and the camera is not rotating. The ground area position information in the monocular image is segmented only when this preset starting condition is satisfied. On the one hand, this avoids interference from human body information in the monitoring scene with the segmented ground area; on the other hand, when the camera is not rotating the monitoring scene does not change much, whereas a human body fall causes a drastic change in the monitoring scene, so performing image segmentation only while the camera is stationary avoids unnecessary waste of computing resources.
In one embodiment, referring to fig. 3, the step S2 includes:
S21, if no human body target is identified in the current monitoring scene and the current monitoring view angle has not rotated, performing filtering processing and color space conversion on the real-time image to obtain a color space image, wherein the real-time image is a monocular image;
Specifically, the monocular image is obtained and the input color image is preprocessed. Noise that may exist in the image affects the segmentation result, so Gaussian filtering is used to eliminate noise in the image. The preprocessed color image is then converted into the HSV color space. The HSV color space is chosen because it decomposes color information into three components, hue, saturation and value (brightness), and the hue component corresponds to the human eye's perception of color, so elements of different colors can be distinguished more accurately, which helps distinguish ground elements from non-ground elements.
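The preprocessing of step S21 can be sketched as follows, using a separable Gaussian blur and the standard library's per-pixel RGB-to-HSV conversion; `sigma` and the kernel radius are illustrative choices, not values from the source:

```python
import colorsys
import numpy as np

def gaussian_kernel1d(sigma: float, radius: int) -> np.ndarray:
    """Normalized 1-D Gaussian kernel of length 2*radius + 1."""
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    k = np.exp(-(x ** 2) / (2 * sigma ** 2))
    return k / k.sum()

def gaussian_filter(channel: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    """Separable Gaussian blur on one image channel (the noise-suppression
    step before segmentation). Zero padding is implied at the borders."""
    k = gaussian_kernel1d(sigma, radius=int(3 * sigma))
    blur_rows = np.apply_along_axis(np.convolve, 1, channel, k, mode="same")
    return np.apply_along_axis(np.convolve, 0, blur_rows, k, mode="same")

def rgb_to_hsv_image(rgb: np.ndarray) -> np.ndarray:
    """Per-pixel RGB -> HSV conversion; rgb is an (H, W, 3) float array
    with components in [0, 1]."""
    out = np.empty_like(rgb)
    for i in range(rgb.shape[0]):
        for j in range(rgb.shape[1]):
            out[i, j] = colorsys.rgb_to_hsv(*rgb[i, j])
    return out
```

A production pipeline would typically use `cv2.GaussianBlur` and `cv2.cvtColor(img, cv2.COLOR_BGR2HSV)` instead; the sketch above only illustrates the two operations named in the text.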
S22, extracting features of the color space image, and fusing the extracted local features and global features to obtain multi-scale features;
Specifically, the converted color space image is subjected to multi-scale segmentation to obtain a plurality of sub-regions, and local features are extracted from each sub-region using the LBP operator. The LBP (Local Binary Pattern) operator is a texture feature extraction algorithm that compares the neighboring pixels around a pixel with that pixel to obtain a binary code describing the texture around it. For the entire image, global features are extracted using a pre-trained ResNet model; ResNet is a deep convolutional neural network capable of high-level feature extraction and classification of images. The extracted local features and global features are fused through an MLFN feature fusion network to construct a multi-scale feature representation. Through multi-scale segmentation and application of the LBP operator, sub-regions with distinctive local texture features can be extracted, enhancing feature discriminability. Meanwhile, extracting global features with the ResNet model captures the context information of the whole image and improves classification accuracy, and fusing the local and global features further improves the expressive capacity and classification performance of the features.
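The local-feature step can be illustrated with a minimal 8-neighbour LBP implementation (the ResNet global features and the MLFN fusion are not sketched here; this covers only the LBP encoding described above):

```python
import numpy as np

def lbp_image(gray: np.ndarray) -> np.ndarray:
    """Basic 8-neighbour Local Binary Pattern: each interior pixel is
    encoded by comparing its 8 neighbours against it, yielding an 8-bit
    texture code. Border pixels are dropped for simplicity."""
    h, w = gray.shape
    codes = np.zeros((h - 2, w - 2), dtype=np.uint8)
    # clockwise neighbour offsets starting at the top-left
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    center = gray[1:-1, 1:-1]
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = gray[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        codes |= (neighbour >= center).astype(np.uint8) << bit
    return codes
```

A flat (uniform) region yields the code 255 everywhere, while a bright center surrounded by darker pixels yields 0, which is what makes the codes useful as local texture descriptors.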
S23, according to the multi-scale characteristics, obtaining the probability of each pixel point in the ground area to obtain a probability map;
s24, acquiring a preset probability threshold value, and converting the probability map into a binary segmentation map;
Specifically, the local features and global features are concatenated to form a feature vector for each pixel point, and a classifier is trained on these feature vectors to classify each pixel point into two classes, ground element and non-ground element. After training, the classifier is applied to each pixel point to obtain a probability map. The probability map represents the probability that each pixel belongs to a ground element, with values ranging from 0 to 1. The Otsu method is then used to determine a suitable probability threshold. The Otsu method is a histogram-based adaptive threshold selection method that automatically determines an optimal threshold dividing the image into two parts: it treats the gray values of the image as a probability distribution and finds the threshold that minimizes the weighted intra-class variance (equivalently, maximizes the between-class variance). Here, the probability map is treated as a gray-scale image, an optimal probability threshold is determined by the Otsu method, and the probability map is converted into a binary segmentation map in which each pixel of the image is classified as either a ground element or a non-ground element. With this probability-map approach, the probability that a pixel is a ground element is computed from features such as its color and texture, which avoids the inaccurate segmentation caused in traditional methods by changes in those features; at the same time, determining the probability threshold adaptively with the Otsu method makes the segmentation result more accurate.
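A minimal sketch of steps S23–S24: the probability map is treated as a gray-scale image, an Otsu threshold is computed from its histogram, and the map is binarized. The bin count is an assumed parameter:

```python
import numpy as np

def otsu_threshold(values: np.ndarray, bins: int = 256) -> float:
    """Otsu's method: sweep candidate thresholds and keep the one that
    maximizes the between-class variance (equivalently minimizes the
    weighted intra-class variance). Input values lie in [0, 1]."""
    hist, edges = np.histogram(values.ravel(), bins=bins, range=(0.0, 1.0))
    p = hist.astype(np.float64) / hist.sum()
    centers = (edges[:-1] + edges[1:]) / 2
    mu_total = (p * centers).sum()
    best_t, best_var = 0.0, -1.0
    w0, mu0_sum = 0.0, 0.0
    for i in range(bins - 1):
        w0 += p[i]
        mu0_sum += p[i] * centers[i]
        w1 = 1.0 - w0
        if w0 == 0.0 or w1 == 0.0:
            continue
        mu0 = mu0_sum / w0
        mu1 = (mu_total - mu0_sum) / w1
        between = w0 * w1 * (mu0 - mu1) ** 2
        if between > best_var:
            best_var, best_t = between, edges[i + 1]
    return best_t

def binarize(prob_map: np.ndarray, thresh: float) -> np.ndarray:
    # ground element -> 1, non-ground element -> 0
    return (prob_map >= thresh).astype(np.uint8)
```

On a bimodal probability map (ground pixels near 1, non-ground near 0) the sweep lands between the two modes, which is exactly the adaptive behaviour the text relies on.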
And S25, carrying out morphological processing on the binary segmentation map, and taking the pixel point position information in the ground area as the ground area position information.
Specifically, morphological operations (opening, closing, etc.) are performed on the binary segmentation map. The opening operation eliminates small noise and thin connections, yielding a more accurate segmentation result, while the closing operation fills holes and cracks inside objects, making them more complete and connected. These operations improve the accuracy of subsequent processing. The obtained ground area pixels are stored in coordinate order to give the ground area position information P(p1, p2, p3, ..., pn), where p1 to pn represent the coordinate information of the first to the nth pixel point in the camera coordinate system of the video camera.
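Step S25 can be sketched with numpy-only binary morphology (a 3×3 structuring element is assumed): opening removes isolated noise pixels, closing fills small holes, and the surviving pixel coordinates form the position list P:

```python
import numpy as np

def _erode(mask: np.ndarray) -> np.ndarray:
    """Binary erosion with a 3x3 structuring element (zero-padded)."""
    h, w = mask.shape
    padded = np.pad(mask, 1, constant_values=0)
    out = np.ones_like(mask)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            out &= padded[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
    return out

def _dilate(mask: np.ndarray) -> np.ndarray:
    """Binary dilation with a 3x3 structuring element (zero-padded)."""
    h, w = mask.shape
    padded = np.pad(mask, 1, constant_values=0)
    out = np.zeros_like(mask)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            out |= padded[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
    return out

def opening(mask):  # removes small noise and thin connections
    return _dilate(_erode(mask))

def closing(mask):  # fills holes and cracks inside objects
    return _erode(_dilate(mask))

def ground_positions(mask: np.ndarray):
    """Coordinates of the cleaned ground region: the position list
    P(p1, ..., pn) of step S25, stored in coordinate order."""
    ys, xs = np.nonzero(closing(opening(mask)))
    return list(zip(ys.tolist(), xs.tolist()))
```

In practice `cv2.morphologyEx` with `cv2.MORPH_OPEN` / `cv2.MORPH_CLOSE` performs the same operations more efficiently; the sketch only makes the mechanics explicit.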
S3, inputting each real-time image into a pre-trained human body detection model, and outputting human body region position information;
s4, carrying out depth analysis on the human body region position information and the ground region position information to obtain a ground region average depth sequence and a human body region average depth sequence corresponding to the multi-frame time sequence image;
Specifically, the human body position is detected by the YOLOv5s target detection algorithm to obtain the pixel set of the human body area, from which the human body area position information is obtained; combined with the ground area position information, the image captured by the monocular camera can be interpreted as a three-dimensional scene. The average depth of the ground area and the average depth of the human body area are then calculated. Because a fallen human body exhibits abnormal depth changes, such as a sudden decrease or increase in the distance between the human body and the ground, using these two average depths to detect falls reduces false detections and missed detections, and improves the accuracy and stability of human body fall detection.
In one embodiment, referring to fig. 4, the step S4 includes:
S41, acquiring a training image with a depth label in a video monitoring scene, and inputting the training image into a deep learning network to obtain a training model;
specifically, first, a training image with a depth label needs to be acquired and input into a deep learning network for training, so as to obtain a model capable of performing depth estimation on a monocular image. In the training process, the deep learning network learns the corresponding relation between the depth information in the input training image and the pixel value of the training image.
S42, inputting the real-time image into the training model for forward propagation to obtain depth values corresponding to all pixel points in the real-time image;
Specifically, after a trained model is obtained, a required monocular image is input into the model for forward propagation, and depth values corresponding to each pixel point in the input monocular image are obtained, wherein the depth values provide more accurate data support for subsequent falling detection.
S43, according to the human body region position information and the ground region position information, combining depth values corresponding to all pixel points in the real-time image to obtain a human body region depth value sequence and a ground region depth value sequence corresponding to a multi-frame time sequence image;
in one embodiment, referring to fig. 5, the step S43 includes:
s431, outputting depth values corresponding to pixel points in the human body region to the human body region depth value sequence according to the human body region position information;
Specifically, for the human body region, the coordinate position of each pixel point in the human body region can be determined according to the segmentation result of the previous human body detection model, then the depth value belonging to the human body region is screened out from the depth values corresponding to all the pixel points in the monocular image, and the depth value is output to the depth value sequence of the human body region.
And S432, outputting depth values corresponding to the pixel points in the ground area to the depth value sequence of the ground area according to the position information of the ground area.
Specifically, the ground area position information is obtained, and the set of depth values corresponding to the pixels in the ground area is taken as the ground area depth value sequence D(d1, d2, d3, ..., dn), where d1 to dn represent the depth values corresponding to the first to the nth pixel point in the segmented ground area.
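Steps S431–S432 reduce to indexing the per-pixel depth map with the stored region positions; a minimal helper (function name assumed) is:

```python
import numpy as np

def region_depth_sequence(depth_map: np.ndarray, positions) -> list:
    """Collect the depth values of the pixels listed in `positions`
    (the human-region or ground-region position information, as (y, x)
    coordinates) into a depth value sequence D(d1, ..., dn)."""
    return [float(depth_map[y, x]) for (y, x) in positions]
```

The same helper serves both regions: called once with the human body region positions it yields the human body region depth value sequence, and once with the ground region positions the ground region depth value sequence.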
S44, obtaining the ground area average depth sequence and the human body area average depth sequence according to the human body area depth value sequence and the ground area depth value sequence.
Specifically, the ground area average depth and the human body area average depth are obtained from the human body area depth value sequence and the ground area depth value sequence; once both are available, whether the human body has fallen can be judged accurately. When a person falls, the average depth of the human body area decreases significantly and deviates greatly from the average depth of the ground area. Therefore, comparing the average depth of the human body area with that of the ground area allows a fall to be judged more reliably and misjudgments to be avoided.
In one embodiment, referring to fig. 6, the step S44 includes:
S441, acquiring a first pixel point sequence in a human body area and a second pixel point sequence in a ground area according to the human body area position information and the ground area position information;
S442, obtaining a first sliding window and a second sliding window according to the first pixel point sequence and the second pixel point sequence;
S443, obtaining a first filtering depth value sequence and a second filtering depth value sequence according to the first sliding window and the second sliding window;
Specifically, for each pixel point in the first pixel point sequence, its adjacent points (w adjacent points before and after) are taken to form a sliding window of length 2w+1, which is used as the first sliding window. The median of the depth values of all pixel points in the first sliding window is computed as the filtered depth value, and computing the filtered depth value for every pixel point in the first pixel point sequence yields the first filtered depth value sequence. The second filtered depth value sequence is obtained in the same way, and the sliding window can be adjusted according to actual application requirements. Median filtering of the depth values with a sliding window effectively eliminates isolated noise points and false detection points in the depth map, making it smoother and more reliable: depth maps in practical applications often contain noise and false detections that would greatly affect subsequent algorithms if left unfiltered. Sliding-window median filtering therefore effectively improves the robustness and accuracy of the algorithm, and the window size can be tuned to achieve the best filtering effect.
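The sliding-window median filtering of step S443 can be sketched with the standard library; w is the half-width from the text (window length 2w+1), and the window is truncated at the sequence ends:

```python
import statistics

def median_filter(seq, w: int = 1):
    """Sliding-window median of a depth value sequence: each element is
    replaced by the median of itself and its w neighbours on either
    side (window length 2w+1, truncated at the sequence ends)."""
    out = []
    for i in range(len(seq)):
        window = seq[max(0, i - w): i + w + 1]
        out.append(statistics.median(window))
    return out
```

An isolated spike (a false detection point) is fully suppressed as long as the window is wider than the spike, which is exactly the robustness property the text relies on.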
S444, according to the first filtering depth value sequence and the second filtering depth value sequence, sliding window processing is conducted on the human body region depth value sequence and the ground region depth value sequence, and the average depth of the ground region and the average depth of the human body region are obtained.
Specifically, the first filtered depth value sequence D1_filtered and the second filtered depth value sequence D2_filtered are obtained, and the ground area average depth and the human body area average depth are calculated using the following formulas:
D_person_avg = (d11_filtered + d12_filtered + ... + d1n_filtered) / n;
D_ground_avg = (d21_filtered + d22_filtered + ... + d2m_filtered) / m;
Where D_person_avg represents the average depth of the human body region, D_ground_avg represents the average depth of the ground region, d11_filtered to d1n_filtered represent the filtered depth values corresponding to the first to the nth pixel in the first filtered depth value sequence, d21_filtered to d2m_filtered represent the filtered depth values corresponding to the first to the mth pixel in the second filtered depth value sequence, n is the number of points in the first filtered depth value sequence, and m is the number of points in the second filtered depth value sequence.
And S5, carrying out depth difference calculation on the average depth sequence of the human body region and the average depth sequence of the ground region, and detecting falling of the human body according to the depth difference sequence.
Specifically, by acquiring a human body region average depth sequence corresponding to a multi-frame monocular image, more accurate human body position and posture information is obtained, and then according to a depth difference sequence between the human body region average depth sequence and the ground region average depth, the human body falling behavior is detected more accurately.
In one embodiment, referring to fig. 7, the step S5 includes:
S51, acquiring a preset depth difference threshold;
Specifically, in practical applications the required depth difference threshold differs with the scene and with the way each camera acquires depth information. If the threshold is set too small, the missed detection rate may be high, i.e. genuine falls go undetected; if it is set too large, the false detection rate may be high, i.e. situations in which the person has not fallen are misjudged as falls. A suitable threshold T is therefore preset for judging whether the depth difference between the human body and the ground is small enough. This threshold may be adjusted based on the actual scene and camera parameters.
S52, each depth difference in the depth difference sequence is compared in turn with the depth difference threshold to obtain a comparison result;
Specifically, for each input monocular image, the difference between the average depth of the human body region and the ground depth is calculated according to the following formula:
delta_D_i = abs(D_person_avg_i - D_ground_avg)
Each calculated depth difference is compared in turn with the depth difference threshold: when the depth difference is smaller than the threshold, the human body is considered to have fallen; when it is not smaller than the threshold, the human body is considered not to have fallen. Judging against the preset depth difference threshold makes it possible to detect accurately whether the human body has fallen, enabling effective video monitoring and safety management.
And S53, when the comparison results corresponding to the continuous multi-frame images are that the depth difference is smaller than the depth difference threshold value, obtaining that the human body falls down on the ground.
In particular, the depth differences delta_D_i of all frames are traversed, and if the depth differences of consecutive multiple frames (e.g., 3 or more consecutive frames) are all smaller than the threshold T, the human body is considered to have fallen on the ground. Judging a fall from the depth differences of consecutive frames improves the accuracy and reliability of detection: because the position and posture of the human body change greatly during a fall, judging from the depth difference of a single frame alone may produce misjudgments, whereas requiring consecutive frames reduces that possibility. At the same time, the detection sensitivity and real-time performance can be controlled by adjusting the number of consecutive frames and the threshold, so as to better meet the requirements of different scenes.
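Steps S51 to S53 can be sketched as the following decision routine. This is an illustrative sketch only: the function name, the example threshold T and the consecutive-frame count k are assumptions chosen for demonstration, not values given by the specification.

```python
def detect_fall(person_avg_seq, ground_avg, T=0.3, k=3):
    """Return True when |D_person_avg_i - D_ground_avg| < T holds for
    at least k consecutive frames (sketch of S51-S53; T and k are
    scene-dependent and would be tuned in practice)."""
    consecutive = 0
    for d_person in person_avg_seq:
        delta = abs(d_person - ground_avg)  # delta_D_i for this frame
        if delta < T:
            consecutive += 1
            if consecutive >= k:
                return True                 # fall confirmed
        else:
            consecutive = 0                 # streak broken, reset
    return False

# The person's average depth converges to the ground depth and stays there
# for three consecutive frames, so a fall is reported:
seq = [1.2, 1.1, 2.9, 3.0, 3.05, 2.95]
print(detect_fall(seq, ground_avg=3.0, T=0.3, k=3))  # prints True
```

Requiring k consecutive frames below T, rather than a single frame, is what filters out transient posture changes such as crouching briefly.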
Example 2
Referring to fig. 8, embodiment 2 of the present invention further provides a ground area detection device in a video surveillance scene, where the device includes:
the image acquisition module is used for acquiring a real-time video stream in a video monitoring scene and decomposing the real-time video stream into multi-frame real-time images;
the image segmentation module is used for carrying out image segmentation on the real-time image to obtain ground area position information if a human body target is not identified in the current monitoring scene and the current monitoring visual angle is not rotated;
In an embodiment, if the human body target is not identified in the current monitoring scene and the current monitoring view angle is not rotated, performing image segmentation on the real-time image to obtain the ground area position information further includes:
The human body target detection model acquisition unit is used for acquiring a pre-trained human body target detection model based on the YOLOv5s architecture;
the human body target detection unit is used for inputting the real-time image into the human body target detection model and identifying whether a human body target appears in the current monitoring scene;
And the monitoring visual angle rotation identification unit is used for monitoring the rotation of the camera by using a visual odometry method and identifying whether the current monitoring visual angle rotates or not.
In an embodiment, the image segmentation module comprises:
The filtering and color space converting unit is used for carrying out filtering processing and color space conversion on the real-time image to obtain a color space image if a human body target is not identified in the current monitoring scene and the current monitoring visual angle is not rotated, wherein the real-time image is a monocular image;
The feature extraction and fusion unit is used for extracting features of the color space image, and fusing the extracted local features and global features to obtain multi-scale features;
the probability map acquisition unit is used for acquiring the probability of each pixel point in the ground area according to the multi-scale characteristics to obtain a probability map;
the probability map conversion unit is used for acquiring a preset probability threshold value and converting the probability map into a binary segmentation map;
And the morphology processing unit is used for performing morphology processing on the binary segmentation map and taking the pixel point position information in the ground area as the ground area position information.
The human body detection module is used for inputting each real-time image into a pre-trained human body detection model and outputting human body region position information;
The depth analysis module is used for carrying out depth analysis on the human body region position information and the ground region position information to obtain a ground region average depth sequence and a human body region average depth sequence corresponding to the multi-frame time sequence images;
in an embodiment, the depth analysis module comprises:
The model training unit is used for acquiring training images with depth labels in the video monitoring scene, inputting the training images into a deep learning network and obtaining a training model;
The forward propagation unit is used for inputting the real-time image into the training model for forward propagation to obtain depth values corresponding to all pixel points in the real-time image;
The depth value sequence acquisition unit is used for acquiring a human body region depth value sequence and a ground region depth value sequence corresponding to a multi-frame time sequence image according to the human body region position information and the ground region position information and combining depth values corresponding to all pixel points in the real-time image;
in an embodiment, the depth value sequence acquisition unit includes:
The human body region depth value sequence obtaining subunit is used for outputting depth values corresponding to pixel points in a human body region to the human body region depth value sequence according to the human body region position information;
And the ground area depth value sequence acquisition subunit is used for outputting depth values corresponding to the pixel points in the ground area to the ground area depth value sequence according to the ground area position information.
The average depth sequence unit is used for obtaining the ground area average depth sequence and the human body area average depth sequence according to the human body area depth value sequence and the ground area depth value sequence.
In an embodiment, the average depth sequence unit includes:
The pixel point sequence acquisition subunit is used for acquiring a first pixel point sequence in the human body region and a second pixel point sequence in the ground region according to the human body region position information and the ground region position information;
the sliding window acquisition subunit is used for obtaining a first sliding window and a second sliding window according to the first pixel point sequence and the second pixel point sequence;
the filtering processing subunit is used for obtaining a first filtering depth value sequence and a second filtering depth value sequence according to the first sliding window and the second sliding window;
And the sliding window processing subunit is used for carrying out sliding window processing on the human body region depth value sequence and the ground region depth value sequence according to the first filtering depth value sequence and the second filtering depth value sequence to obtain the ground region average depth and the human body region average depth.
And the falling detection module is used for calculating the depth difference of the human body region average depth sequence and the ground region average depth sequence and detecting falling of the human body according to the depth difference sequence.
In an embodiment, the fall detection module comprises:
the depth difference threshold value acquisition unit is used for acquiring a preset depth difference threshold value;
the depth difference comparison unit is used for sequentially comparing each depth difference in the depth difference sequence with the depth difference threshold value to obtain a comparison result;
And the falling judgment unit is used for concluding that the human body has fallen on the ground when the comparison results corresponding to the consecutive multiple frames of images all indicate that the depth difference is smaller than the depth difference threshold.
The ground area detection device in the video monitoring scene comprises an image acquisition module, an image segmentation module, a human body detection module, a depth analysis module and a fall detection module. The image acquisition module is used for acquiring a real-time video stream in the video monitoring scene and decomposing the real-time video stream into multiple frames of real-time images. The image segmentation module is used for performing image segmentation on the real-time images to obtain ground area position information if a human body target is not recognized in the current monitoring scene and the current monitoring visual angle has not rotated. The human body detection module is used for inputting each real-time image into a pre-trained human body detection model and outputting the human body area position information. The depth analysis module is used for performing depth analysis on the human body area position information and the ground area position information to obtain a ground area average depth sequence and a human body area average depth sequence corresponding to the multiple frames of time sequence images. The fall detection module is used for performing depth difference calculation on the human body area average depth sequence and the ground area average depth sequence and detecting a human fall according to the depth difference sequence.
The device relies on the real-time video stream of a video monitoring scene, obtained from an ordinary monitoring camera, and does not require an additional depth sensor, so hardware cost is reduced. Using a pre-trained human body detection model means the model need not be trained from scratch, which reduces development cost; many pre-trained models are trained on large-scale data and suit human body detection in common scenes. The position information of the ground area is obtained by real-time image segmentation using common computer vision algorithms, without complex depth image processing, further reducing cost. Analyzing the average depth sequences of the ground and the human body region over multiple frames of time sequence images improves the adaptability of the system to dynamic environment changes and reduces the probability of misjudgment. Through the depth difference calculation, the system attends to changes in relative depth rather than absolute depth values, which improves detection stability.
Example 3
In addition, the ground area detection method in the video surveillance scene of embodiment 1 of the present invention described in connection with fig. 1 may be implemented by an electronic device. Fig. 9 shows a schematic hardware structure of an electronic device according to embodiment 3 of the present invention.
The electronic device may include a processor and memory storing computer program instructions.
In particular, the processor may comprise a Central Processing Unit (CPU), or an Application Specific Integrated Circuit (ASIC), or may be configured as one or more integrated circuits that implement embodiments of the present invention.
The memory may include mass storage for data or instructions. By way of example, and not limitation, the memory may comprise a Hard Disk Drive (HDD), floppy disk drive, flash memory, optical disk, magneto-optical disk, magnetic tape, or Universal Serial Bus (USB) drive, or a combination of two or more of these. The memory may include removable or non-removable (or fixed) media, where appropriate. The memory may be internal or external to the data processing apparatus, where appropriate. In a particular embodiment, the memory is a non-volatile solid state memory. In a particular embodiment, the memory includes Read Only Memory (ROM). The ROM may be mask-programmed ROM, Programmable ROM (PROM), Erasable PROM (EPROM), Electrically Erasable PROM (EEPROM), Electrically Alterable ROM (EAROM), or flash memory, or a combination of two or more of these, where appropriate.
The processor reads and executes the computer program instructions stored in the memory to implement the ground area detection method in any one of the video surveillance scenes of the above embodiment.
In one example, the electronic device may also include a communication interface and a bus. The processor, the memory, and the communication interface are connected by a bus and complete communication with each other, as shown in fig. 9.
The communication interface is mainly used for realizing communication among the modules, the devices, the units and/or the equipment in the embodiment of the invention.
The bus includes hardware, software, or both that couple the components of the device to each other. By way of example, and not limitation, the bus may include an Accelerated Graphics Port (AGP) or other graphics bus, an Enhanced Industry Standard Architecture (EISA) bus, a Front Side Bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand interconnect, a Low Pin Count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCIe) bus, a Serial Advanced Technology Attachment (SATA) bus, a Video Electronics Standards Association local (VLB) bus, or another suitable bus, or a combination of two or more of the above. The bus may include one or more buses, where appropriate. Although embodiments of the invention have been described and illustrated with respect to a particular bus, the invention contemplates any suitable bus or interconnect.
Example 4
In addition, in combination with the above-mentioned ground area detection method in the video surveillance scene in embodiment 1, embodiment 4 of the present invention may also provide a computer readable storage medium. The computer readable storage medium has stored thereon computer program instructions which when executed by a processor implement the method for detecting a ground area in any one of the video surveillance scenarios of the above embodiments.
In summary, the embodiment of the invention provides a method, a device, equipment and a medium for detecting a ground area in a video monitoring scene.
It should be understood that the invention is not limited to the particular arrangements and instrumentality described above and shown in the drawings. For the sake of brevity, a detailed description of known methods is omitted here. In the above embodiments, several specific steps are described and shown as examples. The method processes of the present invention are not limited to the specific steps described and shown, but various changes, modifications and additions, or the order between steps may be made by those skilled in the art after appreciating the spirit of the present invention.
The functional blocks shown in the above-described structural block diagrams may be implemented in hardware, software, firmware, or a combination thereof. When implemented in hardware, it may be, for example, an electronic circuit, an Application Specific Integrated Circuit (ASIC), suitable firmware, a plug-in, a function card, or the like. When implemented in software, the elements of the invention are the programs or code segments used to perform the required tasks. The program or code segments may be stored in a machine-readable medium or transmitted over transmission media or communication links by a data signal carried in a carrier wave. A "machine-readable medium" may include any medium that can store or transfer information. Examples of machine-readable media include electronic circuitry, semiconductor memory devices, ROM, flash memory, Erasable ROM (EROM), floppy disks, CD-ROMs, optical disks, hard disks, fiber optic media, Radio Frequency (RF) links, and the like. The code segments may be downloaded via computer networks such as the Internet, intranets, etc.
It should also be noted that the exemplary embodiments mentioned in this disclosure describe some methods or systems based on a series of steps or devices. The present invention is not limited to the order of the above-described steps, that is, the steps may be performed in the order mentioned in the embodiments, or may be performed in a different order from the order in the embodiments, or several steps may be performed simultaneously.
In the foregoing, only the specific embodiments of the present invention are described, and it will be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the systems, modules and units described above may refer to the corresponding processes in the foregoing method embodiments, which are not repeated herein. It should be understood that the scope of the present invention is not limited thereto, and any equivalent modifications or substitutions can be easily made by those skilled in the art within the technical scope of the present invention, and they should be included in the scope of the present invention.