Method and device for detecting and alarming human body falls in real time based on video
Technical Field
The invention relates to the field of real-time object detection, in particular to a method and a device for detecting and alarming human body falls in real time based on video.
Background
According to a report by Focus, the worldwide population over 60 years of age is expected to reach 2 billion by 2050, more than one fifth of the global population. Because bodily functions decline markedly with age, most elderly people suffer from cardiovascular disease, osteoporosis and similar conditions, and the side effects of these illnesses and their medications further increase the likelihood of falls. A fall can cause sprains, contusions and fractures in the elderly, and may even trigger other illnesses. If an elderly person cannot be treated and rescued in time after a fall, his or her life is inevitably seriously endangered.
As the elderly population grows, more advanced home monitoring is needed that still allows individuals to maintain personal autonomy and privacy. According to data from the U.S. Centers for Disease Control and Prevention, nearly one quarter of the elderly fall each year, making falls the leading cause of traumatic hospitalization. Current fall detection products on the market are contact-based and sensor-driven. One category is the wearable-sensor scheme, in which a multi-axis acceleration sensor worn on the elderly person's body judges whether a fall has occurred from acceleration readings; the other is the ambient scheme, in which sound and impact sensors installed on the floor of the home environment judge a fall from environmental signals such as sound and vibration. Both have drawbacks. The wearable scheme, for example the wrist-strap pressure-sensing fall detection device and alarm system of publication CN108041772A, suffers from limitations such as the elderly disliking or forgetting to wear the device, a fixed wearing position, and limited battery life. The ambient scheme, for example the fall-detecting floor and method of publication CN111538264A, requires modifying the living environment, and the resulting system is complex and costly. The invention therefore provides a device that detects falls through video processing, capable of detecting multiple targets and judging whether each target has fallen.
Disclosure of Invention
Aimed at people prone to falling, and starting from the practical requirement of preserving personal autonomy and privacy, the invention provides a method and a device for detecting and alarming human body falls in real time based on video. The main idea of the invention is that when a fall-prone person falls, the device sends a fall message to the person's relatives or to community social workers as an alarm. Because the fall detection system uses a video-based image detection solution, it provides contact-free sensing from accurate image data: no sensor needs to be worn on the body of fall-prone people or installed in the home environment, and a camera alone suffices to judge whether a target has fallen and to raise an alarm. Since a camera raises user privacy concerns, all processing is performed on a local edge computing chip rather than a cloud server, so falls are detected without disclosing the user's private data.
The object of the invention is achieved by at least one of the following technical solutions.
A method for detecting and alarming human body falls in real time based on video comprises the following steps:
S1, adding human body images in a falling state to a public human body detection data set to establish a fall detection data set;
S2, pruning the YOLOv2-Tiny network to build a fall detection model, and training the fall detection model on the self-built fall data set;
S3, assigning different thresholds according to the sensitivity of the aspect ratio α and the barycenter offset d of the images before and after a human body falls, so as to obtain new fall judgment parameters and judge falls;
S4, acquiring real-time video as the fall-detection video stream and feeding it into the fall detection model, wherein the fall detection model first performs human target detection on the frame images obtained from the input video, and then performs fall detection and alarming on the identified human targets.
Further, in step S1, the public human body detection data set is preliminarily screened and useless data are removed, the useless data being images that contain only partial hands or legs and no torso features; images in which at least 80% of the human body features appear are the useful data;
Multiple human fall videos are collected and shot, frames are extracted from the fall videos at several frames per second, and the pictures between the standing state and the fully fallen state are manually selected to obtain a data set of human falling postures; the selected pictures are then annotated, all of them with the label person;
Data augmentation is applied to the falling-posture data set: each picture is processed with image operations including rotation, translation and stretching to generate several augmented pictures, yielding the augmented falling-posture data set;
The augmented falling-posture data set is merged with the screened public human body detection data set to establish the fall detection data set.
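As an illustration of the augmentation step, the following sketch applies an affine transform (rotation, translation, stretching) to an image array using NumPy only. It is a minimal nearest-neighbour implementation, not the patent's actual tooling; in practice the bounding-box annotations must be transformed together with the pixels.

```python
import numpy as np

def affine_augment(img, angle_deg=0.0, tx=0, ty=0, sx=1.0, sy=1.0):
    """Rotate/translate/stretch an image about its centre (nearest-neighbour).

    img: HxW or HxWxC array; tx, ty: translation in pixels;
    sx, sy: stretch factors along x and y.
    """
    h, w = img.shape[:2]
    a = np.deg2rad(angle_deg)
    cos, sin = np.cos(a), np.sin(a)
    ys, xs = np.mgrid[0:h, 0:w]
    # inverse mapping: for each output pixel, find the source pixel
    xc = xs - w / 2 - tx                      # undo translation, move to centre
    yc = ys - h / 2 - ty
    src_x = ( cos * xc + sin * yc) / sx + w / 2   # undo rotation and stretch
    src_y = (-sin * xc + cos * yc) / sy + h / 2
    src_x = np.clip(np.round(src_x).astype(int), 0, w - 1)
    src_y = np.clip(np.round(src_y).astype(int), 0, h - 1)
    return img[src_y, src_x]
```

Calling this several times per picture with different parameters (e.g. small rotations, shifts, and stretch factors) yields the augmented copies described above.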
In step S2, sparsity training is applied to the original weights of the YOLOv2-Tiny network model, namely the scaling factor γ of the BN layer is introduced for each channel and multiplies that channel's output;
The channel pruning and fine-tuning are specifically as follows:
After an L1 regularization term on the scaling factors is introduced, the scaling factors of the resulting model all tend toward 0; the absolute values of the scaling factors are then sorted, the scaling factor at the 80% position of the ascending order is taken as the threshold, and the channels whose scaling factors γ fall below the threshold are cut off, which essentially removes the convolution kernels corresponding to those channels. This yields a compact network with fewer parameters, a smaller runtime memory footprint and a lower computation cost, namely the Prune-YOLOv2-Tiny network model;
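The percentile-based pruning rule can be sketched as follows with NumPy only. The γ values below are toy numbers; in practice they come from the trained BN layers, and pruning additionally removes the matching convolution kernels from the network definition.

```python
import numpy as np

def prune_mask(gammas, prune_ratio=0.8):
    """Keep-mask per layer: prune channels whose BN scaling factor |gamma|
    falls at or below the prune_ratio position of all |gamma| values sorted
    in ascending order. (During sparsity training the loss also carries an
    L1 term, loss += lam * sum(|gamma|), which drives gammas toward 0.)"""
    mags = np.abs(np.concatenate([g.ravel() for g in gammas]))
    threshold = np.sort(mags)[int(prune_ratio * len(mags)) - 1]
    return [np.abs(g) > threshold for g in gammas]

# toy example: BN scaling factors of two layers, half near zero
g1 = np.array([0.001, 0.9, 0.0005, 0.7])
g2 = np.array([0.002, 0.8, 0.0001, 0.6])
masks = prune_mask([g1, g2], prune_ratio=0.5)   # keep the larger half
```

Channels where the mask is False are the ones cut from the network.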
The class count C in the detection layer of the Prune-YOLOv2-Tiny network model is changed to a single class, namely 1; meanwhile the values of the 5 anchors in the detection layer are updated for the fall detection data set using the K-means algorithm, the anchor values being the widths and heights of the prediction boxes; and the number R of convolution kernels in the last convolutional layer of the Prune-YOLOv2-Tiny network model is computed with formula (1) and modified to the corresponding value 30:
R = Anchors * (5 + C) (1)
wherein Anchors is the number of prediction boxes of the Prune-YOLOv2-Tiny network model, namely 5; the resulting R is the output channel size, used to obtain the prediction output tensor and generate target boxes, thereby yielding the fall detection model.
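A minimal sketch of the anchor update and of formula (1) follows. The (width, height) samples are made up for illustration, and plain Euclidean K-means is used, whereas YOLO-style pipelines often cluster with a 1 - IoU distance instead.

```python
import numpy as np

def kmeans_anchors(boxes, k=5, iters=50, seed=0):
    """Lloyd's K-means over (w, h) box sizes to pick k anchor shapes."""
    rng = np.random.default_rng(seed)
    centers = boxes[rng.choice(len(boxes), k, replace=False)].copy()
    for _ in range(iters):
        dist = np.linalg.norm(boxes[:, None, :] - centers[None, :, :], axis=2)
        assign = dist.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):          # keep old centre if a cluster empties
                centers[j] = boxes[assign == j].mean(axis=0)
    return centers

# made-up (width, height) pairs standing in for the fall data set's boxes
boxes = np.array([[20, 60], [22, 58], [80, 30], [78, 32], [50, 50],
                  [48, 52], [10, 10], [12, 12], [90, 90], [88, 92]], float)
anchor_wh = kmeans_anchors(boxes, k=5)

num_anchors, C = 5, 1          # single 'person' class
R = num_anchors * (5 + C)      # formula (1): 5 box parameters plus C class scores
```

With 5 anchors and 1 class, R evaluates to 30, the modified kernel count of the last convolutional layer.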
Further, in step S2, the fall detection model is trained on the data set to obtain the weights of the fall detection model, specifically as follows:
The data set is divided into a test set and a training set at a ratio of 1:9, and for each training picture in the training set, the IoU loss, classification loss and coordinate loss between the model prediction and the ground-truth label are computed on a fall detection model loaded with pre-training weights, the pre-training weights being existing weights;
when the fall detection model has fitted, the weights of the fall detection model are saved, the learning rate is adjusted, and the next training round begins.
Further, under the same test set, the fall detection model weights obtained from different training rounds are compared by their precision and recall scores on the test set, and the weight with the highest score is selected as the human detection weight of the final fall detection model.
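The selection step can be sketched by combining precision and recall into a single score per candidate weight file; the patent does not fix the exact combined score, so the F1 score (harmonic mean) and the candidate numbers below are assumptions for illustration.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# hypothetical (precision, recall) of weights saved after three training rounds,
# all evaluated on the same test set
candidates = {
    "round_10": (0.91, 0.85),
    "round_20": (0.93, 0.90),
    "round_30": (0.94, 0.84),
}
best = max(candidates, key=lambda name: f1(*candidates[name]))
```

The weight file with the highest score (`best`) becomes the human detection weight of the final model.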
Further, in step S3, in the fall detection model, the aspect ratio α is calculated using formula (2), specifically as follows:
α_t = w_t / h_t (2)
wherein α_t is the aspect ratio (width-to-height ratio) of the human body detected in the image at the t-th frame, h_t is the height of the human body detected in the image at the t-th frame, and w_t is the width of the human body detected in the image at the t-th frame;
the barycenter offset d is calculated using formula (3), specifically as follows:
d_{t+1} = |P_t - P_{t+1}| (3)
wherein d_{t+1} is the barycenter offset between the human body detected in the image at the (t+1)-th frame and the human body detected in the image of the previous frame, and P_t and P_{t+1} are the barycenter positions of the human body detected in the images at the t-th and (t+1)-th frames respectively, comprising the abscissa and ordinate position information;
Different thresholds are assigned to the combination of the aspect ratio and the barycenter offset to obtain new fall judgment parameters, by which whether the detected human body has fallen is judged, specifically as follows:
when α_t ≥ 1.1 and d_{t+1} ≥ 0.08 * w_t, a fall is judged to have occurred;
in the other case, when the aspect ratio of the human body detected in two consecutive frame images exceeds a threshold, namely α_t ≥ 1.5 and α_{t+1} ≥ 1.5, a fall is judged to have occurred; both thresholds were obtained through extensive comparison tests on fall data.
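The two judgment rules can be sketched directly. Here the detection box is assumed to be (cx, cy, w, h), its centre stands in for the barycenter P, and α is taken as the width-to-height ratio, consistent with the thresholds above.

```python
import math

def aspect_ratio(w, h):
    """Formula (2): width-to-height ratio of the detected body box."""
    return w / h

def barycenter_offset(p_prev, p_cur):
    """Formula (3): displacement of the box centre between consecutive frames."""
    return math.dist(p_prev, p_cur)

def is_fall(box_t, box_t1):
    """box = (cx, cy, w, h) for frames t and t+1; True if either rule fires."""
    (cx0, cy0, w0, h0), (cx1, cy1, w1, h1) = box_t, box_t1
    a_t, a_t1 = aspect_ratio(w0, h0), aspect_ratio(w1, h1)
    d_t1 = barycenter_offset((cx0, cy0), (cx1, cy1))
    rule1 = a_t >= 1.1 and d_t1 >= 0.08 * w0      # widening box + large shift
    rule2 = a_t >= 1.5 and a_t1 >= 1.5            # lying posture in two frames
    return rule1 or rule2
```

For example, a box mid-fall (60 wide, 50 tall) that drops 25 pixels by the next frame triggers rule 1, while a stationary upright box triggers neither rule.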
Further, in step S4, real-time video is acquired as the fall-detection video stream and input into the fall detection model; the fall detection model first performs human target detection on the frame images obtained from the input video, then performs fall detection on the identified human targets, and raises an alarm if a fall is judged to have occurred in the real-time video.
A device for detecting and alarming human body falls in real time based on video comprises a camera, a video decoding and encoding device, and an edge computing chip;
The video decoding and encoding device is used for decoding and encoding the images acquired by the camera, and the edge computing chip is used for training the fall detection model and judging in real time whether a human body in the video has fallen.
Further, the video decoding and encoding device is the professional smart IP camera SoC Hi3516DV300; the video acquired by the camera is processed into a 1920x1080@30fps video stream and transmitted to the edge computing chip.
Further, in the edge computing chip, the fall detection model is obtained by training on the public human body detection data set;
The edge computing chip processes the video stream transmitted by the video decoding and encoding device into frame images and feeds them into the fall detection model, which computes with acceleration whether a human target is present in a frame image and whether that target has fallen; if a target is judged to have fallen, the edge computing chip sends an alarm signal.
Compared with the prior art, the invention has the advantages that:
According to the method, a human body detection data set is built specifically for the fall model and the model is pruned, reducing the model size while improving detection precision and speed; deep-learning fall detection on locally processed real-time video is achieved without cloud computing, and falls in multi-person scenes are detected without leaking user privacy. The method also avoids the drawbacks of wearable fall detectors (reluctance to wear, limited battery life) and of ambient fall-detection systems (complex composition, high cost), improving usability and lowering detection cost. The device can be applied to people and places prone to falls, improving rescue efficiency for fallen persons.
Drawings
Fig. 1 is a schematic flow chart of a method for detecting and alarming human body falls in real time based on video in an embodiment of the invention;
Fig. 2 is a flowchart illustrating the operation of a device for detecting and alarming human body falls in real time based on video according to an embodiment of the invention;
Fig. 3 is a flowchart of the fall judgment algorithm in an embodiment of the invention;
Fig. 4 is a training flowchart of the fall detection model according to an embodiment of the invention;
Fig. 5 is a flow chart of the preparation of the fall data set according to an embodiment of the invention;
Fig. 6 is a block diagram of the fall detection model in an embodiment of the invention;
Fig. 7 is a block diagram of a device for detecting and alarming human body falls in real time based on video in Example 3 of the invention;
Fig. 8 is a flowchart of producing a physical fall data set for the elderly in Example 2 of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, a detailed description of the specific implementation of the present invention will be given below with reference to the accompanying drawings and examples.
Example 1:
A method for detecting and alarming human body falls in real time based on video, as shown in Fig. 1, comprises the following steps:
S1, adding human body images in a falling state to the public human body detection data set to establish a fall detection data set, as shown in Fig. 5;
The public human body detection data set is preliminarily screened and useless data are removed, the useless data being images that contain only partial hands or legs and no torso features; images in which at least 80% of the human body features appear are the useful data;
Multiple human fall videos are collected and shot, frames are extracted from the fall videos at 8 frames per second, and the pictures between the standing state and the fully fallen state are manually selected to obtain a data set of human falling postures; the selected pictures are then annotated, all of them with the label person;
Data augmentation is applied to the falling-posture data set: each picture is processed with image operations including rotation, translation and stretching to generate 10 augmented pictures, yielding the augmented falling-posture data set;
The augmented falling-posture data set is merged with the screened public human body detection data set to establish the fall detection data set.
S2, pruning the YOLOv2-Tiny network to build a fall detection model, and training the fall detection model on the self-built fall data set;
Sparsity training is applied to the original weights of the YOLOv2-Tiny network model, namely the scaling factor γ of the BN layer is introduced for each channel and multiplies that channel's output;
The channel pruning and fine-tuning are specifically as follows:
After an L1 regularization term on the scaling factors is introduced, the scaling factors of the resulting model all tend toward 0; the absolute values of the scaling factors are then sorted, the scaling factor at the 80% position of the ascending order is taken as the threshold, and the channels whose scaling factors γ fall below the threshold are cut off, which essentially removes the convolution kernels corresponding to those channels. This yields a compact network with fewer parameters, a smaller runtime memory footprint and a lower computation cost, namely the Prune-YOLOv2-Tiny network model;
The class count C in the detection layer of the Prune-YOLOv2-Tiny network model is changed to a single class, namely 1; meanwhile the values of the 5 anchors in the detection layer are updated for the fall detection data set using the K-means algorithm, the anchor values being the widths and heights of the prediction boxes; and the number R of convolution kernels in the last convolutional layer of the Prune-YOLOv2-Tiny network model is computed with formula (1) and modified to the corresponding value 30:
R = Anchors * (5 + C) (1)
wherein Anchors is the number of prediction boxes of the Prune-YOLOv2-Tiny network model, namely 5; the resulting R is the output channel size, used to obtain the prediction output tensor and generate target boxes, thereby yielding the fall detection model.
The fall detection model is essentially the Prune-YOLOv2-Tiny network model. It consists of 16 layers of 3 types: convolutional layers (9), max-pooling layers (6) and a final detection layer (1). The convolutional layers perform feature extraction, and the pooling layers downsample the feature maps. An RGB image of arbitrary resolution is converted to the [0,1] interval by dividing each pixel by 255, scaled to 416×416 while preserving the aspect ratio of the original image, with the remainder padded with 0.5. The resulting 416×416×3 array is input into the fall detection model, which outputs a 13×13×30 array after detection: the spatial size is downsampled from 416×416 to 13×13, and 30 is the number of output channels given by formula (1).
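The preprocessing just described (divide by 255, scale to 416×416 preserving the aspect ratio, pad the remainder with 0.5) can be sketched as follows; nearest-neighbour resizing is used here so the sketch needs only NumPy, whereas a real pipeline would typically use bilinear interpolation.

```python
import numpy as np

def letterbox(img, size=416, pad_value=0.5):
    """Scale an HxWx3 uint8 image into a size x size float32 array in [0,1],
    preserving the aspect ratio and padding the remainder with pad_value."""
    h, w = img.shape[:2]
    scale = size / max(h, w)
    nh, nw = int(round(h * scale)), int(round(w * scale))
    # nearest-neighbour resize via index lookup
    ys = (np.arange(nh) / scale).astype(int).clip(0, h - 1)
    xs = (np.arange(nw) / scale).astype(int).clip(0, w - 1)
    resized = img[ys][:, xs].astype(np.float32) / 255.0
    # paste the resized image into the centre of a pad_value canvas
    out = np.full((size, size, 3), pad_value, dtype=np.float32)
    top, left = (size - nh) // 2, (size - nw) // 2
    out[top:top + nh, left:left + nw] = resized
    return out
```

A 100×200 input, for instance, is scaled to 208×416 and centred vertically, with 0.5-valued padding above and below.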
As shown in Fig. 4, the fall detection model is trained on the data set to obtain the weights of the fall detection model, specifically as follows:
The data set is divided into a test set and a training set at a ratio of 1:9, and for each training picture in the training set, the IoU loss, classification loss and coordinate loss between the model prediction and the ground-truth label are computed on a fall detection model loaded with the pre-training weights yolov2-tiny.weights provided by the official YOLO website;
when the fall detection model has fitted, the weights of the fall detection model are saved, the learning rate is adjusted, and the next training round begins.
Further, under the same test set, the fall detection model weights obtained from different training rounds are compared by their precision and recall scores on the test set, and the weight with the highest score is selected as the human detection weight of the final fall detection model.
S3, assigning different thresholds according to the sensitivity of the aspect ratio α and the barycenter offset d of the images before and after a human body falls, so as to obtain new fall judgment parameters and judge falls;
In the fall detection model, the aspect ratio α is calculated using formula (2), specifically as follows:
α_t = w_t / h_t (2)
wherein α_t is the aspect ratio (width-to-height ratio) of the human body detected in the image at the t-th frame, h_t is the height of the human body detected in the image at the t-th frame, and w_t is the width of the human body detected in the image at the t-th frame;
the barycenter offset d is calculated using formula (3), specifically as follows:
d_{t+1} = |P_t - P_{t+1}| (3)
wherein d_{t+1} is the barycenter offset between the human body detected in the image at the (t+1)-th frame and the human body detected in the image of the previous frame, and P_t and P_{t+1} are the barycenter positions of the human body detected in the images at the t-th and (t+1)-th frames respectively, comprising the abscissa and ordinate position information;
Different thresholds are assigned to the combination of the aspect ratio and the barycenter offset to obtain new fall judgment parameters, by which whether the detected human body has fallen is judged, specifically as follows:
when α_t ≥ 1.1 and d_{t+1} ≥ 0.08 * w_t, a fall is judged to have occurred;
in the other case, when the aspect ratio of the human body detected in two consecutive frame images exceeds a threshold, namely α_t ≥ 1.5 and α_{t+1} ≥ 1.5, a fall is judged to have occurred; both thresholds were obtained through extensive comparison tests on fall data.
S4, as shown in Fig. 2, acquiring real-time video as the fall-detection video stream and feeding it into the fall detection model, wherein the fall detection model first performs human target detection on the frame images obtained from the input video, and then performs fall detection and alarming on the identified human targets;
Real-time video is acquired as the fall-detection video stream and input into the fall detection model; the fall detection model performs human target detection on the frame images obtained from the input video, then performs fall detection on the identified human targets, and raises an alarm if a fall is judged to have occurred in the real-time video.
Example 2:
Compared with Example 1, Example 2 adopts the public VOC and COCO data sets from the Internet, as shown in Fig. 8, without adding falling-state images. Because these data sets are huge, human body features are covered widely, so a fallen human target can still be identified without adding falling states; however, the public data sets contain images in which only hands and feet are annotated as a person, which easily causes fall misjudgment.
Example 3:
A device for detecting and alarming human body falls in real time based on video comprises a camera, a video decoding and encoding device, an edge computing chip, a cooling fan and a shell. The tail cable reserves a 5V1A power output, a network port and a 12V2A input; the 12V2A input powers the device through a unified power-supply circuit comprising the FPGA board, the 5V1A output of the tail cable powers the cooling fan for chip heat dissipation, and the network port is used for network communication and data exchange, as shown in Fig. 7.
The video decoding and encoding device is used for decoding and encoding the images acquired by the camera, and the edge computing chip is used for training the fall detection model and judging in real time whether a human body in the video has fallen.
The video decoding and encoding device is the professional smart IP camera SoC Hi3516DV300; it processes the images acquired by the camera into a 1920x1080@30fps video stream and transmits it to the edge computing chip.
A fall detection model is obtained in the edge computing chip by training on the public human body detection data set;
The edge computing chip processes the video stream transmitted by the video decoding and encoding device into frame images and feeds them into the fall detection model, which computes with acceleration whether a human target is present in a frame image and whether that target has fallen. If a target is judged to have fallen, the device transmits data through the network port and notifies relatives or nearby social workers via SMS, WeChat, e-mail or telephone, so that fallen persons are helped in time.