CN120220247A - Abnormal behavior detection method, device, electronic device and non-volatile storage medium - Google Patents
Abnormal behavior detection method, device, electronic device and non-volatile storage medium
- Publication number
- CN120220247A (application CN202510519111.1A)
- Authority
- CN
- China
- Prior art keywords
- video stream
- feature
- features
- behavior
- target
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/0895—Weakly supervised learning, e.g. semi-supervised or self-supervised learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/42—Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/52—Scale-space analysis, e.g. wavelet analysis
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/74—Image or video pattern matching; Proximity measures in feature spaces
- G06V10/761—Proximity, similarity or dissimilarity measures
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/44—Event detection
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Computer Vision & Pattern Recognition (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Software Systems (AREA)
- Computing Systems (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Medical Informatics (AREA)
- Databases & Information Systems (AREA)
- Computational Linguistics (AREA)
- Mathematical Physics (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- General Engineering & Computer Science (AREA)
- Life Sciences & Earth Sciences (AREA)
- Data Mining & Analysis (AREA)
- Molecular Biology (AREA)
- Psychiatry (AREA)
- Social Psychology (AREA)
- Human Computer Interaction (AREA)
- Testing, Inspecting, Measuring Of Stereoscopic Televisions And Televisions (AREA)
Abstract
The application discloses an abnormal behavior detection method and device, an electronic device, and a non-volatile storage medium. The method comprises: obtaining a video stream to be detected, and performing temporal feature extraction on a plurality of consecutive image frames in the video stream to obtain a temporal feature matrix, wherein the temporal feature matrix comprises the feature vectors corresponding to the consecutive image frames, each image frame corresponding to one feature vector; determining the attention weight corresponding to each feature vector in the temporal feature matrix, and determining the global feature of the temporal feature matrix according to the attention weights; performing spatial feature extraction on the image frames to obtain the local features corresponding to the image frames; fusing the global feature and the local features to obtain a target feature; and determining the behavior category of a target object in the video stream according to the target feature. The application thereby solves the technical problem of insufficient detection accuracy in the detection of abnormal behavior in public places in the related art.
Description
Technical Field
The present application relates to the field of computer vision application technologies, and in particular, to a method and apparatus for detecting abnormal behavior, an electronic device, and a nonvolatile storage medium.
Background
With the increasing demand for security management in public places, abnormal behavior detection technology is becoming more and more important in fields such as city monitoring, traffic management, and business security. Abnormal behavior detection not only improves public safety but also provides important data support for city management, enabling optimized resource allocation and emergency response strategies.
Abnormal behavior detection is typically based on video surveillance data, analyzing behavior patterns in the video stream to identify abnormal activity. These techniques use various object detection algorithms to enable real-time processing and analysis of video frames. However, due to interference factors in complex environments (such as illumination changes, occlusion, and multi-person interaction), detection algorithms in the related art often suffer from insufficient detection accuracy when detecting abnormal behaviors in public places.
In view of the above problems, no effective solution has been proposed at present.
Disclosure of Invention
The embodiments of the present application provide an abnormal behavior detection method and device, an electronic device, and a non-volatile storage medium, which at least solve the technical problem of insufficient detection accuracy in the detection of abnormal behavior in public places in the related art.
According to one aspect of the embodiments of the present application, an abnormal behavior detection method is provided, comprising: obtaining a video stream to be detected, and performing temporal feature extraction on a plurality of consecutive image frames in the video stream to obtain a temporal feature matrix, wherein the temporal feature matrix comprises the feature vectors corresponding to the consecutive image frames, each image frame corresponding to one feature vector; determining the attention weight corresponding to each feature vector in the temporal feature matrix, and determining the global feature of the temporal feature matrix according to the attention weights; performing spatial feature extraction on the image frames to obtain the local features corresponding to the image frames; fusing the global feature and the local features to obtain a target feature; and determining the behavior category of a target object in the video stream according to the target feature.
Optionally, a target network model is used when performing feature extraction on the image frames, wherein the training of the target network model comprises: obtaining a training data set comprising a plurality of pieces of unlabeled video stream data, the video stream data containing behavior fragments of target objects under different environmental conditions; performing contrastive learning training on an initial network model according to the training data set using a contrastive loss function, so as to adjust the model parameters of the initial network model, wherein the contrastive learning training determines the similarity between the behavior features corresponding to behavior fragments in different video streams of the training data set, pulls together behavior features whose similarity is greater than a first similarity threshold, and pushes apart behavior features whose similarity is less than a second similarity threshold; and/or performing temporal prediction training on the initial network model, wherein the behavior features of future frames in a video stream are predicted from the behavior features of a plurality of historical image frames in that video stream, and the model parameters of the initial network model are adjusted according to the correlation between the predicted behavior features and the actual behavior features of the future frames.
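The patent does not specify the exact contrastive loss function; as a hedged illustration, the common InfoNCE-style objective below pulls an anchor toward a similar ("positive") behavior feature and pushes it away from dissimilar ("negative") ones, which matches the pull-together/push-apart behavior described above. All names and the temperature value are illustrative assumptions.

```python
import numpy as np

def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style contrastive loss over one anchor feature vector.
    Lower loss means the positive is closer to the anchor than the negatives."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    pos = np.exp(cos(anchor, positive) / temperature)
    neg = sum(np.exp(cos(anchor, n) / temperature) for n in negatives)
    return -np.log(pos / (pos + neg))

rng = np.random.default_rng(1)
a = rng.normal(size=16)
# Positive nearly identical to the anchor, negatives random: loss should be small.
loss_close = info_nce_loss(a, a + 0.01 * rng.normal(size=16),
                           [rng.normal(size=16) for _ in range(5)])
# Positive random, negatives near the anchor: loss should be large.
loss_far = info_nce_loss(a, rng.normal(size=16),
                         [a + 0.01 * rng.normal(size=16) for _ in range(5)])
```

The loss ordering (`loss_close < loss_far`) is exactly the gradient signal that aggregates similar behavior features and separates dissimilar ones during training.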
Optionally, after the training data set is acquired, the method further comprises: generating pseudo video stream data from the real video stream data in the training data set using the generator of a generative adversarial network, wherein the behavior categories of the target objects in the pseudo video stream data comprise various abnormal behaviors; determining the confidence of the pseudo video stream data using the discriminator of the generative adversarial network, the confidence representing the probability that the video stream data is real data; and adding pseudo video stream data whose confidence is higher than a preset confidence threshold to the training data set.
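The screening step above reduces to a simple filter over discriminator confidences. The sketch below mocks the generator and discriminator (the confidence is carried alongside each sample purely for illustration); the threshold value is an assumption.

```python
def filter_pseudo_samples(samples, discriminator, threshold=0.8):
    """Keep only generated clips whose discriminator confidence (probability
    of being real) exceeds the preset threshold, as described in the text."""
    return [s for s in samples if discriminator(s) > threshold]

# Mock pseudo video clips with pre-computed discriminator confidences.
pseudo = [("clip_a", 0.92), ("clip_b", 0.45), ("clip_c", 0.81)]
kept = filter_pseudo_samples(pseudo, discriminator=lambda s: s[1])
# kept now holds clip_a and clip_c; clip_b looks too unrealistic to train on.
```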
Optionally, the method further comprises: determining a network environment state; determining the processing time required for performing abnormal behavior detection on the video stream to be detected on an edge computing device and on a cloud server, respectively, under that network environment state; and when a first processing time corresponding to the edge computing device is less than a second processing time corresponding to the cloud server, performing abnormal behavior detection with the target network model deployed on the edge computing device, wherein the first processing time comprises the time required for data preprocessing and model inference on the edge computing device, and the second processing time comprises the time required for data preprocessing and model inference on the cloud server plus the time required for data transmission between the cloud server and the edge computing device; and when the detected behavior category is an abnormal behavior, issuing a local alert and uploading the alert information to the cloud.
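The edge-versus-cloud scheduling rule above can be sketched as a direct comparison of the two total processing times. The timing inputs are assumed to be measured under the current network state; the function and parameter names are illustrative.

```python
def choose_inference_site(edge_preprocess_s, edge_infer_s,
                          cloud_preprocess_s, cloud_infer_s, transfer_s):
    """Run detection on the edge device when its total processing time beats
    the cloud path, which additionally pays the data-transfer cost."""
    t_edge = edge_preprocess_s + edge_infer_s
    t_cloud = cloud_preprocess_s + cloud_infer_s + transfer_s
    return "edge" if t_edge < t_cloud else "cloud"

# Slow network: the 0.40 s transfer cost makes the edge device the better choice
# even though its inference is slower than the cloud server's.
site = choose_inference_site(0.05, 0.30, 0.02, 0.08, 0.40)
```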
Optionally, the method further comprises: determining the behavior category detected by the target network model as a pseudo label for the video stream to be detected; adding the video stream to be detected and the pseudo label to the training data set as new training data; continuing to train the target network model locally deployed on the edge computing device with the updated training data set; uploading the model parameters updated during training by each edge computing device to the cloud server; and summarizing and fusing, by the cloud server, the updated model parameters uploaded by the different edge computing devices to obtain a new target network model, which is then deployed to each edge computing device.
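The patent only states that the cloud server "summarizes and fuses" the parameters uploaded by the edge devices; a simple federated averaging (FedAvg-style) aggregation is assumed in the sketch below, with toy two-parameter models standing in for real network weights.

```python
import numpy as np

def aggregate_edge_models(edge_params):
    """Fuse per-device parameter dicts by element-wise averaging each named
    parameter across all edge devices (FedAvg-style assumption)."""
    return {name: np.mean([p[name] for p in edge_params], axis=0)
            for name in edge_params[0]}

# Updated parameters uploaded by two edge devices.
device_a = {"w": np.array([1.0, 2.0]), "b": np.array([0.0])}
device_b = {"w": np.array([3.0, 4.0]), "b": np.array([2.0])}
fused = aggregate_edge_models([device_a, device_b])
# fused["w"] is the element-wise mean [2.0, 3.0]; fused["b"] is [1.0].
```

The fused dict would then be redeployed to each edge device as the new target network model, closing the loop described above.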
Optionally, performing spatial feature extraction on an image frame to obtain the local features corresponding to the image frame comprises: determining the gradient magnitude corresponding to each pixel in the image frame, wherein the gradient magnitude represents the edge intensity in the image frame; determining an edge information map for the image frame according to the gradient magnitudes of the pixels, wherein the edge information map at least represents the contour information of the target object in the image frame; extracting features from the image frame with the target network model to obtain an initial feature map; and fusing the edge information map with the initial feature map to obtain the local features corresponding to the image frame.
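The per-pixel gradient-magnitude step above can be illustrated with a minimal sketch. A plain central-difference approximation stands in for a full Sobel filter here (the patent does not name a specific operator), and the edge map would subsequently be fused with the network's initial feature map.

```python
import numpy as np

def gradient_magnitude(gray):
    """Per-pixel gradient magnitude of a grayscale image via central
    differences; high values mark strong edges (object contours)."""
    gx = np.zeros_like(gray, dtype=float)
    gy = np.zeros_like(gray, dtype=float)
    gx[:, 1:-1] = (gray[:, 2:] - gray[:, :-2]) / 2.0   # horizontal gradient
    gy[1:-1, :] = (gray[2:, :] - gray[:-2, :]) / 2.0   # vertical gradient
    return np.sqrt(gx ** 2 + gy ** 2)

# A vertical step edge: the magnitude should peak at the boundary columns.
img = np.zeros((5, 6))
img[:, 3:] = 10.0
edges = gradient_magnitude(img)
```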
Optionally, after the video stream to be detected is acquired, the method further comprises: resizing each image frame to a target size, wherein the target size is an input size supported by the network model used for feature extraction; and normalizing the pixel values of the resized image frame, wherein the normalization scales each pixel value into a preset range.
According to another aspect of the embodiments of the present application, an abnormal behavior detection device is provided, comprising: a data acquisition and feature extraction module, configured to acquire a video stream to be detected and perform temporal feature extraction on a plurality of consecutive image frames in the video stream to obtain a temporal feature matrix, wherein the temporal feature matrix comprises the feature vectors corresponding to the consecutive image frames, each image frame corresponding to one feature vector; a global temporal feature determination module, configured to determine the attention weight corresponding to each feature vector in the temporal feature matrix and determine the global feature of the temporal feature matrix according to the attention weights; a local spatial feature determination module, configured to perform spatial feature extraction on the image frames to obtain the local features corresponding to the image frames; and an abnormal behavior classification and prediction module, configured to fuse the global feature and the local features to obtain a target feature and determine the behavior category of a target object in the video stream according to the target feature.
According to still another aspect of the embodiment of the application, there is also provided an electronic device including a memory and a processor for running a program stored in the memory, wherein the program executes an abnormal behavior detection method when running.
According to still another aspect of the embodiments of the present application, there is also provided a non-volatile storage medium, where the non-volatile storage medium includes a stored computer program, and a device in which the non-volatile storage medium is located executes the abnormal behavior detection method by running the computer program.
According to a further aspect of embodiments of the present application, there is also provided a computer program product comprising a computer program which, when executed by a processor, implements the steps of the abnormal behavior detection method.
In the embodiments of the present application, a video stream to be detected is acquired, and temporal feature extraction is performed on a plurality of consecutive image frames in the video stream to obtain a temporal feature matrix comprising the feature vectors corresponding to the consecutive image frames, each image frame corresponding to one feature vector; the attention weight corresponding to each feature vector in the temporal feature matrix is determined, and the global feature of the temporal feature matrix is determined according to the attention weights; spatial feature extraction is performed on the image frames to obtain the local features corresponding to the image frames; the global feature and the local features are fused to obtain a target feature; and the behavior category of a target object in the video stream is determined according to the target feature.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
Fig. 1 is a block diagram of a hardware structure of a computer terminal (or an electronic device) for implementing a method for abnormal behavior detection according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a method flow for detecting abnormal behavior according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of an abnormal behavior detection apparatus according to an embodiment of the present application.
Detailed Description
In order that those skilled in the art may better understand the present application, the technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those skilled in the art based on the embodiments of the present application without inventive effort shall fall within the scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Because of interference factors in complex environments (such as illumination changes, occlusion, and multi-person interaction), behavior recognition techniques in the related art still face a number of problems in feature extraction, target recognition, and behavior detection, specifically:
1) Insufficient detection accuracy: abnormal behavior detection in public places in the related art often suffers from insufficient detection accuracy. This is mainly because the manifestations of abnormal behavior are diverse and may resemble normal behavior; especially against complex backgrounds and with multiple targets, this can lead to false or missed detections. Such conditions impair the accurate identification of abnormal behavior, so that potential safety hazards may not be discovered and handled in time.
2) Insufficient real-time performance: real-time performance is a key factor in public place monitoring, and the related art has shortcomings in processing speed, particularly for high-resolution video streams or large-scale video surveillance systems. Deep learning models in the related art typically run inference in the cloud, so data transmission and processing latency are high, making them unsuitable for application scenarios with stringent real-time requirements; this can delay the detection of abnormal behavior and impair rapid response and emergency handling.
3) Insufficient model generalization: existing behavior detection models rely on large-scale labeled data for training, but because abnormal behavior samples are scarce, such models adapt poorly when facing novel abnormal behaviors.
In order to solve the above problems, related solutions are provided in the embodiments of the present application, and are described in detail below.
In accordance with an embodiment of the present application, a method embodiment of abnormal behavior detection is provided. It should be noted that the steps illustrated in the flowchart of the figures may be performed in a computer system, such as one executing a set of computer-executable instructions, and that, although a logical order is shown in the flowchart, in some cases the steps illustrated or described may be performed in a different order.
The method embodiments provided by the embodiments of the present application may be performed in a mobile terminal, a computer terminal, or a similar computing device. Fig. 1 shows a hardware block diagram of a computer terminal (or electronic device) for implementing the abnormal behavior detection method. As shown in fig. 1, the computer terminal 10 (or electronic device) may include one or more processors 102 (shown in the figure as 102a, 102b, ..., 102n; the processors 102 may include, but are not limited to, processing means such as a microprocessor (MCU) or a programmable logic device (FPGA)), a memory 104 for storing data, and a transmission device 106 for communication functions. In addition, the computer terminal 10 may include a display, an input/output (I/O) interface, a Universal Serial Bus (USB) port (which may be included as one of the ports of the bus), a network interface, a power supply, and/or a camera. It will be appreciated by those of ordinary skill in the art that the configuration shown in fig. 1 is merely illustrative and is not intended to limit the configuration of the electronic device described above. For example, the computer terminal 10 may also include more or fewer components than shown in fig. 1, or have a different configuration than shown in fig. 1.
It should be noted that the one or more processors 102 and/or other data processing circuits described above may be referred to generally herein as "data processing circuits". The data processing circuit may be embodied in whole or in part in software, hardware, firmware, or any other combination. Furthermore, the data processing circuitry may be a single stand-alone processing module, or incorporated, in whole or in part, into any of the other elements in the computer terminal 10 (or electronic device). As referred to in the embodiments of the present application, the data processing circuit acts as a processor control (e.g., selection of the path of the variable resistor termination connected to the interface).
The memory 104 may be used to store software programs and modules of application software, such as program instructions/data storage devices corresponding to the abnormal behavior detection method in the embodiment of the present application, and the processor 102 executes the software programs and modules stored in the memory 104, thereby executing various functional applications and data processing, that is, implementing the abnormal behavior detection method described above. Memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the computer terminal 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission means 106 is arranged to receive or transmit data via a network. The specific examples of the network described above may include a wireless network provided by a communication provider of the computer terminal 10. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, NIC) that can connect to other network devices through a base station to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module for communicating with the internet wirelessly.
The display may be, for example, a touch screen type Liquid Crystal Display (LCD) that may enable a user to interact with a user interface of the computer terminal 10 (or electronic device).
In the above operating environment, the embodiment of the present application provides an abnormal behavior detection method, and fig. 2 is a schematic diagram of a flow of a method for detecting abnormal behavior according to the embodiment of the present application, as shown in fig. 2, where the method includes the following steps:
step S202, obtaining a video stream to be detected, and extracting time sequence features of a plurality of continuous image frames in the video stream to be detected to obtain a time sequence feature matrix, wherein the time sequence feature matrix comprises feature vectors corresponding to the plurality of continuous image frames, and each image frame corresponds to one feature vector;
Step S204, determining attention weights corresponding to feature vectors in the time sequence feature matrix, and determining global features corresponding to the time sequence feature matrix according to the attention weights;
Step S206, extracting spatial features of the image frames to obtain local features corresponding to the image frames;
Step S208, fusing the global features and the local features to obtain target features, and determining the behavior category of the target object in the video stream to be detected according to the target features.
Through the above steps, by extracting the temporal and spatial features of the video stream and combining an attention mechanism with feature fusion, the behavior category of the target object can be identified accurately, which solves the technical problem of insufficient detection accuracy in the detection of abnormal behavior in public places in the related art.
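The steps S202 to S208 can be sketched end to end as follows. The shapes, the dot-product attention scoring, concatenation-based fusion, and the linear classifier are all illustrative assumptions; the patent leaves these design choices open.

```python
import numpy as np

def detect_behavior(frame_features, local_feature, classifier_weights):
    """Minimal sketch of S202-S208: softmax attention weights over per-frame
    feature vectors, a weighted-sum global feature, fusion with a local
    spatial feature by concatenation, and a linear behavior classifier."""
    # frame_features: (T, D) temporal feature matrix, one vector per frame.
    scores = frame_features @ frame_features.mean(axis=0)  # per-frame relevance
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                               # attention weights (S204)
    global_feature = weights @ frame_features              # (D,) global feature
    target_feature = np.concatenate([global_feature, local_feature])  # fusion (S208)
    logits = classifier_weights @ target_feature           # (num_classes,)
    return int(np.argmax(logits))                          # behavior category index

rng = np.random.default_rng(0)
T, D, C = 16, 8, 3  # 16 frames, 8-dim features, 3 behavior categories
pred = detect_behavior(rng.normal(size=(T, D)), rng.normal(size=D),
                       rng.normal(size=(C, 2 * D)))
```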
The abnormal behavior detection method in steps S202 to S208 in the embodiment of the present application is further described below.
In the embodiments of the present application, the method flow may be executed on network element devices, which may include an embedded smart camera, a mobile embedded system, a communication module, and the like. The network element devices may be connected through the communication module to realize data transmission and interaction: the smart camera may be used to acquire the video stream to be detected, and the mobile embedded system may run the algorithm to realize target detection and prediction. By deploying the algorithm model corresponding to, for example, pedestrian fall detection on the mobile embedded system, the image processing, target detection, and communication functions can be realized in software.
The abnormal behavior detection method in the embodiments of the present application is particularly suitable for detecting abnormal behaviors in public places, especially behaviors that are difficult to identify accurately, such as smoking and falling, and can effectively improve detection accuracy and real-time performance. This is described in detail below.
First, a smart camera may be used to obtain a video stream to be detected, for example, the smart camera captures a real-time video stream through an OpenCV library, and obtains continuous image frames, which ensures that the system can monitor the status of pedestrians in real time. The following formula is shown:
Frame_t = OpenCV.capture(t)
where Frame_t denotes the image frame captured at time t.
In this embodiment, the data enhancement and optimization may be performed on the image frames in the video stream to be detected, and specific steps are as follows.
In some embodiments of the present application, after the video stream to be detected is acquired, the method further includes the steps of adjusting the size of the image frame to a target size, where the target size is an input size supported by a network model used for feature extraction of the image frame, and performing normalization processing on pixel values of each point in the image frame of the target size, where the normalization processing is used to scale the pixel values of each point to a preset range interval.
Specifically, preprocessing each frame of image may include resizing and normalization. In the preprocessing stage, the image is resized to a size acceptable to the model (the target size may be, for example, 640 × 640 pixels), as shown in the following equation:
Image_resized = Resize(Frame_t, 640 × 640)
Next, the pixel values are scaled to between 0 and 1 through normalization to facilitate network processing, as shown in the following equation:
Image_normalized = Image_resized / 255
Such a preprocessing procedure ensures consistency and standardization of the image so that image features can be better extracted later.
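The resizing and normalization steps above can be sketched as follows. This is a minimal numpy illustration: the nearest-neighbour `resize_nearest` is a stand-in for a library resize such as `cv2.resize`, and the 640 × 640 target size and `/ 255` scaling follow the formulas in this section.

```python
import numpy as np

def resize_nearest(frame: np.ndarray, size: tuple) -> np.ndarray:
    """Nearest-neighbour resize to (height, width); a simple stand-in for cv2.resize."""
    h, w = frame.shape[:2]
    th, tw = size
    rows = np.arange(th) * h // th   # source row index for each target row
    cols = np.arange(tw) * w // tw   # source column index for each target column
    return frame[rows][:, cols]

def preprocess(frame: np.ndarray, size: tuple = (640, 640)) -> np.ndarray:
    """Resize to the model input size, then scale pixel values into [0, 1]."""
    resized = resize_nearest(frame, size)
    return resized.astype(np.float32) / 255.0
```

For example, an 8-bit 480 × 640 camera frame becomes a float32 tensor of shape (640, 640, 3) with values in [0, 1], ready for the feature extractor.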
Then, feature extraction can be performed. When behavior recognition is performed in a complex scene, models based on convolutional neural networks in the related art excel at spatial feature extraction but have obvious limitations in capturing the temporal features of behaviors, particularly long-range dependencies. Therefore, the embodiment of the application provides a target recognition optimization method combining a Transformer architecture with multi-scale feature fusion, aiming at improving the model's understanding of long temporal behaviors and its fused expression of multi-scale spatial features, thereby significantly improving the accuracy and robustness of behavior recognition.
In the embodiment of the application, the self-attention mechanism can be utilized to model the characteristics of different time points in the behavior sequence, so that the sensitivity of the model to the cross-frame target change is improved. Meanwhile, in order to avoid neglecting local space details, a convolutional neural network is introduced to extract local features and fused with global time sequence features, so that richer multi-scale feature expression is formed.
Specifically, firstly, temporal feature extraction is performed on a plurality of continuous image frames in the video stream to be detected to obtain a temporal feature matrix X = {x_1, x_2, ..., x_T}, each frame image x_t being encoded into vector form by a CNN. For example, a fixed number of consecutive frames (e.g., 16 frames) may be extracted from the video sequence, and the spatial features of each frame extracted via a convolutional neural network (e.g., ResNet), resulting in an initial temporal feature matrix X ∈ R^(T×D), where T is the number of frames and D is the feature dimension.
Thereafter, X may be input to a Transformer encoder for global modeling using a multi-head attention mechanism. Specifically, to capture global dependencies between different moments, the input features are mapped into queries (Q), keys (K), and values (V):
Q = XW_Q, K = XW_K, V = XW_V
wherein W_Q, W_K, W_V are linear transformation matrices.
Then, the attention weights are calculated based on the self-attention mechanism, i.e., the attention weights corresponding to the feature vectors in the temporal feature matrix are determined, with the specific formula:
Attention(Q, K, V) = Softmax(QK^T / √d_k) V
where d_k is the vector dimension used as a scaling factor to prevent gradients from becoming too large, and the Softmax operation ensures the attention weights are normalized.
A multi-head attention mechanism (Multi-Head Attention) is introduced to capture multiple behavioral relationships simultaneously:
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W_O
where each head_i represents an independent self-attention computation, facilitating the learning of behavioral dependencies from multiple subspaces.
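The attention computations above can be illustrated with a minimal numpy sketch. This is not the full encoder: the output projection W_O is replaced by a plain concatenation for brevity, and the head count and dimensions are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax, so attention weights sum to 1 per row."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: Softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V

def multi_head(X, heads):
    """heads: list of (W_Q, W_K, W_V) triples, one per attention head.
    The final projection W_O is omitted here for brevity."""
    outs = [attention(X @ WQ, X @ WK, X @ WV) for WQ, WK, WV in heads]
    return np.concatenate(outs, axis=-1)
```

With T = 16 frames and feature dimension D = 8, two heads of width 4 yield a (16, 8) global temporal representation, one enriched vector per frame.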
Finally, a global temporal feature F_global is obtained according to the attention weights, reflecting the global dependence of the behavior in the time dimension.
On the other hand, in order to enhance local spatial perception, a local spatial feature F_local (emphasizing image spatial-structure information) can be extracted and then fused by weighting with the global feature F_global to obtain the final target feature F_final = αF_local + βF_global, where α and β are fusion weight parameters with α + β = 1, adjusting the contribution of the two types of features to the final recognition result, and F_final represents the final fused feature (target feature) used for classifying and recognizing the target behavior.
Specifically, in the embodiment of the present application, spatial feature extraction is performed on an image frame, and specific steps for obtaining local features are as follows.
In some embodiments of the application, extracting spatial features of an image frame to obtain local features corresponding to the image frame comprises the steps of: determining the gradient amplitude corresponding to each pixel point in the image frame, where the gradient amplitude characterizes edge intensity in the image frame; determining an edge information map corresponding to the image frame according to the gradient amplitudes of the pixel points, where the edge information map at least characterizes contour information of the target object in the image frame; extracting features of the image frame with the target network model to obtain an initial feature map; and fusing the edge information map with the initial feature map to obtain the local features corresponding to the image frame.
Specifically, the gradient magnitude of the image is first calculated using a target operator (e.g., the Sobel operator), which computes the gradients of the image in the horizontal and vertical directions by applying convolution kernels to the image. The gradient calculation formulas are as follows:
Horizontal gradient (G_x):
G_x = ∂I/∂x = I * S_x
wherein ∂I/∂x represents the gradient of the image I in the horizontal (x) direction and S_x is the horizontal Sobel kernel.
Vertical gradient (G_y):
G_y = ∂I/∂y = I * S_y
wherein ∂I/∂y represents the gradient of the image I in the vertical (y) direction and S_y is the vertical Sobel kernel.
Then, by combining the horizontal and vertical gradients, the gradient magnitude of each pixel point is calculated, which helps determine the edge intensity in the image. The calculation formula of the gradient magnitude is:
G = √(G_x² + G_y²)
where G represents the gradient magnitude, and G_x and G_y are the gradient values in the horizontal and vertical directions, respectively. According to the gradient magnitude of each pixel point, an edge information map (gradient magnitude map) corresponding to the image frame is determined; this map highlights edge features in the image, particularly regions of pronounced intensity change, which is very important for identifying contours and contact points in fall detection.
Then, the edge information map generated by the above steps may be fused, as part of the feature map, with the initial feature map extracted from the image frame to obtain the local features.
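The gradient-magnitude computation above can be sketched with numpy using the standard 3 × 3 Sobel kernels. The hand-rolled `conv2d` is a simple zero-padded stand-in for a library convolution (e.g., `cv2.Sobel`); only the edge-map step is shown, not the later fusion with the CNN feature map.

```python
import numpy as np

# Standard Sobel kernels for horizontal (S_x) and vertical (S_y) gradients.
SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float32)
SOBEL_Y = SOBEL_X.T

def conv2d(img, kernel):
    """3x3 cross-correlation on a zero-padded single-channel image."""
    padded = np.pad(img, 1)
    h, w = img.shape
    out = np.zeros((h, w), dtype=np.float32)
    for i in range(3):
        for j in range(3):
            out += kernel[i, j] * padded[i:i + h, j:j + w]
    return out

def gradient_magnitude(img):
    """Edge information map G = sqrt(G_x^2 + G_y^2) from Sobel gradients."""
    gx = conv2d(img, SOBEL_X)
    gy = conv2d(img, SOBEL_Y)
    return np.sqrt(gx ** 2 + gy ** 2)
```

On a synthetic image containing a vertical step edge, the magnitude map is zero in the flat regions and responds strongly along the edge column, which is the contour information the fall detector relies on.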
The embodiment of the application effectively integrates global temporal modeling capability with local spatial perception, and is suitable for multi-scene, multi-scale behavior detection tasks. The long-range dependencies among behaviors are modeled through the self-attention mechanism, the CNN compensates for the loss of spatial features, and optimized fusion at the feature level is realized, giving the method high engineering value and technical novelty in practical scenarios.
The following further describes a target network model and a training process thereof adopted when the image frame is subjected to feature extraction in the embodiment of the application.
In the behavior detection task, feature extraction is the core link of the whole detection process and determines the accuracy of subsequent target identification and classification. However, network models for feature extraction in the related art need to rely on a large amount of labeled data for training, and the labeling process is generally time-consuming and labor-intensive; in behavior detection in particular, behavior varies widely across scenes, so manually labeled data generalizes poorly. Therefore, the embodiment of the application provides a feature extraction method based on self-supervised learning (Self-Supervised Learning, SSL) for model training and application; behavior features can be learned automatically from unlabeled data, reducing dependence on manual labeling. The details are as follows.
In some embodiments of the application, a target network model is adopted when feature extraction is carried out on the image frames, wherein the training of the target network model comprises the following steps: acquiring a training data set comprising a plurality of unlabeled video stream data, the video stream data comprising behavior fragments of target objects under different environmental conditions; carrying out contrastive learning training on an initial network model according to the training data set using a contrastive loss function so as to adjust the model parameters of the initial network model, wherein the contrastive learning training determines the similarity between behavior features corresponding to behavior fragments in different video streams, aggregates behavior features whose similarity is greater than a first similarity threshold, and pushes apart behavior features whose similarity is less than a second similarity threshold, the first similarity threshold being greater than the second similarity threshold; and/or predicting the behavior features of a future frame from the behavior features of a plurality of historical image frames in a video stream of the training data set, and adjusting the model parameters according to the error between the behavior features of the future frame predicted by the model and the actual behavior features of that frame in the video stream; and/or taking the behavior features at different time steps as graph nodes, establishing a temporal association graph structure, and adopting a graph convolutional network to extract the associations between features so as to adjust the model parameters of the initial network model.
Specifically, a large amount of unlabeled video data is firstly collected as a training data set, and the training data set comprises behavior fragments under different indoor, outdoor, different angles and different illumination conditions. In this embodiment, data enhancement may be performed by using technologies such as inter-frame difference and optical flow analysis, so as to ensure diversity of data.
And then, the model can be subjected to contrast learning training by utilizing the data in the training data set.
Specifically, contrastive learning (Contrastive Learning) learns effective feature representations by maximizing the similarity between positive samples and minimizing the similarity between negative samples. In this embodiment, given a pair of behavior video clips x_i and x_j, they form a positive sample pair if they belong to the same behavior class, and a negative sample pair otherwise.
The contrastive loss function (Contrastive Loss) is defined as follows:
L = −log( exp(sim(z_i, z_j)/τ) / Σ_k exp(sim(z_i, z_k)/τ) )
where z_i, z_j are the feature vectors of the input video clips, sim(·) denotes the cosine similarity function, τ is a temperature coefficient that adjusts the gradient magnitude of the contrastive loss, and the summation in the denominator runs over all candidate sample pairs to ensure the discriminability of the feature distribution.
By training feature extraction with this contrastive loss, the model can autonomously construct a behavior feature representation space in which features of similar behaviors are aggregated and features of different behaviors are pushed apart; tuning the temperature parameter optimizes the contrastive loss, improving the model's discrimination of behavior features and its feature expression capability.
In this embodiment, the model may be further trained by time series modeling of future frame predictions.
In particular, behavior detection typically involves time-series data. To further optimize feature expression, embodiments of the present application may introduce a future-frame prediction task: based on an input group of video frames X_t, the model predicts the frame features at the future moment X_t+1, improving its modeling of temporal information.
The future-frame prediction task may be optimized by minimizing the following loss function:
L = ‖F(X_t) − X_t+1‖²
where F(X_t) is the next-frame feature predicted by the model, X_t+1 is the real next-frame feature, and ‖·‖₂ denotes the Euclidean distance used to measure the prediction error.
Through future-frame prediction, the model learns temporal association information, improving its understanding of behavior features; optimizing prediction with the mean-square error loss enables the model to infer future behavior more accurately.
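The prediction loss and its optimization can be sketched as follows. A single linear map W stands in for the full predictor F, which is an assumption for illustration only; one gradient-descent step on the squared Euclidean error is shown.

```python
import numpy as np

def future_frame_loss(pred, target):
    """L = ||F(X_t) - X_{t+1}||^2, the squared prediction error."""
    return float(np.sum((pred - target) ** 2))

def sgd_step(W, x_t, x_next, lr=0.01):
    """One gradient step on a linear predictor F(x) = W x (an illustrative
    stand-in for the network): dL/dW = 2 (W x - x_next) x^T."""
    grad = 2.0 * np.outer(W @ x_t - x_next, x_t)
    return W - lr * grad
```

Iterating `sgd_step` drives the prediction error down, mirroring how the real model's temporal features are shaped by minimizing the same loss.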
In addition, the model may be further trained in combination with a graph convolutional network (GCN) that models the relationships between behavior features.
In particular, since different features in behavior detection may have complex spatiotemporal correlations, embodiments of the present application may further introduce a graph convolution network (Graph Convolutional Network, GCN) to model relationships between behavior features using graph structures.
The calculation process of the GCN is shown in the following formula:
Z′ = σ(WZ + b)
where Z is the original feature representation, Z′ is the optimized feature representation, W and b are trainable parameters, and σ(·) is a nonlinear activation function (e.g., ReLU).
The behavior characteristics of different time steps are used as graph nodes, a time sequence association graph structure is established, the association between the characteristics is extracted by adopting the GCN, and the model can automatically learn the association between the different behavior characteristics, so that the expression capability of the behavior characteristics of the complex behavior mode is improved.
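A single GCN layer over the temporal association graph can be sketched as follows. Note the assumption: the formula in this section is the simplified Z′ = σ(WZ + b), while the sketch also aggregates over a symmetrically normalized adjacency with self-loops, the usual GCN formulation, so that each time-step node mixes in its neighbors' features.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def gcn_layer(A, Z, W, b):
    """One graph-convolution layer over the temporal association graph.
    A: adjacency matrix of the time-step nodes; Z: node features;
    W, b: trainable parameters. Computes sigma(A_hat Z W + b)."""
    A_hat = A + np.eye(A.shape[0])             # add self-loops
    deg = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))   # symmetric normalization
    return relu(D_inv_sqrt @ A_hat @ D_inv_sqrt @ Z @ W + b)
```

For a chain of 4 consecutive time steps with 3-dimensional features, one layer with a (3, 2) weight matrix produces a (4, 2) optimized representation in which adjacent steps share information.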
On the other hand, in practical scenarios, tasks that monitor violations or dangerous behaviors such as falling, smoking, and abnormal loitering often face a scarcity of samples. Because these behaviors are random and unpredictable, and costly to collect and annotate in a real environment, it is often difficult for a training set to obtain a sufficient number of abnormal samples. Such sample imbalance can lead the model to overfit to conventional behaviors during training, reducing recognition accuracy and robustness for abnormal behaviors.
To solve the above problems, the embodiment of the present application may introduce a generative adversarial network (Generative Adversarial Network, GAN) for data enhancement, constructing synthetic behavior samples to expand the training set. Under the adversarial training mechanism, the GAN generator learns the real sample distribution to produce highly similar "pseudo samples", while the discriminator continuously optimizes its ability to distinguish real from fake samples. Through this two-player game, the quality of the generated samples improves continuously, effectively addressing the shortage of abnormal samples and improving the generalization capability and stability of the model in actual deployment. The details are as follows.
In some embodiments of the application, after the training data set is acquired, the method further comprises the steps of: generating pseudo video stream data according to real video stream data in the training data set using the generator of the adversarial network, where the behavior categories of target objects in the pseudo video stream data include various abnormal behaviors; determining the confidence of the pseudo video stream data using the discriminator of the adversarial network, where the confidence represents the probability that the video stream data is real; and adding pseudo video stream data whose confidence is higher than a preset confidence threshold into the training data set.
Specifically, the adversarial network is composed of two parts: a Generator (G), which samples from a noise distribution to generate synthetic samples, and a Discriminator (D), which determines whether an input sample is real data or generated data.
In this embodiment, the optimization objective of the GAN is the minimax problem:
min_G max_D V(D, G) = E_{x∼P_data}[log D(x)] + E_{z∼P_z}[log(1 − D(G(z)))]
where x ∼ P_data denotes the real sample distribution, z ∼ P_z denotes the noise distribution (e.g., Gaussian or uniform), G(z) is a pseudo sample output by the generator, and D(x) is the discriminator's confidence that the input is a real sample. The goal is to minimize the loss for the generator G (so that the discriminator cannot distinguish real from fake) and to maximize it for the discriminator D (i.e., correctly separate real samples from fake ones).
In order to enhance the authenticity and usability of the behavior samples, the embodiment of the application may introduce the following strategies: 1) conditional generation (Conditional GAN), adding a behavior label y to the input noise vector z to guide the generation of specific behavior types such as smoking, falling, and running; 2) a pseudo-sample screening mechanism, filtering generated samples with the trained discriminator so that only high-confidence samples (e.g., D(G(z)) > 0.8) are kept for training; 3) mixed training with real data, combining the high-quality pseudo samples with the original real samples into an extended data set that is fed to the subsequent detection model to improve robustness.
Specifically, during the data-enhancement adversarial training phase, the discriminator may be trained with real samples and the generator with random noise; as training progresses, the generator learns a feature distribution approximating that of the abnormal behavior samples. High-confidence pseudo samples are then screened from the generator's output and mixed with the original abnormal behavior samples to form a new training set, and the abnormal behavior detection model is retrained on this enhanced data set.
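The pseudo-sample screening step (strategy 2 above) can be sketched as follows. The `toy_discriminator` is a hypothetical stand-in for the trained discriminator D, scoring samples near the real-data mean as more plausible; only the D(G(z)) > 0.8 filtering logic is taken from this section.

```python
import numpy as np

def screen_pseudo_samples(generated, discriminator, threshold=0.8):
    """Keep only generated samples whose discriminator confidence exceeds
    the threshold, per the D(G(z)) > 0.8 screening rule."""
    return [s for s in generated if discriminator(s) > threshold]

def toy_discriminator(sample, real_mean=0.0, scale=1.0):
    """Hypothetical stand-in for a trained discriminator: confidence decays
    with distance from the real-sample mean."""
    return float(np.exp(-np.abs(sample - real_mean) / scale))
```

Samples close to the real distribution survive the filter and join the extended training set; clearly unrealistic ones are discarded before retraining.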
According to the embodiment of the application, by introducing a behavior pseudo-sample generation mechanism based on a generative adversarial network, rare abnormal samples are effectively expanded, and the recognition accuracy and robustness of the detection model in small-sample scenarios are significantly improved; the method is suitable for various complex scenarios such as anomaly detection and behavior recognition, with strong adaptability and broad engineering application prospects.
In the embodiment of the application, real-time detection can be optimized based on edge computing: model inference and behavior recognition tasks are pushed down to the edge for processing, relieving cloud computing pressure, reducing network transmission delay, and improving the overall real-time responsiveness of the system. Meanwhile, a lightweight neural network structure combined with a distributed inference scheduling strategy effectively reduces the occupation of computing resources while preserving detection precision, enabling efficient, fast, and intelligent deployment of the detection system on terminal devices such as smart cameras, embedded chips, and edge servers. The details are as follows.
In some embodiments of the application, the target network model is deployed on both the edge computing device and the cloud server, and the method further comprises: determining the network environment state; determining the processing time required to detect abnormal behaviors in the video stream to be detected on the edge computing device and on the cloud server respectively under that network environment state; and performing abnormal behavior detection with the target network model deployed in the edge computing device when the first processing time corresponding to the edge computing device is smaller than the second processing time corresponding to the cloud server, where the first processing time comprises the time required for data preprocessing and model inference on the edge computing device, and the second processing time comprises the time required for data preprocessing and model inference on the cloud server plus the time required for data transmission between the cloud server and the edge computing device. When the detected behavior type is abnormal, a local alarm is raised and the alarm information is uploaded to the cloud.
Specifically, to accommodate the limited computing capability of the edge device, model pruning and quantization can reduce the size of the target network model, and inference graph optimization (Graph Fusion) and intermediate tensor reuse improve execution efficiency; the model is then converted into the edge device's deployment format to ensure smooth operation on resource-constrained terminals. The compressed target network model can then be deployed to terminal devices such as smart cameras and edge AI boxes on a construction site or in a nursing home, enabling on-site video analysis and inference.
In edge-cloud collaborative task scheduling, the system dynamically selects the edge or the cloud for inference by analyzing the current network conditions and computational load:
T_total = min(T_edge, T_cloud)
where T_edge is the first processing time and T_cloud is the second processing time; if T_edge < T_cloud, the task is processed independently by the edge computing device, otherwise the task is sent to the cloud server for processing.
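The scheduling rule above is a direct latency comparison and can be sketched as follows; the breakdown of T_edge and T_cloud into preprocessing, inference, and transmission terms follows this section, while the concrete timing values are illustrative.

```python
def choose_execution_site(t_edge_preproc, t_edge_infer,
                          t_cloud_preproc, t_cloud_infer, t_transmit):
    """Pick edge or cloud by total latency: T_total = min(T_edge, T_cloud).
    T_edge covers on-device preprocessing + inference; T_cloud additionally
    pays the transmission cost between the device and the server."""
    t_edge = t_edge_preproc + t_edge_infer
    t_cloud = t_cloud_preproc + t_cloud_infer + t_transmit
    return ("edge", t_edge) if t_edge < t_cloud else ("cloud", t_cloud)
```

For example, with a slow uplink (50 ms transmission) the edge wins even though cloud inference is faster; on a fast link (5 ms) the same task is offloaded to the cloud.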
The edge computing device processes the video stream in real time; when continuous fall behavior features are detected, an alarm (e.g., audible-visual alarm, SMS push) is immediately triggered locally without relying on cloud-side judgment. At the same time, the device can upload the relevant clips and recognition results to a management platform for further auditing or event archiving, realizing edge-cloud collaborative management.
According to the embodiment of the application, the edge computing architecture is introduced, and the lightweight model design and the dynamic scheduling mechanism are combined, so that the high-efficiency behavior detection capability on the computing power limited equipment is realized. The system not only meets the monitoring requirement of high real-time performance, but also remarkably improves the self-adaptability and response speed of the system.
In addition, potential knowledge in unlabeled data can be mined through a pseudo-label mechanism, and, combined with federated learning, model knowledge can be shared and fused among edge nodes without exposing the original data, guaranteeing privacy while improving the robustness and adaptability of the whole detection system.
In some embodiments of the application, the method further comprises: determining the behavior type detected by the target network model as a pseudo label for the corresponding video stream to be detected; adding the video stream and its pseudo label as new training data into the training data set; continuing to train the locally deployed target network model on each edge computing device with the updated training data set; uploading the model parameters updated during training on each edge computing device to the cloud server; and aggregating and fusing the updated model parameters uploaded by the different edge computing devices at the cloud server to obtain a new target network model, which is then pushed back to each edge computing device.
Specifically, in unlabeled video data, a predictive label is generated as a "pseudo label" by using the current model and incorporated into a subsequent training process to continuously enhance the adaptation of the model to the new environment. Meanwhile, in order to avoid directly transmitting video image data, the system adopts a federal learning strategy, performs local model training at different deployment points (such as different buildings, areas or terminal equipment), periodically gathers model parameters to a central server, and forms a new global model through aggregation optimization.
The federated averaging (FedAvg) update formula is as follows:
w_{t+1} = w_t − η Σ_{i=1..K} (N_i / N) ∇F_i(w_t)
where w_t represents the model parameters at iteration t, K is the number of clients participating in training, N_i is the number of samples of client i, N is the total number of samples, ∇F_i(w_t) is the local gradient of the i-th client, and η is the learning rate. The method avoids transmitting sensitive original data and only exchanges encrypted gradients or model weights, balancing data security with collaborative model training.
In this embodiment, lightweight target network models can be deployed in a number of different monitoring environments (e.g., a nursing home, a mall, an office). Each deployment terminal (edge computing device) generates pseudo labels for its unlabeled video data with the existing model and, combined with a small number of labeled samples, continues training locally; at fixed intervals, each edge node uploads its updated model weights to the central (cloud) server without transmitting the original video. The central server gathers the weight information from all nodes, aggregates it into a new global model using the FedAvg algorithm, and distributes the updated model to all devices. Each edge node then deploys and runs the new model and continues pseudo-label learning and local fine-tuning, forming a closed-loop adaptive optimization mechanism. Through this mechanism, the system adapts rapidly and optimizes continuously across different scenarios, improving detection accuracy by more than 10% on average, with notably stronger robustness under complex illumination or occlusion.
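The server-side aggregation step can be sketched as a sample-weighted average of client parameters, the weights-averaging form of FedAvg (clients send updated weights rather than raw gradients, matching the deployment described above); the client counts and parameter vectors are illustrative.

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Sample-weighted FedAvg aggregation: w = sum_i (N_i / N) * w_i,
    where N_i is client i's sample count and N is the total."""
    total = sum(client_sizes)
    return sum((n / total) * w for w, n in zip(client_weights, client_sizes))
```

A client with more local samples pulls the global model proportionally harder, so a node holding 3 of the 4 total samples contributes 75% of the average.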
To further improve the privacy protection of the system during multi-terminal collaborative learning, as an optional implementation, a differential privacy (Differential Privacy) mechanism may also be introduced: before an edge node uploads model parameters or gradients, an appropriate amount of noise is added, so that even if an attacker obtains the intermediate parameter information, the original data content cannot be inferred. This scheme strengthens the practicality of the system in scenarios with extremely high data-privacy requirements, such as government, enterprise, medical, and elderly-care settings, while preserving the detection model's capacity for continuous optimization.
Meanwhile, to address the degradation of model generalization in new environments, a rapid adaptation module based on federated transfer learning can be constructed. When the system is deployed to an entirely new scene (e.g., a new building or a new camera angle), the local terminal is allowed to obtain a transfer template from the central server, and efficient transfer can be achieved with only a small amount of target-domain data. By freezing part of the general layers and fine-tuning only the scene-specific recognition layers, the model adapts quickly to the new environment without damaging its original recognition capability, improving deployment efficiency and practicality.
In addition, to cope with real-world monitoring environments with multi-sensor fusion, as an alternative implementation, a pseudo-label generation mechanism supporting multi-modal input (images, audio, IMU sensor data, and the like) can be designed. For example, sudden sounds recognized from audio, fall impacts captured by a vibration sensor, and motion behaviors detected by the camera can jointly generate more reliable pseudo labels, improving the quality of the model's pseudo-supervised learning. This scheme can greatly improve the perception capability and stability of the model under occlusion, poor lighting, limited viewing angles, and similar conditions.
According to the embodiment of the application, an embodiment of an abnormal behavior detection device is also provided. Fig. 3 is a schematic structural diagram of an abnormal behavior detection apparatus according to an embodiment of the present application. As shown in fig. 3, the apparatus includes:
The data acquisition and feature extraction module 30 is configured to acquire a video stream to be detected, and perform time sequence feature extraction on a plurality of continuous image frames in the video stream to be detected to obtain a time sequence feature matrix, where the time sequence feature matrix includes feature vectors corresponding to the plurality of continuous image frames, and each image frame corresponds to one feature vector;
The global time sequence feature determining module 32 is configured to determine attention weights corresponding to feature vectors in the time sequence feature matrix, and determine global features corresponding to the time sequence feature matrix according to the attention weights;
The local spatial feature determining module 34 is configured to perform spatial feature extraction on the image frame to obtain a local feature corresponding to the image frame;
The abnormal behavior classification prediction module 36 is configured to fuse the global feature and the local feature to obtain a target feature, and determine a behavior class of a target object in the video stream to be detected according to the target feature.
Optionally, a target network model is adopted when feature extraction is carried out on the image frames, wherein the training of the target network model comprises the following steps: acquiring a training data set comprising a plurality of unlabeled video stream data, the video stream data comprising behavior fragments of target objects under different environmental conditions; carrying out contrastive learning training on an initial network model according to the training data set using a contrastive loss function so as to adjust the model parameters of the initial network model, wherein the contrastive learning training determines the similarity between behavior features corresponding to behavior fragments in different video streams, aggregates behavior features whose similarity is greater than a first similarity threshold, and pushes apart behavior features whose similarity is less than a second similarity threshold, the first similarity threshold being greater than the second similarity threshold; and/or predicting the behavior features of a future frame from the behavior features of a plurality of historical image frames in a video stream of the training data set, and adjusting the model parameters according to the error between the behavior features of the future frame predicted by the model and the actual behavior features of that frame in the video stream; and/or taking the behavior features at different time steps as graph nodes, establishing a temporal association graph structure, and adopting a graph convolutional network to extract the associations between features so as to adjust the model parameters of the initial network model.
Optionally, after the training data set is acquired, the abnormal behavior detection device is further used for: generating pseudo video stream data from real video stream data in the training data set by using the generator of a generative adversarial network, wherein the behavior categories of target objects in the pseudo video stream data cover various abnormal behaviors; determining the confidence of the pseudo video stream data by using the discriminator of the adversarial network, the confidence representing the probability that the video stream data is real data; and adding pseudo video stream data whose confidence is higher than a preset confidence threshold to the training data set.
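The confidence-based filtering step reduces to a simple predicate over discriminator scores. A minimal sketch, where the discriminator is a stand-in callable and the clip representation is purely illustrative:

```python
def filter_pseudo_data(pseudo_clips, discriminator, confidence_threshold=0.9):
    """Keep only generated clips whose discriminator confidence (the
    estimated probability of being real data) exceeds the threshold."""
    return [clip for clip in pseudo_clips
            if discriminator(clip) > confidence_threshold]

# Stand-in discriminator for illustration: scores a hypothetical "quality" field.
clips = [{"id": 1, "quality": 0.95}, {"id": 2, "quality": 0.5}]
kept = filter_pseudo_data(clips, discriminator=lambda c: c["quality"])
# Only clip 1 survives the 0.9 threshold.
```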
Optionally, the target network model is deployed both in edge computing equipment and in a cloud server. The abnormal behavior detection device is further used for determining the network environment state and, under that state, determining the processing time required to detect abnormal behaviors in the video stream to be detected on the edge computing equipment and on the cloud server respectively. When the first processing time corresponding to the edge computing equipment is smaller than the second processing time corresponding to the cloud server, the target network model deployed in the edge computing equipment performs the detection. The first processing time comprises the time required for data preprocessing and model inference on the edge computing equipment; the second processing time comprises the time required for data preprocessing and model inference on the cloud server plus the time required for data transmission between the cloud server and the edge computing equipment. When the detected behavior category is an abnormal behavior, a local warning is raised and the warning information is uploaded to the cloud.
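The routing decision above is a latency comparison. A minimal sketch (the millisecond units and function name are assumptions):

```python
def choose_inference_site(edge_ms, cloud_ms, transfer_ms):
    """Pick the deployment whose total latency is lower. The cloud total
    includes data transmission between the edge device and the server."""
    first_processing_time = edge_ms                  # edge preprocessing + inference
    second_processing_time = cloud_ms + transfer_ms  # cloud compute + transmission
    return "edge" if first_processing_time < second_processing_time else "cloud"

# On a slow link the edge wins even though cloud compute is faster:
site = choose_inference_site(edge_ms=40, cloud_ms=15, transfer_ms=50)  # "edge"
```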
The abnormal behavior detection device is further used for: taking the behavior categories detected by the target network model as pseudo labels for the corresponding video streams to be detected; adding those video streams and their pseudo labels to the training data set as new training data; continuing to train the locally deployed target network model on each piece of edge computing equipment with the updated training data set; uploading the model parameters updated during training on each piece of edge computing equipment to the cloud server; and having the cloud server aggregate and fuse the updated model parameters uploaded by the different edge computing equipment to obtain a new target network model, which is then pushed back to each piece of edge computing equipment.
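The cloud-side "aggregate and fuse" step resembles federated averaging. A minimal sketch: plain (optionally sample-count-weighted) averaging of parameter vectors, where the weighting scheme is an assumption since the patent does not specify the fusion rule.

```python
import numpy as np

def fuse_parameters(device_updates, sample_counts=None):
    """Average the parameter vectors uploaded by edge devices
    (a FedAvg-style aggregation; sample-count weighting is an assumption)."""
    if sample_counts is None:
        sample_counts = [1] * len(device_updates)
    total = float(sum(sample_counts))
    return sum(c * u for c, u in zip(sample_counts, device_updates)) / total

updates = [np.array([1.0, 3.0]), np.array([3.0, 5.0])]
new_params = fuse_parameters(updates)  # element-wise mean: [2.0, 4.0]
```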
Optionally, extracting spatial features from an image frame to obtain its local features comprises: determining the gradient magnitude of each pixel in the image frame, the gradient magnitude representing edge intensity; determining an edge information map for the image frame from the per-pixel gradient magnitudes, the edge information map representing at least the contour information of the target object in the frame; extracting features from the image frame with the target network model to obtain an initial feature map; and fusing the edge information map with the initial feature map to obtain the local features of the image frame.
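The gradient-magnitude edge map can be computed with Sobel kernels, one common choice; the patent does not name the operator, so Sobel is an assumption here.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def convolve3x3(img, kernel):
    """Valid-mode 3x3 convolution (no padding), for illustration only."""
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            out[i, j] = (img[i:i + 3, j:j + 3] * kernel).sum()
    return out

def edge_information_map(img):
    """Per-pixel gradient magnitude: sqrt(gx^2 + gy^2)."""
    gx = convolve3x3(img, SOBEL_X)
    gy = convolve3x3(img, SOBEL_Y)
    return np.sqrt(gx ** 2 + gy ** 2)

# A frame with a sharp vertical edge: left half dark, right half bright.
frame = np.zeros((6, 6))
frame[:, 3:] = 1.0
edges = edge_information_map(frame)  # large values along the boundary only
```

In practice a library routine (e.g. an optimized Sobel filter) would replace the explicit loops, and the resulting edge map would be fused with the network's initial feature map, for example by channel-wise concatenation.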
Optionally, after the video stream to be detected is acquired, the data acquisition and feature extraction module 30 is further configured to resize each image frame to a target size, where the target size is the input size supported by the network model used for feature extraction, and to normalize the pixel values of each point in the resized frame, the normalization scaling the pixel values into a preset range.
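A minimal preprocessing sketch: nearest-neighbor resizing and [0, 1] scaling are assumptions, as the patent specifies only resizing to the supported input size and normalizing into a preset range.

```python
import numpy as np

def preprocess_frame(frame, target_size=(224, 224)):
    """Nearest-neighbor resize to the model's supported input size, then
    scale 8-bit pixel values into [0, 1] (the preset range assumed here)."""
    h, w = frame.shape[:2]
    row_idx = np.arange(target_size[0]) * h // target_size[0]
    col_idx = np.arange(target_size[1]) * w // target_size[1]
    resized = frame[row_idx][:, col_idx]
    return resized.astype(np.float64) / 255.0

frame = np.arange(100, dtype=np.uint8).reshape(10, 10)
out = preprocess_frame(frame, target_size=(4, 4))  # shape (4, 4), values in [0, 1]
```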
The respective modules in the abnormal behavior detection apparatus may be program modules (for example, a set of program instructions realizing a specific function) or hardware modules. In the latter case each module may be, but is not limited to, a single processor, or the functions of several modules may be realized by one processor.
It should be noted that the abnormal behavior detection apparatus provided in this embodiment may be used to execute the abnormal behavior detection method shown in fig. 2, so the explanation of that method also applies to this embodiment and is not repeated here.
The embodiment of the application also provides a non-volatile storage medium comprising a stored computer program. By running the computer program, the equipment hosting the non-volatile storage medium executes the abnormal behavior detection method: acquiring a video stream to be detected; performing time sequence feature extraction on a plurality of continuous image frames in the video stream to obtain a time sequence feature matrix, where the matrix comprises the feature vectors of the continuous image frames and each frame corresponds to one feature vector; determining the attention weight of each feature vector in the time sequence feature matrix and determining the global feature of the matrix according to those weights; performing spatial feature extraction on the image frames to obtain their local features; fusing the global and local features to obtain a target feature; and determining the behavior category of the target object in the video stream according to the target feature.
The embodiment of the application also provides a computer program product comprising a computer program which, when executed by a processor, realizes the steps of the abnormal behavior detection method of each embodiment of the application, namely: acquiring a video stream to be detected; performing time sequence feature extraction on a plurality of continuous image frames in the video stream to obtain a time sequence feature matrix, where the matrix comprises the feature vectors of the continuous image frames and each frame corresponds to one feature vector; determining the attention weights of the feature vectors in the time sequence feature matrix and determining the global feature of the matrix according to those weights; performing spatial feature extraction on the image frames to obtain their local features; fusing the global and local features to obtain a target feature; and determining the behavior category of the target object in the video stream according to the target feature.
The foregoing embodiment numbers of the present application are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
In the foregoing embodiments of the present application, each embodiment emphasizes different aspects; for details not described in a given embodiment, refer to the related descriptions of the other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed technology may be implemented in other manners. The apparatus embodiments described above are merely exemplary: the division into units may be a division by logical function, and other divisions are possible in actual implementation; for example, several units or components may be combined or integrated into another system, or some features may be omitted or not performed. The couplings, direct couplings, or communication connections shown or discussed may be realized through interfaces, units, or modules, and may be electrical or take other forms.
The units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented as software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, in the part contributing to the prior art, or in whole or in part, may be embodied in the form of a software product stored in a storage medium and including instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods of the embodiments of the present application. The storage medium includes any medium that can store program code, such as a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
The foregoing are merely preferred embodiments of the present application. It should be noted that those skilled in the art may make modifications and adaptations without departing from the principles of the present application, and such modifications and adaptations shall also fall within the scope of protection of the present application.
Claims (11)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202510519111.1A CN120220247A (en) | 2025-04-23 | 2025-04-23 | Abnormal behavior detection method, device, electronic device and non-volatile storage medium |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202510519111.1A CN120220247A (en) | 2025-04-23 | 2025-04-23 | Abnormal behavior detection method, device, electronic device and non-volatile storage medium |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN120220247A true CN120220247A (en) | 2025-06-27 |
Family
ID=96104468
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202510519111.1A Pending CN120220247A (en) | 2025-04-23 | 2025-04-23 | Abnormal behavior detection method, device, electronic device and non-volatile storage medium |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN120220247A (en) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN119763006A (en) * | 2024-11-21 | 2025-04-04 | 中国船舶集团有限公司第七〇九研究所 | Abnormal behavior detection method and device for video and electronic equipment |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN110807385B (en) | Target detection method, target detection device, electronic equipment and storage medium | |
| CN108921051B (en) | Pedestrian attribute identification network and technology based on cyclic neural network attention model | |
| Othman et al. | A new IoT combined body detection of people by using computer vision for security application | |
| CN105426820B (en) | More people's anomaly detection methods based on safety monitoring video data | |
| CN112446244B (en) | Human motion recognition method, neural network training method and related devices and equipment | |
| EP4035070B1 (en) | Method and server for facilitating improved training of a supervised machine learning process | |
| CN117749836B (en) | Internet of things terminal monitoring method and system based on artificial intelligence | |
| CN107808139A (en) | A kind of real-time monitoring threat analysis method and system based on deep learning | |
| CN102811343A (en) | An Intelligent Video Surveillance System Based on Behavior Recognition | |
| CN107194396A (en) | Method for early warning is recognized based on the specific architecture against regulations in land resources video monitoring system | |
| CN115346143A (en) | Behavior detection method, electronic device, and computer-readable medium | |
| US20230386185A1 (en) | Statistical model-based false detection removal algorithm from images | |
| CN116419059A (en) | Automatic monitoring method, device, equipment and medium based on behavior label | |
| CN119672613B (en) | A surveillance video information intelligent processing system based on cloud computing | |
| CN117275156A (en) | Unattended shared chess and card room reservation system | |
| CN120220247A (en) | Abnormal behavior detection method, device, electronic device and non-volatile storage medium | |
| CN118366224A (en) | Human body key point estimation-based human behavior intention detection method and equipment | |
| CN113642403B (en) | Intelligent security detection system for crowd movement based on edge computing | |
| CN117197726B (en) | Important personnel accurate management and control system and method | |
| EP4283529B1 (en) | Method for training an object recognition model in a computing device | |
| CN114445766B (en) | A method, device and robot for detecting and managing human traffic | |
| CN113920470B (en) | A Pedestrian Retrieval Method Based on Self-Attention Mechanism | |
| CN119580168A (en) | A police management system and method based on image recognition technology | |
| CN118397506A (en) | Building intelligent comprehensive security monitoring method and system | |
| Martínez-Sala et al. | Resource-efficient fog computing vision system for occupancy monitoring: A real-world deployment in university libraries |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||