
WO2022160591A1 - Crowd behavior detection method and apparatus, and electronic device, storage medium and computer program product

Info

Publication number
WO2022160591A1
Authority
WO
WIPO (PCT)
Prior art keywords
objects
target image
change information
image sequence
image
Application number
PCT/CN2021/103579
Other languages
French (fr)
Chinese (zh)
Inventor
韩志伟
刘诗男
杨昆霖
侯军
伊帅
Original Assignee
北京市商汤科技开发有限公司
Application filed by 北京市商汤科技开发有限公司 (Beijing SenseTime Technology Development Co., Ltd.)
Priority to KR1020237016722A (published as KR20230090344A)
Publication of WO2022160591A1

Classifications

    • G06V 40/20 - Recognition of movements or behaviour in image or video data, e.g. gesture recognition
    • G06N 3/084 - Neural network learning methods: backpropagation, e.g. using gradient descent
    • G06T 7/246 - Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T 7/277 - Analysis of motion involving stochastic approaches, e.g. using Kalman filters
    • G06T 7/70 - Determining position or orientation of objects or cameras
    • G06V 10/25 - Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V 10/44 - Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components
    • G06V 10/62 - Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; pattern tracking
    • G06V 10/761 - Proximity, similarity or dissimilarity measures
    • G06V 10/774 - Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V 20/49 - Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • G06V 20/53 - Recognition of crowd images, e.g. recognition of crowd congestion
    • G06V 20/70 - Labelling scene content, e.g. deriving syntactic or semantic representations
    • G06T 2207/10016 - Image acquisition modality: video; image sequence
    • G06T 2207/20081 - Special algorithmic details: training; learning
    • G06T 2207/20084 - Special algorithmic details: artificial neural networks [ANN]
    • G06T 2207/30196 - Subject of image: human being; person

Definitions

  • the present application relates to computer technology, and in particular to a crowd behavior detection method and device, electronic equipment, storage medium and computer program product.
  • a target image sequence (video sequence) including pedestrians can be captured by an image capturing device (e.g., a monitoring device). If it is determined that the pedestrian behavior occurring in the target image sequence belongs to abnormal behaviors such as pedestrian gathering or pedestrian staying, crowd evacuation can be arranged immediately to avoid events such as stampedes or group violence. It can be seen that there is an urgent need for a method for detecting crowd behavior in target image sequences.
  • the present application discloses at least one method for detecting crowd behavior.
  • the method includes: performing object tracking on at least one object appearing in a target image sequence including multiple objects, and determining the position change information of each object in the target image sequence; and performing graph convolution processing based on the position change information obtained in the target image sequence, and determining the crowd behaviors corresponding to the multiple objects in the target image sequence based on the extracted features obtained by the graph convolution.
  • performing object tracking on at least one object appearing in a target image sequence including multiple objects, and determining the position change information of each object in the target image sequence, includes: performing image processing on each image included in the target image sequence to determine the position information of each object in the corresponding image; and performing object tracking on each object to determine, based on the tracking result and the position information, the position change information of each object in the target image sequence.
  • the above performing object tracking on each object includes: using a Kalman filtering algorithm or an object detection model to perform object tracking on each object; and determining the position change information of each object based on the tracked position information of the same object in the corresponding images.
  • performing graph convolution processing based on the position change information obtained in the target image sequence to obtain the crowd behaviors corresponding to the multiple objects in the target image sequence includes: performing spatial graph convolution processing on the at least one image respectively, based on the object position information in the at least one image included in the target image sequence represented by the position change information and the connection relationship between the objects in the at least one image, to obtain graph features corresponding to the at least one image; and performing time-domain convolution processing on the graph features corresponding to the at least one image, and determining the crowd behaviors corresponding to the multiple objects in the target image sequence based on the extracted features obtained by the time-domain convolution processing.
  • the above crowd behavior includes at least one of the following: pedestrian gathering; pedestrian dispersion; pedestrian staying; pedestrian reverse flow.
  • performing spatial graph convolution processing on the at least one image respectively, based on the object position information in the at least one image included in the target image sequence represented by the position change information and the connection relationship between objects in the at least one image, to obtain the graph features corresponding to the at least one image, includes: determining an adjacency matrix corresponding to the at least one image based on the connection relationship between objects in the at least one image; determining a feature matrix corresponding to the at least one image based on the object position information; and completing the spatial graph convolution processing based on the adjacency matrix and the feature matrix to obtain the graph feature corresponding to each image.
  • before the step of performing graph convolution processing based on the position change information obtained in the target image sequence to obtain the extracted features corresponding to the target image sequence, the method further includes: determining the connection relationship between any two objects contained in at least one of the images included in the target image sequence.
  • determining the connection relationship between any two objects included in at least one image included in the target image sequence includes: extracting the image feature corresponding to the region in the image where each of the at least one object is located, where the image feature represents the image information of the location of the object; determining, based on the image features corresponding to the at least one object, the similarity between any two objects in the at least one object; and determining the two objects whose similarity reaches a first preset threshold as two objects having a connection relationship.
  • determining the connection relationship between any two objects included in at least one image included in the target image sequence includes: performing image processing on the at least one image respectively, and determining the position information of each object in the at least one image; determining the distance between any two objects in the at least one object based on the position information corresponding to the at least one object; and determining the connection relationship between any two objects included in the at least one image based on the distance.
  • determining the connection relationship between any two objects included in the at least one image based on the distance includes: mapping the determined distance between any two objects into the interval formed by a third preset threshold and a fourth preset threshold; determining the mapped distance between any two objects as the connection weight between the two objects; and indicating the connection relationship between the two objects by the connection weight between them.
  • the graph convolution processing is implemented by a graph convolution classification model; wherein the training method of the graph convolution classification model includes: generating a training sample, where the training sample contains the position change information of multiple objects and annotation information of the crowd behavior represented by the position change information of the multiple objects; and training a preset graph convolution model based on the position change information and the crowd behavior annotation information to obtain the graph convolution classification model.
  • the above generating a training sample includes: setting motion patterns corresponding to multiple objects based on a motion simulation platform; determining the position change information corresponding to at least one object based on the motion patterns; determining the crowd behavior represented by the position change information corresponding to the at least one object; and generating the training sample based on the position change information and the crowd behavior represented by it.
  • the present application further discloses a crowd behavior detection device, the device including: a position change information determination module, configured to determine, based on the object tracking result of at least one object appearing in a target image sequence including multiple objects, the position change information of each object in the target image sequence; and a crowd behavior detection module, configured to perform graph convolution processing based on the position change information obtained in the target image sequence, and to determine the crowd behaviors corresponding to the multiple objects in the target image sequence based on the extracted features obtained by the graph convolution.
  • the present application also discloses an electronic device, the device includes: a processor; a memory for storing executable instructions of the processor; wherein the processor is configured to call the executable instructions stored in the memory to implement the aforementioned crowd behavior Detection method.
  • the present application also discloses a computer-readable storage medium, where the storage medium stores a computer program, and the computer program is used to execute the foregoing crowd behavior detection method.
  • the present application also discloses a computer program product, which, when the computer program product runs on a computer, enables the computer to execute the aforementioned method for detecting crowd behavior.
  • in the above solution, the position change information of the objects in the target image sequence is determined by performing object tracking on the objects appearing in the target image sequence.
  • graph convolution processing is then performed based on the position change information to obtain the extracted features corresponding to the target image sequence, and the crowd behaviors corresponding to the multiple objects in the target image sequence are determined based on the extracted features.
  • in this way, the principle of graph convolution is used to determine, from the target image sequence, extracted features that are beneficial to the detection of crowd behavior, so as to realize accurate detection of the crowd behavior represented by the target image sequence.
  • FIG. 1 is a flowchart of a target image sequence classification method shown in this application;
  • FIG. 2 is a schematic flowchart of crowd behavior detection shown in this application;
  • FIG. 3 is a flowchart of a method for determining a connection relationship between objects in an image shown in this application;
  • FIG. 4 is a schematic diagram of a graph convolution processing flow shown in this application;
  • FIG. 5 is a schematic diagram of a classification flow shown in this application;
  • FIG. 6 is a schematic diagram of a video sequence classification flow shown in this application;
  • FIG. 7 is a flowchart of a model training method shown in this application;
  • FIG. 8 is a schematic structural diagram of a crowd behavior detection device shown in this application;
  • FIG. 9 is a schematic diagram of a hardware structure of an electronic device shown in this application.
  • This application aims to propose a crowd behavior detection method (hereinafter referred to as detection method).
  • the detection method utilizes the principle of graph convolution to obtain, from the target image sequence and based on the position change information corresponding to each object appearing in it, extracted features useful for determining the crowd behaviors corresponding to the multiple objects in the target image sequence. The method can then perform classification based on the extracted features, so as to determine the crowd behaviors corresponding to the objects in the target image sequence.
  • the above-mentioned target image sequence may be a video sequence collected by monitoring; the above-mentioned object may be a pedestrian appearing in the above-mentioned target image sequence.
  • the above types of crowd behavior may include pedestrian gathering, pedestrian staying, and pedestrian dispersion.
  • the principle of graph convolution can be used to determine the extracted features that can be beneficial for determining crowd behavior based on the position change information of pedestrians in the video.
  • classification is performed based on the above-mentioned extracted features, so as to determine the crowd behavior that is occurring in the video sequence, and make corresponding arrangements according to the determined crowd behavior to reduce the probability of occurrence of dangerous events.
  • FIG. 1 is a flowchart of a method for classifying a target image sequence shown in this application.
  • the above method may include:
  • S102 Perform object tracking on at least one object appearing in a target image sequence including multiple objects, and determine position change information of each object in the target image sequence.
  • S104 Perform graph convolution processing based on the position change information obtained in the target image sequence, and determine crowd behaviors corresponding to the plurality of objects in the target image sequence based on the extracted features obtained by the graph convolution.
  • the above classification method can be applied to electronic equipment.
  • the above-mentioned electronic device may execute the above-mentioned classification method by carrying a software system corresponding to the classification method.
  • the above electronic devices may be notebook computers, desktop computers, servers, mobile phones, tablet (PAD) terminals, etc.
  • the present application does not specifically limit the specific types of the above electronic devices.
  • the above classification method can be executed only by the terminal device or the server device alone, or can be executed by the terminal device and the server device in cooperation.
  • the above classification method can be integrated in the client.
  • after receiving a classification request, the terminal device equipped with the client can provide computing power through its own hardware environment to execute the above classification method.
  • the above classification method can be integrated into the system platform.
  • the server device equipped with the system platform can provide computing power through its own hardware environment to execute the above classification method.
  • the above classification method can be divided into two tasks: acquiring the target image sequence and classifying the target image sequence.
  • the acquisition task can be integrated in the client and carried on the terminal device.
  • the classification task can be integrated on the server and carried on the server device.
  • the terminal device may initiate a classification request to the server device after acquiring the target image sequence.
  • the above-mentioned server device may execute the above-mentioned classification method on the above-mentioned target image sequence in response to the above-mentioned request.
  • the following description takes an electronic device (hereinafter referred to as the device) as the execution subject as an example.
  • FIG. 2 is a schematic diagram of a flow of crowd behavior detection shown in the present application.
  • the target image sequence may be acquired first.
  • the above target image sequence refers to an image sequence containing multiple pedestrian objects and requiring crowd behavior detection.
  • the target image sequence may include multiple frames of images.
  • the target image sequence may include a video sequence or a multi-frame discrete image sequence.
  • the above-mentioned video sequence includes N frames of consecutive images containing multiple objects; the above-mentioned N is a positive integer.
  • when acquiring the target image sequence, the device may interact with the user to complete the input of the target image sequence. For example, the device may provide, through the interface it carries, a window for inputting the target image sequence to be processed, so that the user can complete the input of the target image sequence based on this window.
  • the above-mentioned device may also be connected with an image acquisition device (eg, video surveillance) deployed on site, so as to acquire the target image sequence acquired by the above-mentioned image acquisition device from the above-mentioned image acquisition device.
  • S102 may be continued to perform object tracking on at least one object appearing in the target image sequence including multiple objects, to determine the position change information of each object in the target image sequence.
  • the above-mentioned object tracking specifically refers to tracking the same object appearing in each frame of images.
  • the same object appearing in each frame of images is determined to complete the object tracking.
  • the above object tracking is pedestrian tracking, which can be achieved by determining the same pedestrian appearing in each frame of image.
  • the above position change information may specifically indicate the movement track information of the object in the target image sequence. For example, in a specific scene, pedestrian tracking can be performed on pedestrians, and the position information of the same pedestrian in each frame of image can be determined, thereby determining the movement trajectory of the pedestrian in the image sequence. It can be understood that the above position change information can represent the object position information and time-domain information of the object in each image.
  • the above-mentioned object position information may represent object coordinates.
  • the above-mentioned time domain information can represent the time information corresponding to each position of the object.
  • the acquired target image sequence may be input into the object tracking unit to perform the above S102.
  • the above object tracking unit may execute S1022 through instructions executable by the device: an object position prediction model is used to perform position prediction processing on each of the above images, so as to determine the position information of each object in each image.
  • the above-mentioned object position prediction model includes a model trained based on several training samples marked with object position information.
  • the above-mentioned object position prediction model may be a neural network model constructed based on a deep convolutional network.
  • supervised training of the position prediction model can be performed using training samples marked with object position information until the model converges.
  • the object tracking unit may execute S1024 to perform object tracking on the object based on the location information, and determine the location change information of the object in the target image sequence.
  • the method of object tracking is not particularly limited in this application, and two object tracking methods are schematically given below.
  • Method 1 When performing S1024, a Kalman filter algorithm may be used to perform object tracking on each of the above objects, and to determine the position change information of each of the above objects.
  • in the acquisition order of the above images, starting from the first frame, two adjacent frames of images are successively taken as the current two frames, and the following steps are performed: the Kalman filtering algorithm is used to determine the position information corresponding to each object contained in the current two frames; then, through the Hungarian matching algorithm (a bipartite graph matching algorithm), the position information corresponding to each object contained in the first of the current two frames is matched against the position information corresponding to each object contained in the second of the current two frames.
  • when matching, the distance between the position information corresponding to each object included in the first image and the position information corresponding to each object included in the second image may be calculated. If a calculated distance is less than a preset standard threshold, the two pieces of position information corresponding to that distance can be determined to be a match.
  • the two objects corresponding to the matched position information may then be determined to be the same object appearing in the current two frames of images, thereby implementing object tracking on the objects.
  • the position change information of the object is determined based on the tracked position information of the same object in each image.
  • the same object appearing in each of the above images can be determined, so that the same object can be tracked in each image.
  • the position change information of the object in the above target image sequence can be determined based on the position information of the object in each image.
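  • to make the matching step of Method 1 concrete, the following is a minimal sketch assuming 2-D centroid positions per object; the Kalman prediction of per-object positions is elided, the function name and distance threshold are illustrative, and scipy's linear_sum_assignment stands in for the Hungarian matching algorithm.

```python
# Sketch of Method 1's matching step: associate the objects of two
# adjacent frames by pairwise distance, gated by a preset standard
# threshold. Positions would come from the Kalman filtering step.
import numpy as np
from scipy.optimize import linear_sum_assignment  # Hungarian algorithm

def match_frames(prev_positions, curr_positions, dist_threshold=50.0):
    """Return pairs (i, j): object i in the first frame and detection j
    in the second frame deemed to be the same object."""
    # Pairwise Euclidean distances between the two frames' positions.
    cost = np.linalg.norm(
        prev_positions[:, None, :] - curr_positions[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)
    # Keep only matches whose distance is below the preset threshold.
    return [(i, j) for i, j in zip(rows, cols) if cost[i, j] < dist_threshold]

prev_positions = np.array([[10.0, 12.0], [100.0, 40.0]])
curr_positions = np.array([[101.0, 42.0], [12.0, 13.0]])
print(match_frames(prev_positions, curr_positions))  # [(0, 1), (1, 0)]
```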
  • Method 2 When performing S1024, the same object appearing in each of the above images may be determined based on the object detection model, so as to implement object tracking for each of the above objects.
  • the above-mentioned object detection models include models constructed based on deep learning networks.
  • the above-mentioned object detection model may specifically be a pre-trained semantic detection model (eg, models such as fast-rcnn, faster-rcnn, mask-rcnn, etc.).
  • the object feature corresponding to the pedestrian object included in the image can be detected through the detection model.
  • the aforementioned object features may be human face features. After the object features included in each image are detected, similarity calculation may be performed on the object features included in two different frames of images, and objects whose similarity reaches a second standard threshold are determined to be the same object.
  • the above object may be a pedestrian.
  • the face contained in each image can be detected by the above-mentioned object detection model.
  • similarity calculation may be performed on the face features included in two different frames of images, and faces whose similarity reaches the second standard threshold are determined to be the same face. After the same face is determined, it can be determined that the same pedestrian appears in the two frames of images.
  • the position change information of each object can be determined based on the tracked position information of the same object in each image.
  • the above position change information corresponding to each object may be stored in the form of a three-dimensional matrix (T*H*W).
  • the number of channels of the three-dimensional matrix may be the number of image frames included in the target image sequence; the elements of the three-dimensional matrix may be the position coordinates of the object in the image corresponding to the channel serial number. It can be understood that, at this time, the above-mentioned three-dimensional matrix can be determined as the feature matrix corresponding to the above-mentioned target image sequence.
  • the above-mentioned position change information has time-domain characteristics, and can indicate the change of the position coordinates during the movement of the object within the time-domain range shown by the above-mentioned target image sequence.
  • the motion characteristics of each object can be determined, that is, whether each object is gradually aggregated or gradually dispersed. Therefore, it is feasible to perform crowd behavior detection based on the location change information.
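  • as an illustration only (the text does not prescribe a concrete layout beyond the T*H*W matrix above), the position change information can be held as a frames-by-objects-by-coordinates array; the shapes and names below are assumptions.

```python
# One plausible in-memory layout for the position change information:
# T frames (the channel axis), N tracked objects, (x, y) per object.
import numpy as np

T, N = 8, 5
trajectories = np.zeros((T, N, 2))   # position of each object in each frame
trajectories[0, 0] = (12.0, 34.0)    # e.g., object 0's coordinates in frame 0
# Per-object displacement between consecutive frames, i.e. the time-domain
# signal that the later graph convolution consumes.
displacements = np.diff(trajectories, axis=0)  # shape (T-1, N, 2)
```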
  • S1042 may be performed first, and graph convolution processing is performed based on the above-mentioned position change information obtained in the above-mentioned target image sequence, so as to obtain the extracted features corresponding to the above-mentioned target image sequence.
  • the above extracted features specifically include a feature matrix or feature vector determined by performing graph convolution processing (including spatial graph convolution and time-domain graph convolution). It can be understood that the extracted features are determined based on the position change information of multiple pedestrian objects in the target image sequence, so the extracted features are beneficial for determining crowd behavior.
  • before the graph convolution processing, the connection relationship between objects in each image included in the target image sequence may be determined.
  • connection relationships determined by using different connection relationship determination rules have different meanings.
  • the connection relationship determined by the similarity between the image features corresponding to the regions where the two objects are located in the image can represent the degree of association between the two objects from the perspective of similarity.
  • the connection relationship determined by the distance between two objects can represent the degree of association between the two objects from the perspective of distance.
  • the image features corresponding to the regions where the objects included in the above images are located in the images may be extracted.
  • the above-mentioned image features represent the image information of the position of each object. If the image features of the two objects are relatively similar, it can be shown that the positions of the two objects are very similar, that is, the distances between the two objects are relatively close and have a connection relationship.
  • the similarity between any two objects in each object can be determined based on the image features corresponding to each object.
  • two objects corresponding to a degree of similarity that reaches a first preset threshold may be determined as two objects having a connection relationship.
  • the above-mentioned first preset threshold includes a threshold set according to experience. The above-mentioned first preset threshold is not particularly limited in this application.
  • the present application also does not specifically limit the method for calculating the similarity.
  • the above-mentioned method for calculating the similarity may be methods such as Euclidean distance, cosine distance, Mahalanobis distance, and the like.
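  • a minimal sketch of this similarity rule follows, assuming each object already has a feature vector extracted from its image region; cosine similarity and the threshold value are illustrative choices among the measures listed above.

```python
# Sketch: link two objects when the similarity of their region features
# reaches the first preset threshold (cosine similarity shown here).
import numpy as np

def connect_by_similarity(features, threshold=0.8):
    """features: (N, D) array, one image-feature vector per object.
    Returns an (N, N) 0/1 matrix of connection relationships."""
    normed = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    sim = normed @ normed.T              # cosine similarity between objects
    adj = (sim >= threshold).astype(float)
    np.fill_diagonal(adj, 0.0)           # no self-connections
    return adj
```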
  • the connection relationship between the objects may be determined based on the distance between the objects.
  • FIG. 3 is a schematic flowchart of a method for determining a connection relationship shown in the present application.
  • S302 may be executed to perform image processing on each of the above-mentioned images, and to determine the position information of the above-mentioned object in each of the images.
  • S304 may be executed to determine the distance between any two objects in each object based on the position information corresponding to each object.
  • the connection relationship between any two objects included in each image may be determined based on the above distance.
  • two objects corresponding to a distance that does not reach a second preset threshold may be determined as two objects having a connection relationship.
  • the above-mentioned second preset threshold includes a threshold set according to experience. The above-mentioned second preset threshold is not particularly limited in this application.
  • if the distance between two objects does not reach the second preset threshold, the connection weight between the two objects is set to 1; otherwise, the connection weight between the two objects is set to 0.
  • when the connection relationship is determined by the distance between objects, the spatiotemporal graph determined based on the connection relationship can indicate the distance relationship between objects, and the extracted features determined after the graph convolution operation on the spatiotemporal graph can also include the distance information between objects. Therefore, when classifying crowd behaviors in the target image sequence based on the extracted features, the classification accuracy for behaviors such as pedestrian gathering, pedestrian dispersion or pedestrian staying can be improved.
  • connection weight between two objects can be determined according to the true distance between the two objects.
  • the determined distance between any two objects may be mapped into an interval formed by the third preset threshold and the fourth preset threshold.
  • the third preset threshold and the fourth preset threshold are empirical thresholds. In some examples, the third preset threshold is 0, and the fourth preset threshold is 1.
  • the mapped distance between any two objects can be determined as the connection weight between the two objects, and the connection relationship between the two objects is indicated by this connection weight.
  • the above space-time map can indicate the distance information that is closer to the actual, thereby further improving the classification accuracy.
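  • the following sketch illustrates the weighted variant, assuming the third and fourth preset thresholds are 0 and 1 as in the example above; min-max scaling is an assumed form of the mapping, and whether closer pairs should instead receive larger weights is left open by the text.

```python
# Sketch: map pairwise distances into [lo, hi] and use the mapped
# distance as the connection weight between each pair of objects.
import numpy as np

def distance_weights(positions, lo=0.0, hi=1.0):
    """positions: (N, 2) object coordinates in one image."""
    d = np.linalg.norm(positions[:, None] - positions[None, :], axis=-1)
    if d.max() == 0.0:                   # degenerate: all objects coincide
        return np.zeros_like(d)
    # Min-max map the distances into the [lo, hi] interval.
    w = lo + (d - d.min()) / (d.max() - d.min()) * (hi - lo)
    np.fill_diagonal(w, 0.0)
    return w
```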
  • after the connection relationship between any two objects included in each image of the target image sequence is determined, S104 may be continued.
  • the above S1042 can be implemented by a graph convolution model.
  • the above graph convolution model may be a model constructed based on a spatiotemporal graph convolution processing network.
  • the above spatiotemporal graph convolution network at least includes a spatial graph convolution network (GCN) for performing spatial graph convolution processing on each frame of image, and a temporal convolutional network (TCN) for performing time-domain convolution on the graph features corresponding to each frame of image.
  • FIG. 4 is a schematic diagram of a graph convolution processing flow shown in the present application.
  • the above position change information may be input into the GCN included in the graph convolution model to execute S402: based on the object position information in each image included in the target image sequence represented by the position change information, and the connection relationship between the objects in each image, spatial graph convolution processing is performed on each image respectively to obtain the graph features corresponding to each image.
  • an adjacency matrix corresponding to each image may be determined based on the connection relationship between objects in each image.
  • a topology map corresponding to each image may be generated first.
  • each object in each image can be used as the vertex V of the topology graph, and the edge E can be determined according to the connection relationship between the objects to obtain the topological graph corresponding to each image.
  • the adjacency matrix A corresponding to each of the above images can be determined based on the topological graph corresponding to that image, and the feature matrix X_0 corresponding to each image can be determined based on the above object position information.
  • the above-mentioned spatial graph convolution processing may be completed based on the above-mentioned adjacency matrix and the above-mentioned characteristic matrix, so as to obtain the graph features corresponding to each of the above-mentioned images.
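  • a minimal sketch of one spatial graph convolution layer follows; the symmetric normalization D^(-1/2)(A+I)D^(-1/2) and the ReLU are assumptions (the common GCN propagation rule), since the text only specifies that the layer consumes the adjacency matrix A and the feature matrix X_0.

```python
# Sketch of one spatial graph convolution (GCN) layer over a single frame.
import numpy as np

def spatial_gcn_layer(A, X, W):
    """A: (N, N) adjacency matrix, X: (N, D_in) feature matrix
    (e.g., object coordinates), W: (D_in, D_out) learned weights."""
    A_hat = A + np.eye(A.shape[0])           # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    # Normalized propagation followed by ReLU: the per-frame graph feature.
    return np.maximum(d_inv_sqrt @ A_hat @ d_inv_sqrt @ X @ W, 0.0)
```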
  • the graph features corresponding to the above images can then be input into the TCN included in the graph convolution model to execute S404: time-domain convolution processing is performed on the graph features corresponding to the images to obtain the extracted features corresponding to the target image sequence.
  • the graph features corresponding to each of the above images may be sorted according to the time-domain information represented by the position change information. Then, based on a preset one-dimensional convolution kernel, one-dimensional convolution processing is performed on the sorted graph features to obtain the extracted features corresponding to the target image sequence.
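  • the sketch below shows this time-domain step under the assumption that each frame's graph feature has been pooled to a single D-dimensional vector; the kernel size and the "valid" convolution are illustrative.

```python
# Sketch: slide a preset 1-D kernel over the time-ordered graph features.
import numpy as np

def temporal_conv(graph_feats, kernel):
    """graph_feats: (T, D), one graph feature per frame, sorted by the
    time-domain information; kernel: (K,) shared across feature dims."""
    T, D = graph_feats.shape
    K = len(kernel)
    out = np.empty((T - K + 1, D))
    for t in range(T - K + 1):
        # Weighted sum over a K-frame temporal window.
        out[t] = kernel @ graph_feats[t:t + K]
    return out  # the extracted features for the target image sequence
```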
  • S1044 may be continued to determine crowd behaviors corresponding to the above-mentioned objects in the above-mentioned target image sequence based on the above-mentioned extracted features.
  • the above-mentioned extracted features may be input into a pre-trained multi-classifier for classification, so as to obtain the above-mentioned crowd behavior.
  • FIG. 5 is a schematic diagram of a classification flow shown in this application.
  • the above-mentioned multi-classifier includes a downsampling unit and a fully connected layer.
  • the above-mentioned down-sampling unit may be used to process the extracted features to obtain corresponding feature vectors.
  • the above-mentioned down-sampling unit may be an average pooling unit.
  • the above-mentioned fully-connected layer is used to classify based on the above-mentioned feature vector, and obtain a confidence score corresponding to each preset classification type.
  • the above extracted features may be input into the down-sampling unit to execute S502, where the extracted features are average-pooled to obtain the corresponding feature vectors.
  • the feature vector can be input into the fully connected layer to execute S504, and the feature vector is fully connected to obtain the confidence score corresponding to each preset classification type.
  • the crowd behavior type corresponding to the maximum confidence score can be determined as the crowd behavior corresponding to the plurality of objects in the target image sequence.
  • the above crowd behavior includes at least one of the following: pedestrian gathering; pedestrian dispersion; pedestrian staying; pedestrian reverse flow.
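  • putting S502-S504 together, the following is a minimal sketch of the multi-classifier head, assuming the extracted features arrive as a (T', D) matrix; the weight shapes are illustrative.

```python
# Sketch of the multi-classifier: average pooling, a fully connected
# layer, then the class with the maximum confidence score.
import numpy as np

BEHAVIORS = ["pedestrian gathering", "pedestrian dispersion",
             "pedestrian staying", "pedestrian reverse flow"]

def classify(extracted, W, b):
    """extracted: (T', D) features from the temporal convolution;
    W: (D, 4) fully connected weights, b: (4,) bias."""
    pooled = extracted.mean(axis=0)   # S502: average-pooling / downsampling
    scores = pooled @ W + b           # S504: confidence per preset class
    return BEHAVIORS[int(np.argmax(scores))]
```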
  • the position change information of the above-mentioned object in the above-mentioned target image sequence is determined by performing object tracking on the object appearing in the above-mentioned target image sequence. Then, graph convolution processing is performed based on the above position change information to obtain extracted features corresponding to the above target image sequence, and based on the above extracted features, crowd behaviors corresponding to the plurality of above objects in the above target image sequence are determined. In this way, the principle of graph convolution is used to determine the extraction features that are beneficial to the detection of crowd behavior from the target image sequence, so as to realize the accurate detection of crowd behavior represented by the target image sequence.
  • Embodiments are described below in combination with security scenarios.
  • monitoring equipment is usually deployed in the above security scenarios.
  • the monitoring equipment typically captures video sequences. It can be understood that, in the security scenario, it is actually the video sequences collected by the monitoring equipment that are classified.
  • FIG. 6 is a schematic diagram of a video sequence classification flow shown in the present application.
  • S602 may be performed based on the coordinate determination unit to perform image processing on each image included in the target video sequence, to determine the position information of pedestrians appearing in the video in each image.
  • S604 may be executed based on the pedestrian tracking unit, and based on the location information, object tracking is performed on the pedestrian to determine the location change information of the pedestrian in the target image sequence.
  • S606 may be performed based on the graph convolution model included in the graph convolution classification model: graph convolution processing is performed based on the position change information to obtain the extracted features corresponding to the target image sequence.
  • the above graph convolution classification model may specifically be a classification model constructed based on a graph convolution model and a multi-classification model.
  • a graph convolution operation can be performed on the spatiotemporal graph to determine the extracted features corresponding to the spatiotemporal graph; classification is then performed based on the extracted features to determine the classification type of the video sequence.
  • S608 may be performed based on the multi-classification model included in the graph convolution classification model to determine crowd behaviors corresponding to the objects in the target image sequence based on the extraction features.
  • in this way, the principle of graph convolution is first used to determine, based on the position change information of the pedestrians in the video sequence, extracted features that can reflect the distance change information of each pedestrian in the video sequence. The classification type of the video sequence is then determined based on the extracted features, so as to determine the pedestrian behavior occurring in the video sequence and make corresponding arrangements according to the determined pedestrian behavior, reducing the probability of occurrence of dangerous events.
  • the above graph convolution classification model can be used to implement the above graph convolution processing.
  • the graph convolutional classification models described above may include graph convolutional models as well as multi-classification models.
  • the above-mentioned graph convolution model may use the position change information of each object in the target image sequence as input to perform graph convolution processing, and obtain the extracted features corresponding to the above-mentioned target image sequence.
  • the above-mentioned multi-classification model may take the above-mentioned extracted features as input, and perform classification processing on the above-mentioned extracted features, so as to obtain the crowd behavior represented by the above-mentioned target image sequence.
  • the training of the graph convolution classification model is actually a process of determining the model parameters included in the above graph convolution model and the above multi-classification model.
  • a model training method is proposed in this application.
  • the method trains the graph convolution classification model by constructing virtual training samples, so that model training can also be achieved in the absence of real samples.
  • FIG. 7 is a method flowchart of a model training method shown in this application.
  • the above training method includes: S702, generating a training sample, where the training sample contains the position change information of multiple objects and annotation information of the crowd behavior represented by the position change information of the multiple objects.
  • S7022 may be executed first, and based on the motion simulation platform, the motion mode corresponding to the object appearing in the video is set.
  • the above-mentioned motion simulation platform is specifically any platform that can perform motion simulation.
  • the motion simulation platform described above may be a game development platform.
  • the above-mentioned movement mode may include speed and movement direction.
  • the coordinates of the objects in each frame of images included in the above video can be determined, so as to determine the position change information of each object in the above video.
  • the crowd behavior represented by the above video can thus be obtained. For example, in a security scene, when the motion patterns of the pedestrians all head toward the same location, the crowd behavior represented by the video can be determined to be pedestrian gathering; otherwise, the crowd behavior represented by the video can be determined to be pedestrian dispersion.
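  • as a stand-in for a real motion simulation platform, the sketch below synthesizes labeled trajectories; the convergence-to-a-point dynamics, speeds and scene size are all illustrative assumptions.

```python
# Sketch: generate virtual trajectories whose motion patterns head toward
# a common location ("pedestrian gathering") or away from it
# ("pedestrian dispersion"), together with the crowd behavior label.
import numpy as np

def make_sample(n_objects=10, n_frames=20, gather=True, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    center = np.array([50.0, 50.0])
    pos = rng.uniform(0.0, 100.0, size=(n_objects, 2))
    frames = []
    for _ in range(n_frames):
        step = (center - pos) if gather else (pos - center)
        step /= np.linalg.norm(step, axis=1, keepdims=True) + 1e-8
        pos = pos + step                  # unit-speed motion per frame
        frames.append(pos.copy())
    label = "pedestrian gathering" if gather else "pedestrian dispersion"
    return np.stack(frames), label        # (T, N, 2) positions + label
```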
  • S7024 may be executed to determine the position change information corresponding to each object based on the above motion mode, and determine the crowd behavior represented by the position change information corresponding to each object.
  • the above-mentioned crowd behaviors may include pedestrian gathering, pedestrian dispersion, and pedestrian retention.
  • S7026 may be executed to generate the training sample based on the location change information and the crowd behavior represented by the location change information.
  • the position change information and the above classification types may be encoded by means of one-hot encoding, so as to obtain several training samples.
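  • a one-hot encoding of the crowd behavior label could look like the following sketch, shown here for the label side only; the label set is taken from the behaviors above and the helper itself is illustrative.

```python
# Sketch: one-hot encode a crowd behavior label for training.
import numpy as np

BEHAVIORS = ["pedestrian gathering", "pedestrian dispersion",
             "pedestrian staying"]

def one_hot(label):
    v = np.zeros(len(BEHAVIORS))
    v[BEHAVIORS.index(label)] = 1.0
    return v

print(one_hot("pedestrian dispersion"))  # [0. 1. 0.]
```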
  • the present application does not limit the specific manner of the above encoding.
  • S704 may be continued, and the above-mentioned graph convolution classification model is trained based on the preset loss information and the above-mentioned training samples, until the model converges.
  • the above-mentioned preset loss information may be loss information set according to experience.
  • the above graph convolution classification model (hereinafter referred to as the model) may be supervised training based on the above training samples.
  • forward propagation can be performed to obtain the computational results output by the model.
  • the error between the real classification type and the above calculation result can be evaluated based on the above preset loss information.
  • the stochastic gradient descent method can be used to determine the descending gradient.
  • the model parameters corresponding to the model can be updated based on backpropagation.
  • the above process can then be repeated until the model converges.
  • the condition for the convergence of the above model may be, for example, that the preset number of training times is reached, or the variation of the error obtained after M consecutive forward propagations is less than a certain threshold.
  • the present application does not specifically limit the conditions for model convergence.
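  • the loop below sketches this training procedure end to end; a linear softmax classifier stands in for the graph convolution classification model, full-batch gradient descent stands in for the stochastic variant, and all hyperparameters are illustrative.

```python
# Sketch of the training loop: forward propagation, loss against the
# labeled crowd behavior, gradient descent via backpropagation, repeated
# until a convergence condition (max epochs, or the change in error over
# M consecutive passes falling below a threshold).
import numpy as np

def train(X, y, n_classes, lr=0.1, max_epochs=500, tol=1e-6, M=5):
    """X: (B, D) sample features, y: (B,) integer class labels."""
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.01, size=(X.shape[1], n_classes))
    losses = []
    for _ in range(max_epochs):
        logits = X @ W                                   # forward propagation
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)                # softmax
        loss = -np.log(p[np.arange(len(y)), y]).mean()   # cross-entropy error
        grad = p.copy()
        grad[np.arange(len(y)), y] -= 1.0                # d(loss)/d(logits)
        W -= lr * X.T @ grad / len(y)                    # gradient descent step
        losses.append(loss)
        # Converged: the error barely changed over the last M passes.
        if len(losses) > M and abs(losses[-M] - losses[-1]) < tol:
            break
    return W
```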
  • since constructed virtual training samples are used to train the graph convolution classification model, the training process does not need to rely on real training samples.
  • in some embodiments, the object position prediction model for determining object positions, the object tracking model for object tracking, and the graph convolution classification model for graph convolution processing and classification may also be jointly trained.
  • videos representing pedestrian gathering and pedestrian dispersion can be constructed through a motion simulation platform, and crowd behavior annotations can be performed on the constructed videos to obtain training samples.
  • the training samples can be input into the above-mentioned object position prediction model to obtain the first calculation result.
  • the above-mentioned first calculation result is input into the above-mentioned object tracking model to obtain the second calculation result.
  • the above-mentioned second calculation result is input into the above-mentioned graph convolution classification model to obtain the crowd behavior detection result for the video representation.
  • according to the label information corresponding to the above constructed videos, the parameter update of each model can be completed by using the back-propagation method.
  • joint training of each model can be realized to improve training efficiency.
  • the present application further provides a crowd behavior detection device.
  • FIG. 8 is a schematic structural diagram of a crowd behavior detection apparatus shown in the present application.
  • the above apparatus 80 includes: a position change information determination module 81, configured to determine, based on the object tracking result of at least one object appearing in a target image sequence including multiple objects, the position change information of each object in the target image sequence; and a crowd behavior detection module 82, configured to perform graph convolution processing based on the position change information obtained in the target image sequence, and to determine the crowd behaviors corresponding to the multiple objects in the target image sequence based on the extracted features obtained by the graph convolution.
  • the above position change information determination module 81 is specifically configured to: perform image processing on each image included in the target image sequence respectively, to determine the position information of each object in each image; and perform object tracking on each object, so as to determine, based on the tracking result and the position information, the position change information of each object in the target image sequence.
  • the above position change information determination module 81 is specifically configured to: use a Kalman filtering algorithm or an object detection model to perform object tracking on each object; and determine the position change information of each object based on the tracked position information of the same object in the corresponding images.
  • the above crowd behavior detection module 82 includes: a spatial graph convolution module, configured to perform spatial graph convolution processing on each image respectively, based on the object position information in each image included in the target image sequence represented by the position change information and the connection relationship between the objects in each image, to obtain the graph features corresponding to each image; and a time-domain convolution module, configured to perform time-domain convolution processing on the graph features corresponding to each image, and to determine the crowd behaviors corresponding to the multiple objects in the target image sequence based on the extracted features obtained by the time-domain convolution processing.
  • the crowd behaviors include at least one of the following: pedestrian gathering; pedestrian dispersion; pedestrian staying; pedestrian reverse flow.
  • the above spatial graph convolution module is specifically configured to: determine the adjacency matrix corresponding to each image based on the connection relationship between the objects in each image; determine the feature matrix corresponding to each image based on the object position information; and complete the spatial graph convolution processing based on the adjacency matrix and the feature matrix to obtain the graph feature corresponding to each image.
  • the above-mentioned apparatus 80 further includes: a connection relationship determination module, configured to determine the connection relationship between any two objects included in each image included in the above-mentioned target image sequence.
  • the connection relationship determination module is specifically configured to: extract the image features corresponding to the regions in the image where the objects are located, the image features representing the image information of the locations of the objects; determine, based on the image features corresponding to each object, the similarity between any two objects; and determine the two objects whose similarity reaches the first preset threshold as two objects having a connection relationship.
  • the connection relationship determination module is specifically configured to: perform image processing on each image respectively, to determine the position information of each object in each image; determine the distance between any two objects based on the position information corresponding to each object; and determine the connection relationship between any two objects included in each image based on the distance.
  • the above connection relationship determination module is specifically configured to: map the determined distance between any two objects into the interval formed by the third preset threshold and the fourth preset threshold; determine the mapped distance between any two objects as the connection weight between the two objects; and indicate the connection relationship between the two objects by the connection weight between them.
  • the graph convolution processing is implemented by a graph convolution classification model, and the training apparatus for the graph convolution classification model includes: a generating module, configured to generate training samples, where the training samples contain position change information of multiple objects and annotation information of the crowd behavior represented by that position change information; and a training module, configured to train a preset graph convolution model based on the position change information and the crowd behavior annotation information to obtain the graph convolution classification model.
  • the generating module is specifically configured to: set motion patterns corresponding to multiple objects based on a motion simulation platform; determine the position change information corresponding to each object based on the motion patterns; determine the crowd behavior represented by the position change information; and generate the training samples based on the position change information and the crowd behavior it represents.
  • the embodiments of the crowd behavior detection apparatus shown in this application can be applied to electronic devices. Accordingly, the present application discloses an electronic device, which may include: a processor, and a memory for storing instructions executable by the processor, where the processor is configured to invoke the executable instructions stored in the memory to implement the crowd behavior detection method shown in any of the above embodiments.
  • FIG. 9 is a schematic diagram of a hardware structure of an electronic device shown in this application.
  • the electronic device may include a processor for executing instructions, a network interface for network connection, a memory for storing operating data for the processor, and a non-volatile memory for storing instructions corresponding to the crowd behavior detection apparatus.
  • the embodiments of the foregoing apparatus may be implemented by software, by hardware, or by a combination of software and hardware.
  • taking software implementation as an example, an apparatus in the logical sense is formed by the processor of the electronic device in which it is located reading the corresponding computer program instructions from the non-volatile memory into the memory and running them.
  • in addition to the components described above, the electronic device in which the apparatus is located may also include other hardware according to the actual functions of the electronic device; no further details are given here. It can be understood that, in order to improve the processing speed, the instructions corresponding to the crowd behavior detection apparatus may also be stored directly in the memory, which is not limited herein.
  • This application proposes a computer-readable storage medium, which may be a volatile storage medium or a non-volatile storage medium. The storage medium stores a computer program, and the computer program is used to execute the crowd behavior detection method shown in any of the foregoing embodiments.
  • one or more embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, one or more embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, one or more embodiments of the present application may take the form of a computer program product implemented on one or more computer-usable storage media (which may include, but are not limited to, disk storage, CD-ROM, and optical storage) having computer-usable program code embodied therein.
  • Embodiments of the subject matter and functional operations described in this application can be implemented in digital electronic circuitry, in tangibly embodied computer software or firmware, in computer hardware (which can include the structures disclosed in this application and their structural equivalents), or in a combination of one or more of these.
  • Embodiments of the subject matter described in this application may be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non-transitory program carrier, for execution by, or to control the operation of, a data processing apparatus. Alternatively or additionally, the program instructions may be encoded in an artificially generated propagated signal, such as a machine-generated electrical, optical, or electromagnetic signal, which is generated to encode information for transmission to a suitable receiver apparatus for execution by a data processing apparatus.
  • the computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of these.
  • the processes and logic flows described in this application can be performed by one or more programmable computers executing one or more computer programs to perform corresponding functions by operating on input data and generating output.
  • the processes and logic flows described above can also be performed by, and apparatus can also be implemented as, special-purpose logic circuitry, e.g., an FPGA (Field Programmable Gate Array) or an ASIC (Application Specific Integrated Circuit).
  • a computer suitable for the execution of a computer program may include, for example, general-purpose and/or special-purpose microprocessors, or any other type of central processing unit.
  • the central processing unit will receive instructions and data from read only memory and/or random access memory.
  • the basic components of a computer may include a central processing unit for implementing or executing instructions and one or more memory devices for storing instructions and data.
  • generally, a computer will also include one or more mass storage devices for storing data, such as magnetic disks, magneto-optical disks, or optical disks, or will be operatively coupled to such devices to receive data from them, transfer data to them, or both. However, a computer need not have such devices.
  • moreover, a computer may be embedded in another device, such as a mobile phone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a global positioning system (GPS) receiver, or a portable storage device such as a universal serial bus (USB) flash drive, to name a few.
  • Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including, for example, semiconductor memory devices (e.g., EPROM, EEPROM, and flash memory devices), magnetic disks (e.g., internal hard disks or removable disks), magneto-optical disks, and CD-ROM and DVD-ROM disks.
  • the processor and memory may be supplemented by or incorporated in special purpose logic circuitry.

Abstract

Provided are a crowd behavior detection method and apparatus, and an electronic device, a storage medium and a computer program product. The method may comprise: performing object tracking on at least one object appearing in a target image sequence that includes a plurality of objects, and determining location change information of each object in the target image sequence; and performing graph convolution processing on the basis of the location change information obtained from the target image sequence, and determining, on the basis of the extracted features obtained by means of the graph convolution processing, crowd behaviors corresponding to the plurality of objects in the target image sequence.

Description

Crowd behavior detection method and apparatus, electronic device, storage medium and computer program product
CROSS-REFERENCE TO RELATED APPLICATIONS
This patent application claims priority to the Chinese patent application filed on January 26, 2021, with application number 2021101062857 and invention title "Crowd Behavior Detection Method and Apparatus, Electronic Device and Storage Medium", which is incorporated herein by reference.
Technical Field
The present application relates to computer technology, and in particular to a crowd behavior detection method and apparatus, an electronic device, a storage medium, and a computer program product.
Background
With the advancement of urbanization, crowds are becoming more and more concentrated, so identifying whether abnormal behavior occurs in a crowd, and what that abnormal behavior is, is very important for pedestrian safety. If abnormal crowd behavior can be accurately identified, stopped, and prevented, the probability of dangerous events can be reduced.
For example, in a security scenario, a target image sequence (a video sequence) containing pedestrians can be captured by an image acquisition device (e.g., a monitoring device). If it is determined that the pedestrian behavior occurring in the target image sequence is an abnormal behavior such as pedestrian gathering or pedestrian retention, crowd evacuation can be arranged immediately to avoid events such as stampedes or malicious group incidents. There is thus an urgent need for a method of detecting crowd behavior in a target image sequence.
Summary of the Invention
In view of this, the present application discloses at least a crowd behavior detection method. The method includes: performing object tracking on at least one object appearing in a target image sequence containing multiple objects, and determining position change information of each object in the target image sequence; and performing graph convolution processing based on the position change information obtained from the target image sequence, and determining, based on the extracted features obtained by the graph convolution, crowd behaviors corresponding to the multiple objects in the target image sequence.
In some of the illustrated embodiments, performing object tracking on at least one object appearing in a target image sequence containing multiple objects and determining the position change information of each object in the target image sequence includes: performing image processing on each image included in the target image sequence to determine the position information of each object in the corresponding image; and performing object tracking on each object, so as to determine the position change information of each object in the target image sequence based on the tracking results and the position information.
In some of the illustrated embodiments, performing object tracking on each object, so as to determine the position change information of each object in the target image sequence based on the tracking results and the position information, includes: using a Kalman filter algorithm or an object detection model to perform object tracking on each object; and determining the position change information of each object based on the tracked position information of the same object in the corresponding images.
In some of the illustrated embodiments, performing graph convolution processing based on the position change information obtained from the target image sequence to obtain the crowd behaviors corresponding to the multiple objects in the target image sequence includes: performing spatial graph convolution processing on at least one image included in the target image sequence, based on the object position information in the at least one image represented by the position change information and the connection relationships between objects in the at least one image, to obtain graph features corresponding to the at least one image; and performing temporal convolution processing on the graph features corresponding to the at least one image, and determining, based on the extracted features obtained by the temporal convolution processing, the crowd behaviors corresponding to the multiple objects in the target image sequence; the crowd behaviors include at least one of the following: pedestrian gathering, pedestrian dispersion, pedestrian retention, and pedestrian counterflow.
In some of the illustrated embodiments, performing spatial graph convolution processing on the at least one image, based on the object position information represented by the position change information and the connection relationships between the objects, to obtain the corresponding graph features, includes: determining the adjacency matrix corresponding to the at least one image based on the connection relationships between the objects in the at least one image; determining the feature matrix corresponding to the at least one image based on the object position information; and completing the spatial graph convolution processing based on the adjacency matrix and the feature matrix to obtain the graph feature corresponding to each image.
In some of the illustrated embodiments, before the step of performing graph convolution processing based on the position change information obtained from the target image sequence to obtain the extracted features corresponding to the target image sequence, the method further includes: determining the connection relationship between any two objects included in at least one image of the target image sequence.
In some of the illustrated embodiments, determining the connection relationship between any two objects included in at least one image of the target image sequence includes: extracting the image features corresponding to the regions in which the objects included in the at least one image are located, where the image features represent the image information of the locations of the objects; determining, based on the image features corresponding to the objects, the similarity between any two objects; and determining two objects whose similarity does not reach the first preset threshold as two objects having a connection relationship.
In some of the illustrated embodiments, determining the connection relationship between any two objects included in at least one image of the target image sequence includes: performing image processing on the at least one image to determine the position information of the objects in the at least one image; determining the distance between any two objects based on the position information corresponding to the objects; and determining the connection relationship between any two objects included in the at least one image based on the distance.
In some of the illustrated embodiments, determining the connection relationship between any two objects included in the at least one image based on the distance includes: mapping the determined distance between any two objects into the interval formed by the third preset threshold and the fourth preset threshold; determining the mapped distance between the two objects as the connection weight between them; and indicating the connection relationship between the two objects by this connection weight.
In some of the illustrated embodiments, the graph convolution processing is implemented by a graph convolution classification model, and the training method for the graph convolution classification model includes: generating training samples, where the training samples contain position change information of multiple objects and annotation information of the crowd behavior represented by the position change information; and training a preset graph convolution model based on the position change information and the crowd behavior annotation information to obtain the graph convolution classification model.
In some of the illustrated embodiments, generating the training samples includes: setting motion patterns corresponding to multiple objects based on a motion simulation platform; determining the position change information corresponding to at least one object based on the motion patterns; determining the crowd behavior represented by the position change information; and generating the training samples based on the position change information and the crowd behavior it represents.
The present application further discloses a crowd behavior detection apparatus, including: a position change information determination module, configured to determine the position change information of each object in a target image sequence based on the object tracking results of at least one object appearing in the target image sequence containing multiple objects; and a crowd behavior detection module, configured to perform graph convolution processing based on the position change information obtained from the target image sequence and determine, based on the extracted features obtained by the graph convolution, the crowd behaviors corresponding to the multiple objects in the target image sequence.
The present application further discloses an electronic device, including: a processor, and a memory for storing instructions executable by the processor, where the processor is configured to invoke the executable instructions stored in the memory to implement the aforementioned crowd behavior detection method.
The present application further discloses a computer-readable storage medium, where the storage medium stores a computer program used to execute the aforementioned crowd behavior detection method.
The present application further discloses a computer program product which, when run on a computer, causes the computer to execute the aforementioned crowd behavior detection method.
In the present application, object tracking is performed on the objects appearing in a target image sequence to determine the position change information of the objects in the target image sequence. Graph convolution processing is then performed based on the position change information to obtain extracted features corresponding to the target image sequence, and the crowd behaviors corresponding to the multiple objects in the target image sequence are determined based on the extracted features. In this way, the principle of graph convolution is used to determine, from the target image sequence, extracted features that are useful for detecting crowd behavior, thereby achieving accurate detection of the crowd behavior represented by the target image sequence.
Description of the Drawings
FIG. 1 is a flowchart of a target image sequence classification method shown in this application;
FIG. 2 is a schematic flowchart of crowd behavior detection shown in this application;
FIG. 3 is a flowchart of a method for determining connection relationships between objects in an image shown in this application;
FIG. 4 is a schematic flowchart of graph convolution processing shown in this application;
FIG. 5 is a schematic flowchart of classification shown in this application;
FIG. 6 is a schematic flowchart of video sequence classification shown in this application;
FIG. 7 is a flowchart of a model training method shown in this application;
FIG. 8 is a schematic structural diagram of a crowd behavior detection apparatus shown in this application;
FIG. 9 is a schematic diagram of a hardware structure of an electronic device shown in this application.
Detailed Description
Exemplary embodiments will be described in detail below, examples of which are illustrated in the accompanying drawings. Where the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with this application; rather, they are merely examples of apparatus and methods consistent with some aspects of the present application as recited in the appended claims.
The terminology used in this application is for the purpose of describing particular embodiments only and is not intended to limit the application. As used in this application and the appended claims, the singular forms "a", "the above", and "the" are intended to include the plural forms as well, unless the context clearly dictates otherwise. It should be understood that the term "and/or" as used herein refers to and includes any and all possible combinations of one or more of the associated listed items. It should also be understood that the word "if", as used herein, can be interpreted as "at the time of", "when", or "in response to determining", depending on the context.
This application aims to propose a crowd behavior detection method (hereinafter referred to as the detection method). The method uses the principle of graph convolution to obtain, from the target image sequence and based on the position change information corresponding to each object appearing in it, extracted features that are useful for determining the crowd behaviors corresponding to the multiple objects in the target image sequence. The method can then classify based on the extracted features, thereby determining the crowd behaviors corresponding to the multiple objects in the target image sequence.
For example, in a particular scenario, the target image sequence may be a video sequence captured by surveillance, and the objects may be pedestrians appearing in the target image sequence. The types of crowd behavior may include pedestrian gathering, pedestrian retention, pedestrian dispersion, and so on. Using the above method, the principle of graph convolution can be used to determine, based on the position change information of pedestrians in the video, extracted features that are useful for determining crowd behavior. Classification is then performed based on the extracted features, so as to determine the crowd behavior occurring in the video sequence and make corresponding arrangements according to the determined behavior, reducing the probability of dangerous events.
Please refer to FIG. 1, which is a flowchart of a target image sequence classification method shown in this application.
As shown in FIG. 1, the method may include:
S102: performing object tracking on at least one object appearing in a target image sequence containing multiple objects, and determining the position change information of each object in the target image sequence.
S104: performing graph convolution processing based on the position change information obtained from the target image sequence, and determining, based on the extracted features obtained by the graph convolution, the crowd behaviors corresponding to the multiple objects in the target image sequence.
The above classification method can be applied to an electronic device, which may execute the method by running a corresponding software system. The electronic device may be a notebook computer, a computer, a server, a mobile phone, a PAD terminal, and so on; the present application does not specifically limit its type.
It can be understood that the classification method may be executed by a terminal device or a server device alone, or by a terminal device and a server device in cooperation.
For example, the classification method may be integrated in a client. After receiving a classification request, the terminal device running the client can execute the method using the computing power of its own hardware environment.
For another example, the classification method may be integrated in a system platform. After receiving a classification request, the server device running the system platform can execute the method using the computing power of its own hardware environment.
As yet another example, the classification method may be divided into two tasks: acquiring the target image sequence, and classifying the target image sequence. The acquisition task may be integrated in a client and run on a terminal device, while the classification task may be integrated on a server and run on a server device. The terminal device may initiate a classification request to the server device after acquiring the target image sequence, and the server device, in response to the request, executes the classification method on the target image sequence. In the following, an electronic device (hereinafter referred to as the device) is taken as the execution subject for description.
Please continue to refer to FIG. 2, which is a schematic flowchart of crowd behavior detection shown in the present application.
Before performing the process shown in FIG. 2, the target image sequence may be acquired. The target image sequence refers to an image sequence that contains multiple pedestrian objects and requires crowd behavior detection; it may include multiple frames of images.
In some examples, the target image sequence may include a video sequence or a sequence of multiple discrete images. The video sequence includes N consecutive frames containing multiple objects, where N is a positive integer.
In some examples, when acquiring the target image sequence, the device may complete the input of the sequence by interacting with the user. For example, the device may provide, through its interface, a window for the user to input the target image sequence to be processed, and the user can complete the input through that window.
In some examples, the device may also be connected to an image acquisition device deployed on site (e.g., video surveillance), so as to obtain from it the target image sequence it has captured.
After acquiring the target image sequence, S102 may be performed: object tracking is performed on at least one object appearing in the target image sequence containing multiple objects, and the position change information of each object in the sequence is determined. Object tracking here specifically refers to tracking the same object appearing across the frames; tracking is completed once the same object appearing in each frame is determined. For example, in a security scenario, object tracking is pedestrian tracking, which can be achieved by determining the same pedestrian appearing in each image.
The position change information specifically indicates the trajectory of an object in the target image sequence. For example, in a particular scenario, pedestrian tracking can determine the position information of the same pedestrian in each frame, and thereby the pedestrian's trajectory in the image sequence. It can be understood that the position change information can represent both the object position information in each image (i.e., the object coordinates) and the temporal information, i.e., the time corresponding to each position of the object.
Please continue to refer to FIG. 2. In this application, the acquired target image sequence may be input to an object tracking unit to perform S102.
The object tracking unit may, through device-executable instructions, perform S1022: using an object position prediction model, position prediction processing is performed on each image to determine the position information of each object in each image. The object position prediction model is a model trained on a number of training samples annotated with object position information.
It can be understood that the object position prediction model may be a neural network model built on a deep convolutional network. Before the model is used for position prediction, it can be trained in a supervised manner with training samples annotated with object position information until the model converges.
After the position information is determined, the object tracking unit may perform S1024: based on the position information, object tracking is performed on the objects to determine their position change information in the target image sequence.
The method of object tracking is not particularly limited in this application; two object tracking methods are given below for illustration.
Method 1: when performing S1024, a Kalman filter algorithm may be used to track each object and determine its position change information.
In some examples, following the capture order of the images and starting from the first frame, every two adjacent frames are in turn taken as the current two frames, and the following steps are performed: the Kalman filter algorithm is used to determine the position information of each object contained in the current two frames; the Hungarian matching algorithm (a bipartite graph matching algorithm) is then used to match the position information of the objects contained in the first of the two frames against the position information of the objects contained in the second.
In the matching operation, the distances between the position information of the objects in the first image and the position information of the objects in the second image can be calculated. If a calculated distance is less than a preset standard threshold, the two pieces of position information corresponding to that distance are determined to be a matched pair.
After the matching operation, the two objects corresponding to a matched pair of position information can be determined to be the same object appearing in the current two frames, thereby achieving object tracking.
After the above steps have been performed for all adjacent frames, the position change information of each object is determined based on the tracked position information of the same object in each image.
In this way, the same object appearing in each image can be determined, so that the same object is tracked across the images. Once the object is tracked, its position change information in the target image sequence can be determined from its position information in each image.
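As an illustrative sketch only (not the claimed implementation), the matching step of Method 1 could look as follows, assuming per-frame detections are given as 2-D coordinates and that the Kalman prediction step has already produced the positions for the first frame; the helper name and the default value of `std_threshold` (standing in for the preset standard threshold) are assumptions:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_adjacent_frames(prev_positions, curr_positions, std_threshold=50.0):
    """Hungarian (bipartite) matching of objects between two adjacent frames.

    prev_positions: (M, 2) object coordinates in the first frame.
    curr_positions: (N, 2) object coordinates in the second frame.
    Returns (prev_idx, curr_idx) pairs whose distance is below the threshold,
    i.e. detections treated as the same object appearing in both frames.
    """
    # Pairwise Euclidean distances between all objects of the two frames.
    cost = np.linalg.norm(
        prev_positions[:, None, :] - curr_positions[None, :, :], axis=-1)
    row_idx, col_idx = linear_sum_assignment(cost)  # minimum-cost matching
    # Keep only matches whose distance stays below the preset standard threshold.
    return [(r, c) for r, c in zip(row_idx, col_idx) if cost[r, c] < std_threshold]
```

Repeating this over every pair of adjacent frames chains the matched indices into per-object trajectories.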
Method 2: when performing S1024, the same object appearing in each image may be determined based on an object detection model, so as to track each object.
The object detection model is a model built on a deep learning network, for example a pre-trained semantic detection model (e.g., fast-rcnn, faster-rcnn, or mask-rcnn). The detection model can detect the object features corresponding to the pedestrian objects included in an image. In some examples, the object features may be face features. After the object features in each image are detected, similarity can be computed between the object features contained in two different frames, and objects whose similarity reaches a second standard threshold are determined to be the same object.
For example, in a security scenario, the objects may be pedestrians. The object detection model can then detect the faces contained in each image. After the faces are detected, similarity is computed between the face features contained in two different frames, and faces whose similarity reaches the second standard threshold are determined to be the same face; determining the same face in turn determines that the same pedestrian appears in the two frames.
After the same object appearing in each frame is determined, the position change information of each object can be determined based on the tracked position information of the same object in each image.
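A minimal sketch of the similarity test in Method 2, assuming the detection model has already produced per-detection appearance features (e.g., face features); the function name and the default value of `sim_threshold` (standing in for the second standard threshold) are illustrative assumptions:

```python
import numpy as np

def match_by_appearance(feats_a, feats_b, sim_threshold=0.8):
    """Treat two detections from different frames as the same object when the
    cosine similarity of their appearance features reaches the threshold.

    feats_a: (M, D) features from one frame; feats_b: (N, D) from another.
    Returns a list of (i, j) index pairs judged to be the same object.
    """
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    sim = a @ b.T  # (M, N) pairwise cosine similarities
    return [(i, j) for i in range(sim.shape[0])
            for j in range(sim.shape[1]) if sim[i, j] >= sim_threshold]
```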
In some examples, after the position change information corresponding to the objects is determined, it may be stored in the form of a three-dimensional matrix (T*H*W), where the number of channels of the matrix is the number of image frames in the target image sequence, and the elements of the matrix are the position coordinates of the objects in the image with the corresponding channel index. It can be understood that this three-dimensional matrix can then be taken as the feature matrix corresponding to the target image sequence.
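One plausible concrete layout for this storage, given only as a hedged sketch: with N tracked objects carrying 2-D coordinates, the T*H*W matrix described above becomes a (T, N, 2) array whose channel t holds every object's coordinates in frame t; the helper name is hypothetical:

```python
import numpy as np

def build_trajectory_tensor(per_frame_positions):
    """Stack per-frame, per-object coordinates into one 3-D array.

    per_frame_positions: list of T arrays, each (N, 2), where row i of every
    array refers to the same tracked object i (identities already aligned).
    Returns a (T, N, 2) float array holding the position change information.
    """
    return np.stack(per_frame_positions, axis=0).astype(np.float32)
```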
It can be understood that the position change information has a temporal character: it indicates how the position coordinates of an object change as it moves within the time range covered by the target image sequence. Based on the position change information of each object appearing in the sequence, the motion characteristics of the objects can be determined, i.e., whether the objects are gradually gathering or gradually dispersing. It is therefore feasible to perform crowd behavior detection based on this position change information.
Please continue to refer to FIG. 2. After the position change information is determined, S104 may be performed: graph convolution processing is carried out based on the position change information obtained from the target image sequence, and the crowd behaviors corresponding to the multiple objects in the sequence are determined based on the extracted features obtained by the graph convolution.
First, S1042 may be performed: graph convolution processing is carried out based on the position change information obtained from the target image sequence, yielding the extracted features corresponding to the sequence.
The extracted features specifically include the feature matrices or feature vectors determined by the graph convolution processing (including spatial graph convolution and temporal graph convolution). It can be understood that, since the extracted features are determined based on the position change information of multiple pedestrian objects in the target image sequence, they are useful for determining crowd behavior.
In some examples, before performing S1042, the connection relationships between objects in each image of the target image sequence may be determined.
It can be understood that connection relationships determined with different rules have different meanings. For example, a connection relationship determined by the similarity between the image features of the regions in which two objects are located characterizes the degree of association between the two objects from the perspective of similarity; a connection relationship determined by the distance between two objects characterizes their degree of association from the perspective of distance.
In some examples, the image features corresponding to the regions in which the objects are located can be extracted from each image. These image features represent the image information of each object's location. If the image features of two objects are similar, the locations of the two objects are similar, i.e., the two objects are close to each other and have a connection relationship.
Afterwards, the similarity between any two objects can be determined based on the image features corresponding to the objects.
In some examples, two objects whose similarity reaches a first preset threshold can be determined as two objects having a connection relationship, where the first preset threshold is a threshold set according to experience and is not particularly limited in this application.
It should be noted that this application also does not specifically limit the method for calculating similarity; for example, Euclidean distance, cosine distance, or Mahalanobis distance may be used. In some examples, in order to improve the classification accuracy for the target image sequence, the connection relationships between objects may be determined based on the distances between them.
Please refer to FIG. 3, which is a schematic flowchart of a connection relationship determination method shown in this application. As shown in FIG. 3, S302 may be performed: image processing is carried out on each image to determine the position information of the objects in each image. Then S304 may be performed: the distance between any two objects is determined based on the position information corresponding to the objects.
After the distance between any two objects is determined, the connection relationship between any two objects included in each image can be determined based on that distance. In some examples, two objects whose distance does not reach a second preset threshold are determined as two objects having a connection relationship, where the second preset threshold is a threshold set according to experience and is not particularly limited in this application.
In some examples, if two objects are determined to have a connection relationship, the connection weight between them is set to 1; otherwise it is set to 0.
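A minimal sketch of this thresholding, assuming 2-D object coordinates and Euclidean distance; `dist_threshold` stands in for the second preset threshold:

```python
import numpy as np

def binary_connections(positions, dist_threshold):
    """Set the connection weight to 1 for object pairs whose distance does not
    reach the threshold, and to 0 otherwise.

    positions: (N, 2) object coordinates in one image.
    Returns an (N, N) 0/1 connection matrix with a zero diagonal.
    """
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)           # pairwise Euclidean distances
    conn = (dist < dist_threshold).astype(np.float32)
    np.fill_diagonal(conn, 0.0)                    # no self-connections here
    return conn
```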
Since these connection relationships are determined by the distances between objects, the spatiotemporal graph built from them can indicate the distance relationships between objects, and the extracted features determined by performing graph convolution on the spatiotemporal graph also contain the distance information between objects. Therefore, when classifying the crowd behavior in the target image sequence based on these extracted features, the classification accuracy for behaviors such as pedestrian gathering, pedestrian dispersion, or pedestrian retention can be improved.
In some examples, to further improve the classification accuracy, the connection weight between two objects can be determined according to the true distance between them.
Specifically, the determined distance between any two objects can be mapped into the interval formed by a third preset threshold and a fourth preset threshold, both of which are empirical thresholds. In some examples, the third preset threshold is 0 and the fourth preset threshold is 1.
After the mapping is completed, the mapped distance between any two objects can be determined as the connection weight between them, and the connection relationship between the two objects is indicated by this connection weight.
Since in this example the connection relationship between two objects is determined from their true distance, the spatiotemporal graph can indicate distance information that is closer to reality, further improving the classification accuracy.
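The application does not fix the mapping function; the sketch below assumes a simple min-max rescaling of the pairwise distances into the interval formed by the third and fourth preset thresholds (0 and 1 in the example above), with the mapped distances used directly as connection weights:

```python
import numpy as np

def distance_to_weights(dist, low=0.0, high=1.0):
    """Map a pairwise distance matrix into [low, high] and use the mapped
    values as connection weights.

    dist: (N, N) matrix of true distances between objects.
    """
    d_min, d_max = dist.min(), dist.max()
    scaled = (dist - d_min) / (d_max - d_min + 1e-8)  # min-max scale to [0, 1]
    return low + (high - low) * scaled                # map into [low, high]
```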
After the connection relationships between any two objects in each image of the target image sequence are determined, S104 can be continued.
Please continue to refer to FIG. 2. S1042 can be implemented by a graph convolution model.
The graph convolution model may be a model built on a spatiotemporal graph convolution network, which at least includes a spatial graph convolution network (GCN) for performing spatial graph convolution on each frame, and a temporal convolution network (TCN) for performing temporal convolution on the graph features corresponding to the frames.
Please refer to FIG. 4, which is a schematic flowchart of graph convolution processing shown in this application.
As shown in FIG. 4, in S1042, the position change information can be input into the GCN of the graph convolution model to perform S402: based on the object position information in each image of the target image sequence represented by the position change information, and the connection relationships between the objects in each image, spatial graph convolution processing is performed on each image to obtain the graph feature corresponding to each image.
In this step, the adjacency matrix corresponding to each image can be determined based on the connection relationships between the objects in that image. In some embodiments, a topology graph corresponding to each image can be generated first: for each image, the objects in it are taken as the vertices V of the topology graph, and the edges E are determined according to the connection relationships between the objects, yielding the topology graph corresponding to each image.
After the topology graph corresponding to each image is generated, the adjacency matrix A corresponding to each image can be determined from that topology graph, and the feature matrix X_0 corresponding to each image can be determined from the object position information.
After the adjacency matrix and the feature matrix are determined, the spatial graph convolution processing can be completed based on them, yielding the graph feature corresponding to each image.
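Putting the earlier sketches together, one frame's inputs to the spatial graph convolution could be assembled as follows (names hypothetical; the connection weights may come from either of the connection sketches above):

```python
import numpy as np

def frame_graph(positions, conn_weights):
    """Assemble one frame's adjacency matrix A and feature matrix X0.

    positions:    (N, 2) coordinates of the N objects (the vertices V).
    conn_weights: (N, N) connection weights between objects (the edges E).
    """
    A = np.asarray(conn_weights, dtype=np.float32)  # adjacency matrix A
    X0 = np.asarray(positions, dtype=np.float32)    # feature matrix X0
    return A, X0
```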
It should be noted that the present application does not specifically limit the graph convolution formula. In some examples, the following layer-wise propagation rule may be used:

$X^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}}\,\tilde{A}\,\tilde{D}^{-\frac{1}{2}}\,X^{(l)}\,\theta\right)$

where $\tilde{A} = A + I$ adds a self-loop to each vertex so that each vertex retains its own features, $\tilde{D}$ is the diagonal degree matrix of $\tilde{A}$ (with $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$), $\sigma$ is an activation function, and $\theta$ is the network parameter of the graph convolutional network (the training process is shown later in this application and is not described here). $X^{(l)}$ is the input of the $(l+1)$-th hidden layer of the GCN, and $X^{(l+1)}$ is the output after the operation of the $(l+1)$-th hidden layer.
After the graph features corresponding to the images are obtained, they may be input into the TCN included in the graph convolution model to execute S404: temporal convolution is performed on the graph features corresponding to the images, yielding the extracted feature corresponding to the target image sequence.
In this step, the graph features corresponding to the images may first be sorted according to the temporal information represented by the position change information. Then, based on a preset one-dimensional convolution kernel, one-dimensional convolution is performed on the sorted graph features to obtain the extracted feature corresponding to the target image sequence.
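By way of illustration only, the temporal convolution step might look like the following PyTorch sketch; the tensor layout (batch, channels, time, objects) and the kernel size are assumptions for the sketch.

```python
import torch
import torch.nn as nn

class TemporalConv(nn.Module):
    """Temporal (1-D) convolution over per-frame graph features.

    Input shape (batch, C, T, N): T frames sorted by time, N objects.
    The kernel spans only the time axis, so each object's features are
    aggregated across neighbouring frames.
    """

    def __init__(self, channels, kernel_size=9):
        super().__init__()
        pad = (kernel_size - 1) // 2
        self.conv = nn.Conv2d(channels, channels,
                              kernel_size=(kernel_size, 1),
                              padding=(pad, 0))

    def forward(self, x):
        return self.conv(x)

# Usage: stack the per-frame graph features in temporal order first.
# x = torch.stack(graph_feats)         # (T, N, C), graph_feats sorted by time
# x = x.permute(2, 0, 1).unsqueeze(0)  # (1, C, T, N)
# extracted = TemporalConv(channels=x.shape[1])(x)
```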
Please continue to refer to FIG. 2. After the extracted feature corresponding to the target image sequence is obtained, S1044 may be executed: the crowd behavior corresponding to the multiple objects in the target image sequence is determined based on the extracted feature.
In this step, the extracted feature may be input into a pre-trained multi-classifier for classification, thereby obtaining the crowd behavior.
Please refer to FIG. 5, which is a schematic diagram of a classification flow according to the present application.
As shown in FIG. 5, the multi-classifier includes a down-sampling unit and a fully connected layer. The down-sampling unit may be used to process the extracted feature into a corresponding feature vector; for example, it may be an average pooling unit. The fully connected layer is used to classify based on the feature vector, yielding a confidence score for each preset classification type.
Continuing with FIG. 5, when S1044 is executed, the extracted feature may be input into the down-sampling unit to execute S502: average pooling is applied to the extracted feature to obtain the corresponding feature vector. The feature vector may then be input into the fully connected layer to execute S504: fully connected processing is applied to the feature vector, yielding the confidence score corresponding to each preset classification type.
After the confidence scores are obtained, the crowd behavior type corresponding to the maximum confidence score may be determined as the crowd behavior of the multiple objects in the target image sequence. The crowd behavior includes at least one of the following: pedestrian gathering; pedestrian dispersion; pedestrian retention; pedestrian counterflow.
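By way of illustration only, the multi-classifier described above might be sketched as follows; the label set, the input layout, and the module names are assumptions for the sketch.

```python
import torch
import torch.nn as nn

BEHAVIORS = ["gathering", "dispersion", "retention", "counterflow"]

class CrowdBehaviorHead(nn.Module):
    """Multi-classifier sketch: average pooling plus a fully connected layer."""

    def __init__(self, in_channels, num_classes=len(BEHAVIORS)):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)    # down-sampling unit (average pooling)
        self.fc = nn.Linear(in_channels, num_classes)

    def forward(self, x):                      # x: (batch, C, T, N) extracted feature
        v = self.pool(x).flatten(1)            # feature vector, (batch, C)
        return self.fc(v)                      # one confidence score per preset class

# The class with the maximum confidence score is taken as the crowd behavior:
# scores = CrowdBehaviorHead(in_channels=64)(extracted)
# behavior = BEHAVIORS[scores.argmax(dim=1).item()]
```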
In the above method, object tracking is performed on the objects appearing in the target image sequence to determine the position change information of those objects in the sequence. Graph convolution is then performed based on the position change information to obtain the extracted feature corresponding to the target image sequence, and the crowd behavior corresponding to the multiple objects in the sequence is determined from that feature. The graph convolution principle is thus used to derive, from the target image sequence, extracted features that benefit crowd behavior detection, enabling accurate detection of the crowd behavior represented by the sequence.
An embodiment is described below in conjunction with a security scenario. A security scenario is usually provided with monitoring equipment, which can capture video sequences. It can be understood that, in the security scenario, what is actually classified are the video sequences captured by the monitoring equipment. Please refer to FIG. 6, which is a schematic diagram of a video sequence classification flow according to the present application.
After the target video sequence is acquired, S602 may be executed by the coordinate determination unit: image processing is performed on each image included in the target video sequence to determine the position information, in each image, of the pedestrians appearing in the video.
After the position information is determined, S604 may be executed by the pedestrian tracking unit: based on the position information, object tracking is performed on the pedestrians to determine their position change information in the target image sequence.
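By way of illustration only, a minimal constant-velocity Kalman filter of the kind such a tracking unit might maintain per pedestrian is sketched below. The noise magnitudes are placeholders, and the data association step that matches new detections to existing tracks is omitted.

```python
import numpy as np

class KalmanTrack:
    """Constant-velocity Kalman filter for one pedestrian centroid.

    State [x, y, vx, vy]; measurements are (x, y) detections per frame.
    """

    def __init__(self, xy):
        self.x = np.array([xy[0], xy[1], 0.0, 0.0])
        self.P = np.eye(4) * 10.0                 # initial state uncertainty
        self.F = np.eye(4)
        self.F[0, 2] = self.F[1, 3] = 1.0         # dt = 1 frame
        self.H = np.zeros((2, 4))
        self.H[0, 0] = self.H[1, 1] = 1.0         # observe position only
        self.Q = np.eye(4) * 1e-2                 # process noise (assumed)
        self.R = np.eye(2)                        # measurement noise (assumed)

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]                         # predicted (x, y)

    def update(self, xy):
        y = np.asarray(xy, dtype=float) - self.H @ self.x   # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)            # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P
```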
After the position change information is determined, S606 may be executed by the graph convolution model included in the graph convolution classification model: graph convolution is performed based on the position change information to obtain the extracted feature corresponding to the target image sequence.
The graph convolution classification model may specifically be a classification model built from a graph convolution model and a multi-classification model. Through this model, on the one hand, a graph convolution operation can be performed on the spatiotemporal graph to determine its corresponding extracted feature; on the other hand, the target image sequence can be classified based on that feature to determine the classification type of the sequence.
After the extracted feature is determined, S608 may be executed by the multi-classification model included in the graph convolution classification model: the crowd behavior corresponding to the multiple objects in the target image sequence is determined based on the extracted feature.
In the above scheme, the graph convolution principle is first used to determine, from the position change information of the pedestrians in the video, extracted features that reflect the changes in distance between the pedestrians across the video sequence. The classification type of the video sequence is then determined based on these features, so as to identify the pedestrian behavior occurring in the sequence; corresponding arrangements can then be made according to the identified behavior to reduce the probability of dangerous events.
The above is an introduction to the image sequence classification scheme of the present application. The training method of the graph convolution classification model used is described below.
The graph convolution classification model may be used to implement the graph convolution processing described above.
In some examples, the graph convolution classification model may include a graph convolution model and a multi-classification model. The graph convolution model takes the position change information of the objects in the target image sequence as input and performs graph convolution to obtain the extracted feature corresponding to the sequence. The multi-classification model takes the extracted feature as input and classifies it to obtain the crowd behavior represented by the target image sequence.
It can be understood that training the graph convolution classification model is, in practice, the process of determining the model parameters of the graph convolution model and the multi-classification model.
A model training method is proposed in this application. The method trains the graph convolution classification model by constructing virtual training samples, so that model training can be carried out even in the absence of real samples.
Please refer to FIG. 7, which is a flowchart of a model training method according to this application. As shown in FIG. 7, the training method includes: S702, generating training samples, where a training sample contains the position change information of multiple objects together with annotation information of the crowd behavior based on that position change information.
In this step, S7022 may be executed first: based on a motion simulation platform, motion patterns are set for the objects appearing in the video.
The motion simulation platform may be any platform capable of motion simulation. In some examples, it may be a game development platform.
A motion pattern may include a speed and a direction of movement. Through the motion patterns, on the one hand, the coordinates of each object in each frame of the video can be determined, and thus the position change information of each object in the video; on the other hand, the crowd behavior represented by the video can be derived. For example, in a security scenario, when the motion patterns of the pedestrians all point in the same direction, the crowd behavior represented by the video can be determined to be pedestrian gathering; otherwise, it can be determined to be pedestrian dispersion.
After the motion pattern of each object is determined, S7024 may be executed: based on the motion patterns, the position change information corresponding to each object is determined, as well as the crowd behavior represented by that position change information. The crowd behavior may include pedestrian gathering, pedestrian dispersion, pedestrian retention, and the like.
After the position change information and the crowd behavior represented by the video are determined, S7026 may be executed: the training samples are generated based on the position change information and the crowd behavior it represents.
In this step, the position change information and the classification types may be encoded, for example by one-hot encoding, to obtain a number of training samples. The present application does not limit the specific encoding scheme.
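By way of illustration only, a strongly reduced stand-in for such simulation-based sample generation is sketched below; the two behavior classes, the speed range, and the rule mapping motion patterns to behaviors are assumptions for the sketch.

```python
import numpy as np

def make_sample(num_objects=8, num_frames=16, behavior="gathering"):
    """Generate one synthetic training sample: positions of shape (T, N, 2)
    plus a one-hot behavior label. A motion pattern is a speed and a
    direction per object, from which per-frame coordinates follow."""
    assert behavior in ("gathering", "dispersion")
    rng = np.random.default_rng()
    start = rng.uniform(0.0, 10.0, size=(num_objects, 2))
    center = start.mean(axis=0)
    # Gathering: move towards a common point; dispersion: move away from it.
    direction = center - start if behavior == "gathering" else start - center
    direction /= np.linalg.norm(direction, axis=1, keepdims=True) + 1e-8
    speed = rng.uniform(0.05, 0.2, size=(num_objects, 1))
    steps = np.arange(num_frames)[:, None, None]          # (T, 1, 1)
    positions = start[None] + steps * (speed * direction)[None]
    one_hot = np.eye(2)[0 if behavior == "gathering" else 1]
    return positions.astype(np.float32), one_hot
```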
After the training samples are obtained, S704 may be executed: the graph convolution classification model is trained based on preset loss information and the training samples until the model converges.
The preset loss information may be loss information set according to experience.
When training the model, hyperparameters such as the learning rate and the number of training epochs may be specified first. After these hyperparameters are determined, the graph convolution classification model (hereinafter, the model) may be trained in a supervised manner on the training samples.
In one supervised training pass, forward propagation may be performed to obtain the computation result output by the model. The error between the true classification type and the computation result can then be evaluated based on the preset loss information. After the error is obtained, the descending gradient can be determined by stochastic gradient descent, and the model parameters can then be updated via backpropagation.
This process may be repeated until the model converges. It should be noted that the convergence condition may be, for example, reaching a preset number of training iterations, or the change in error over M consecutive forward passes falling below a certain threshold. The present application does not specifically limit the convergence condition.
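By way of illustration only, the supervised training loop described above might be sketched as follows; cross-entropy as the preset loss, the hyperparameter values, and the data layout yielded by `loader` are assumptions for the sketch.

```python
import torch
import torch.nn as nn

def train(model, loader, epochs=50, lr=1e-2):
    """Supervised training: forward pass, loss against the annotated
    behavior, gradient by stochastic gradient descent, parameter update
    by backpropagation, repeated until the stopping condition is met."""
    criterion = nn.CrossEntropyLoss()          # one possible preset loss
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):                    # or stop once the error plateaus
        for samples, labels in loader:
            scores = model(samples)            # forward propagation
            loss = criterion(scores, labels)   # error vs. true classification type
            optimizer.zero_grad()
            loss.backward()                    # backpropagation
            optimizer.step()                   # update the model parameters
```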
In the above training method, since constructed training samples are used to train the graph convolution classification model, the training process does not need to rely on real training samples.
In some examples, the object position prediction model used to determine object positions, the object tracking model used for object tracking, and the graph convolution classification model used for graph convolution and classification may also be trained jointly.
In some examples, videos representing pedestrian gathering, pedestrian dispersion, and the like may be constructed on the motion simulation platform, and the constructed videos may be annotated with crowd behaviors to obtain training samples.
After the training samples are obtained, they may be input into the object position prediction model to obtain a first computation result. The first computation result is then input into the object tracking model to obtain a second computation result. The second computation result is then input into the graph convolution classification model to obtain the crowd behavior detection result for the video.
After the detection result is obtained, the parameters of each model can be updated via backpropagation according to the annotation information corresponding to the constructed video.
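By way of illustration only, one joint training step over the three-stage pipeline might be sketched as follows. Treating all three stages as differentiable torch.nn.Module instances is a simplification made for the sketch.

```python
import torch

def joint_step(detector, tracker, classifier, frames, labels, optimizer, criterion):
    """One joint optimization step across the three chained models."""
    first = detector(frames)        # first computation result: object positions
    second = tracker(first)         # second computation result: position changes
    scores = classifier(second)     # crowd behavior detection result
    loss = criterion(scores, labels)
    optimizer.zero_grad()
    loss.backward()                 # gradients flow back through all three models
    optimizer.step()
    return loss.item()
```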
In the above example, joint training of the models can be realized, improving training efficiency.
Corresponding to any of the above embodiments, the present application further provides a crowd behavior detection apparatus.
Please refer to FIG. 8, which is a schematic structural diagram of a crowd behavior detection apparatus according to the present application.
As shown in FIG. 8, the apparatus 80 includes: a position change information determination module 81, configured to determine the position change information of each object in a target image sequence containing multiple objects, based on the object tracking result of at least one object appearing in the sequence; and a crowd behavior detection module 82, configured to perform graph convolution based on the position change information obtained from the target image sequence, and to determine, based on the extracted feature obtained by the graph convolution, the crowd behavior corresponding to the multiple objects in the sequence.
In some illustrated embodiments, the position change information determination module 81 is specifically configured to: perform image processing on each image included in the target image sequence to determine the position information of each object in each image; and perform object tracking on each object to determine, based on the tracking result and the position information, the position change information of each object in the target image sequence.
In some illustrated embodiments, the position change information determination module 81 is specifically configured to: perform object tracking on each object using a Kalman filter algorithm or an object detection model; and determine the position change information of each object based on the position information of the same tracked object in each image.
In some illustrated embodiments, the crowd behavior detection module 82 includes: a spatial graph convolution module, configured to perform spatial graph convolution on each image, based on the object position information in each image of the target image sequence represented by the position change information and on the connection relationships between objects in each image, to obtain the graph feature corresponding to each image; and a crowd behavior determination module, configured to perform temporal convolution on the graph features corresponding to the images and to determine, based on the extracted feature obtained by the temporal convolution, the crowd behavior corresponding to the multiple objects in the target image sequence; where the crowd behavior includes at least one of the following: pedestrian gathering; pedestrian dispersion; pedestrian retention; pedestrian counterflow.
In some illustrated embodiments, the spatial graph convolution module is specifically configured to: determine the adjacency matrix corresponding to each image based on the connection relationships between objects in that image; determine the feature matrix corresponding to each image based on the object position information; and complete the spatial graph convolution based on the adjacency matrix and the feature matrix, obtaining the graph feature corresponding to each image.
In some illustrated embodiments, the apparatus 80 further includes: a connection relationship determination module, configured to determine the connection relationship between any two objects contained in each image of the target image sequence.
In some illustrated embodiments, the connection relationship determination module is specifically configured to: extract the image feature corresponding to the region of the image in which each object is located, where the image feature represents the image information of the object's location; determine, based on the image features of the objects, the similarity between any two objects; and determine two objects whose similarity does not reach a first preset threshold as two objects having a connection relationship.
In some illustrated embodiments, the connection relationship determination module is specifically configured to: perform image processing on each image to determine the position information of the objects in each image; determine, based on the position information of the objects, the distance between any two objects; and determine the connection relationship between any two objects contained in each image based on that distance.
In some illustrated embodiments, the connection relationship determination module is specifically configured to: map the determined distance between any two objects into an interval formed by a third preset threshold and a fourth preset threshold; determine the mapped distance between the two objects as the connection weight between them; and indicate the connection relationship between the two objects through that connection weight.
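By way of illustration only, such a distance-to-weight mapping might be sketched as follows; the linear form of the mapping and the bound values are assumptions for the sketch.

```python
def connection_weight(distance, t3=0.0, t4=1.0, d_min=0.0, d_max=50.0):
    """Map an inter-object distance into the interval [t3, t4] formed by the
    third and fourth preset thresholds; the mapped value is used as the
    connection weight between the two objects."""
    distance = min(max(distance, d_min), d_max)   # clamp to the known range
    return t3 + (t4 - t3) * (distance - d_min) / (d_max - d_min)
```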
In some illustrated embodiments, the graph convolution processing is implemented by a graph convolution classification model, and the training device of the graph convolution classification model includes: a generation module, configured to generate training samples, where a training sample contains the position change information of multiple objects together with annotation information of the crowd behavior based on that position change information; and a training module, configured to train a preset graph convolution model based on the position change information and the annotation information of the crowd behavior to obtain the graph convolution classification model.
In some illustrated embodiments, the generation module is specifically configured to: set motion patterns corresponding to multiple objects based on a motion simulation platform; determine the position change information corresponding to each object based on the motion patterns; determine the crowd behavior represented by the position change information of the objects; and generate the training samples based on the position change information and the crowd behavior it represents.
The embodiments of the crowd behavior detection apparatus of the present application can be applied to an electronic device. Accordingly, the present application discloses an electronic device, which may include a processor and a memory for storing processor-executable instructions, where the processor is configured to invoke the executable instructions stored in the memory to implement the crowd behavior detection method of any of the above embodiments.
Please refer to FIG. 9, which is a schematic diagram of the hardware structure of an electronic device according to this application.
As shown in FIG. 9, the electronic device may include a processor for executing instructions, a network interface for network connection, a memory for storing runtime data for the processor, and a non-volatile storage for storing the instructions corresponding to the crowd behavior detection apparatus.
The embodiments of the apparatus may be implemented in software, in hardware, or in a combination of software and hardware. Taking software implementation as an example, the apparatus, as a logical entity, is formed by the processor of the electronic device in which it resides reading the corresponding computer program instructions from the non-volatile storage into memory and running them. From a hardware perspective, in addition to the processor, memory, network interface, and non-volatile storage shown in FIG. 9, the electronic device in which the apparatus resides may also include other hardware according to its actual functions, which is not described further here. It can be understood that, to increase processing speed, the instructions corresponding to the crowd behavior detection apparatus may also be stored directly in memory, which is not limited here.
The present application proposes a computer-readable storage medium, which may be a volatile or non-volatile storage medium. The storage medium stores a computer program used to execute the crowd behavior detection method of any of the foregoing embodiments.
Those skilled in the art will appreciate that one or more embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, one or more embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, one or more embodiments of the present application may take the form of a computer program product embodied on one or more computer-usable storage media (which may include, but are not limited to, disk storage, CD-ROM, and optical storage) having computer-usable program code embodied therein.
In this application, "and/or" means having at least one of the two; for example, "A and/or B" may cover three cases: A alone, B alone, and both A and B.
The embodiments in this application are described in a progressive manner; for the parts that are the same or similar between embodiments, reference may be made to one another, and each embodiment focuses on its differences from the others. In particular, since the data processing device embodiment is substantially similar to the method embodiment, its description is relatively brief, and reference may be made to the description of the method embodiment for the relevant details.
The foregoing describes specific embodiments of the present application. Other embodiments are within the scope of the appended claims. In some cases, the acts or steps recited in the claims may be performed in an order different from that in the embodiments and still achieve desirable results. In addition, the processes depicted in the figures do not necessarily require the particular order shown, or a sequential order, to achieve desirable results. In some implementations, multitasking and parallel processing are also possible or may be advantageous.
Embodiments of the subject matter and the functional operations described in this application can be implemented in digital electronic circuitry, in tangibly embodied computer software or firmware, in computer hardware (including the structures disclosed in this application and their structural equivalents), or in a combination of one or more of them. Embodiments of the subject matter described in this application can be implemented as one or more computer programs, that is, as one or more modules of computer program instructions encoded on a tangible non-transitory program carrier for execution by, or to control the operation of, a data processing apparatus. Alternatively or additionally, the program instructions can be encoded on an artificially generated propagated signal, such as a machine-generated electrical, optical, or electromagnetic signal, which is generated to encode information for transmission to a suitable receiver apparatus for execution by the data processing apparatus. A computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
The processes and logic flows described in this application can be performed by one or more programmable computers executing one or more computer programs, performing the corresponding functions by operating on input data and generating output. The processes and logic flows can also be performed by special purpose logic circuitry, such as an FPGA (field-programmable gate array) or an ASIC (application-specific integrated circuit), and the apparatus can likewise be implemented as special purpose logic circuitry.
Computers suitable for the execution of a computer program include, for example, general and/or special purpose microprocessors, or any other kind of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory and/or a random access memory. The essential components of a computer include a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, one or more mass storage devices for storing data, such as magnetic disks, magneto-optical disks, or optical disks. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, such as a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device such as a universal serial bus (USB) flash drive, to name just a few.
Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including, for example, semiconductor memory devices (such as EPROM, EEPROM, and flash memory devices), magnetic disks (such as internal hard disks or removable disks), magneto-optical disks, and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
While this application contains many specific implementation details, these should not be construed as limiting the scope of any disclosure or of what may be claimed, but rather as describing features of particular embodiments of a particular disclosure. Certain features that are described in this application in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be removed from the combination, and the claimed combination may be directed to a subcombination or a variation of a subcombination.
Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of the various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Thus, particular embodiments of the subject matter have been described. Other embodiments are within the scope of the appended claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. In addition, the processes depicted in the figures do not necessarily require the particular order shown, or a sequential order, to achieve desirable results. In certain implementations, multitasking and parallel processing may be advantageous.
The above are only preferred embodiments of one or more embodiments of the present application and are not intended to limit them. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of one or more embodiments of the present application shall fall within the protection scope of one or more embodiments of the present application.

Claims (15)

  1. A crowd behavior detection method, characterized in that the method comprises:
    performing object tracking on at least one object appearing in a target image sequence containing multiple objects, and determining position change information of each object in the target image sequence;
    performing graph convolution processing based on the position change information obtained from the target image sequence, and determining, based on the extracted feature obtained by the graph convolution, the crowd behavior corresponding to the multiple objects in the target image sequence.
  2. The method according to claim 1, wherein the performing object tracking on at least one object appearing in a target image sequence containing multiple objects and determining position change information of each object in the target image sequence comprises:
    performing image processing on each image included in the target image sequence, and determining position information of each object in the corresponding image;
    performing object tracking on each object, to determine, based on the tracking result and the position information, the position change information of each object in the target image sequence.
  3. The method according to claim 2, wherein the performing object tracking on each object to determine, based on the tracking result and the position information, the position change information of each object in the target image sequence comprises:
    performing object tracking on each object using a Kalman filter algorithm or an object detection model;
    determining the position change information of each object based on the tracked position information of the same object in the corresponding images.
  4. The method according to any one of claims 1-3, wherein the performing graph convolution processing based on the position change information obtained from the target image sequence to obtain the crowd behavior corresponding to the multiple objects in the target image sequence comprises:
    performing spatial graph convolution on at least one image included in the target image sequence, based on the object position information in the at least one image represented by the position change information and on the connection relationships between objects in the at least one image, to obtain the graph feature corresponding to the at least one image;
    performing temporal convolution on the graph features corresponding to the at least one image, and determining, based on the extracted feature obtained by the temporal convolution, the crowd behavior corresponding to the multiple objects in the target image sequence; wherein the crowd behavior comprises at least one of the following: pedestrian gathering; pedestrian dispersion; pedestrian retention; pedestrian counterflow.
  5. The method according to claim 4, wherein the performing spatial graph convolution on the at least one image, based on the object position information in the at least one image represented by the position change information and on the connection relationships between objects in the at least one image, to obtain the graph feature corresponding to the at least one image comprises:
    determining, based on the connection relationships between objects in the at least one image, the adjacency matrix corresponding to the at least one image;
    determining, based on the object position information, the feature matrix corresponding to the at least one image;
    completing the spatial graph convolution based on the adjacency matrix and the feature matrix, obtaining the graph feature corresponding to each image.
  6. The method according to any one of claims 1-5, wherein before the step of performing graph convolution processing based on the position change information obtained from the target image sequence to obtain the extracted feature corresponding to the target image sequence, the method further comprises:
    determining the connection relationship between any two objects contained in at least one image included in the target image sequence.
  7. The method according to claim 6, wherein the determining the connection relationship between any two objects contained in at least one image included in the target image sequence comprises:
    extracting the image feature corresponding to the region of the image in which each of the at least one object contained in the at least one image is located, the image feature representing the image information of the location of the object; determining, based on the image features corresponding to the objects, the similarity between any two objects;
    determining two objects whose similarity does not reach a first preset threshold as two objects having a connection relationship.
  8. The method according to claim 6, wherein the determining the connection relationship between any two objects contained in at least one image included in the target image sequence comprises:
    performing image processing on the at least one image to determine the position information of the objects in the at least one image;
    determining, based on the position information corresponding to the objects, the distance between any two objects;
    determining, based on the distance, the connection relationship between any two objects contained in the at least one image.
  9. The method according to claim 8, wherein the determining, based on the distance, the connection relationship between any two objects contained in the at least one image comprises:
    mapping the determined distance between any two objects into an interval formed by a third preset threshold and a fourth preset threshold;
    determining the mapped distance between the two objects as the connection weight between the two objects;
    indicating the connection relationship between the two objects through the connection weight between them.
  10. The method according to any one of claims 1-9, wherein
    the graph convolution processing is implemented by a graph convolution classification model;
    wherein the training method of the graph convolution classification model comprises:
    generating training samples, wherein a training sample contains position change information of multiple objects together with annotation information of the crowd behavior based on the position change information of the multiple objects;
    training a preset graph convolution model based on the position change information and the annotation information of the crowd behavior, obtaining the graph convolution classification model.
  11. The method according to claim 10, wherein the generating training samples comprises:
    setting, based on a motion simulation platform, motion patterns corresponding to multiple virtual objects;
    determining, based on the motion patterns, position change information corresponding to at least one virtual object;
    determining the crowd behavior represented by the position change information corresponding to the at least one virtual object;
    generating the training samples based on the position change information and the crowd behavior represented by the position change information.
  12. A crowd behavior detection apparatus, characterized in that the apparatus comprises:
    a position change information determination module, configured to determine position change information of each object in a target image sequence containing multiple objects, based on the object tracking result of at least one object appearing in the target image sequence;
    a crowd behavior detection module, configured to perform graph convolution processing based on the position change information obtained from the target image sequence, and to determine, based on the extracted feature obtained by the graph convolution, the crowd behavior corresponding to the multiple objects in the target image sequence.
  13. An electronic device, characterized in that the device comprises:
    a processor;
    a memory for storing instructions executable by the processor;
    wherein the processor is configured to invoke the executable instructions stored in the memory to implement the crowd behavior detection method according to any one of claims 1-11.
  14. A computer-readable storage medium, characterized in that the storage medium stores a computer program, the computer program being used to execute the crowd behavior detection method according to any one of claims 1-11.
  15. A computer program product, characterized in that, when the computer program product runs on a computer, the computer is caused to execute the crowd behavior detection method according to any one of claims 1-11.
PCT/CN2021/103579 2021-01-26 2021-06-30 Crowd behavior detection method and apparatus, and electronic device, storage medium and computer program product WO2022160591A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020237016722A KR20230090344A (en) 2021-01-26 2021-06-30 Crowd behavior detection method and device, electronic device, storage medium and computer program product

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110106285.7 2021-01-26
CN202110106285.7A CN112800944B (en) 2021-01-26 2021-01-26 Crowd behavior detection method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2022160591A1 true WO2022160591A1 (en) 2022-08-04

Family

ID=75811931

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/103579 WO2022160591A1 (en) 2021-01-26 2021-06-30 Crowd behavior detection method and apparatus, and electronic device, storage medium and computer program product

Country Status (3)

Country Link
KR (1) KR20230090344A (en)
CN (1) CN112800944B (en)
WO (1) WO2022160591A1 (en)

Cited By (4)

Publication number Priority date Publication date Assignee Title
CN117409368A (en) * 2023-10-31 2024-01-16 大连海洋大学 Real-time analysis method of fish aggregation behavior and fish hunger behavior based on density distribution
CN118297989A (en) * 2024-06-05 2024-07-05 中国工程物理研究院流体物理研究所 Semi-supervised high-robustness infrared small target tracking method and system
CN118395062A (en) * 2024-04-01 2024-07-26 华中师范大学 Space-time track travel time estimation method based on context topological graph space-time aggregation
CN119992667A (en) * 2025-04-14 2025-05-13 泸州职业技术学院 Group behavior recognition method based on video data

Families Citing this family (5)

Publication number Priority date Publication date Assignee Title
CN112800944B (en) * 2021-01-26 2023-12-19 北京市商汤科技开发有限公司 Crowd behavior detection method and device, electronic equipment and storage medium
CN113569766B (en) * 2021-07-30 2022-10-04 中国电子科技集团公司第五十四研究所 Pedestrian abnormal behavior detection method for patrol of unmanned aerial vehicle
CN114639062A (en) * 2022-04-07 2022-06-17 上海闪马智能科技有限公司 Video classification method and device, storage medium, and electronic device
CN114943943B (en) * 2022-05-16 2023-10-03 中国电信股份有限公司 Target track obtaining method, device, equipment and storage medium
CN116311524A (en) * 2023-03-22 2023-06-23 凯通科技股份有限公司 Gait feature determining method and device based on camera set and terminal equipment

Citations (4)

Publication number Priority date Publication date Assignee Title
US20170300759A1 (en) * 2016-03-03 2017-10-19 Brigham Young University Automated multiple target detection and tracking system
CN110827292A (en) * 2019-10-23 2020-02-21 中科智云科技有限公司 Video instance segmentation method and device based on convolutional neural network
CN112016413A (en) * 2020-08-13 2020-12-01 南京领行科技股份有限公司 Method and device for detecting abnormal behaviors between objects
CN112800944A (en) * 2021-01-26 2021-05-14 北京市商汤科技开发有限公司 Crowd behavior detection method and device, electronic equipment and storage medium

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
CN109522793B (en) * 2018-10-10 2021-07-23 华南理工大学 Multi-person abnormal behavior detection and recognition method based on machine vision
CN110163890B (en) * 2019-04-24 2020-11-06 北京航空航天大学 A multi-target tracking method for space-based surveillance
AU2020100371A4 (en) * 2020-03-12 2020-04-16 Jilin University Hierarchical multi-object tracking method based on saliency detection

Also Published As

Publication number Publication date
CN112800944B (en) 2023-12-19
KR20230090344A (en) 2023-06-21
CN112800944A (en) 2021-05-14

Legal Events

121 (Ep: the epo has been informed by wipo that ep was designated in this application): Ref document number 21922189; Country of ref document: EP; Kind code of ref document: A1
ENP (Entry into the national phase): Ref document number 20237016722; Country of ref document: KR; Kind code of ref document: A
NENP (Non-entry into the national phase): Ref country code: DE
122 (Ep: pct application non-entry in european phase): Ref document number 21922189; Country of ref document: EP; Kind code of ref document: A1