CN120088870B - Three-dimensional human behavior recognition method, device, terminal and medium - Google Patents
Three-dimensional human behavior recognition method, device, terminal and medium
- Publication number: CN120088870B (application CN202510588159.8A)
- Authority
- CN
- China
- Prior art keywords
- space
- time
- subsequence
- dimensional
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS › G06—COMPUTING OR CALCULATING; COUNTING › G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/20—Movements or behaviour, e.g. gesture recognition (under G06V40/00, recognition of biometric, human-related or animal-related patterns in image or video data)
- G06V10/40—Extraction of image or video features (under G06V10/00, arrangements for image or video recognition or understanding)
- G06V10/774—Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting (under G06V10/77, processing features in feature spaces, e.g. PCA, ICA or SOM)
- G06V10/806—Fusion of extracted features (under G06V10/80, fusion at the sensor, preprocessing, feature extraction or classification level)
Abstract
The application relates to the technical field of computer vision, and in particular to a three-dimensional human behavior recognition method, device, terminal and medium. The method comprises: inputting four-dimensional point cloud data into a trained state space model, and extracting subsequences of different time scales from the four-dimensional point cloud data; sorting the three-dimensional point cloud of each frame in each subsequence to obtain an ordered space sequence; splicing the ordered space sequences in time order to obtain a spliced ordered space-time sequence corresponding to each subsequence; determining the center point features corresponding to each subsequence according to the spliced ordered space-time sequence; acquiring low-order space-time features; and obtaining an action recognition result based on the low-order space-time features and the center point features of each subsequence. By taking both space and time information into account, the application can capture complex space-time dependencies, and the computational complexity is reduced through the trained state space model, thereby improving the accuracy and efficiency of the behavior recognition result.
Description
Technical Field
The present invention relates to the field of computer vision, and in particular, to a three-dimensional human behavior recognition method, apparatus, terminal, and medium.
Background
The four-dimensional point cloud video can simultaneously capture dynamic geometric information of a three-dimensional space and motion characteristics changing along with time, so that the four-dimensional point cloud video can be applied to tasks such as motion recognition, human body posture estimation, environment modeling, intelligent interaction and the like. Compared with the traditional RGB video and depth image, the four-dimensional point cloud video has higher robustness under the condition of low illumination or visual angle change, and is particularly suitable for human behavior analysis in complex and dynamic environments.
However, in the prior art, architectures with quadratic complexity struggle to efficiently capture the space-time dependencies of four-dimensional point clouds, while selective state space models with linear complexity have a unidirectional recurrent structure, which limits their effectiveness on spatio-temporally unordered four-dimensional point clouds, so the accuracy and efficiency of the behavior recognition results are low.
Accordingly, the prior art has drawbacks and needs to be improved and developed.
Disclosure of Invention
The application provides a three-dimensional human behavior recognition method, a device, a terminal and a medium, which are used for solving the technical problems of lower accuracy and efficiency of behavior recognition results in the related technology.
In order to achieve the above purpose, the present application adopts the following technical scheme:
a method of three-dimensional human behavior recognition, wherein the method comprises:
inputting four-dimensional point cloud data to be analyzed into a trained state space model, and extracting subsequences of different time scales in the four-dimensional point cloud data;
sequencing each frame of three-dimensional point cloud in each subsequence to obtain an ordered space sequence corresponding to each frame of three-dimensional point cloud;
splicing the ordered space sequences corresponding to the three-dimensional point clouds of all frames in each subsequence according to time sequence to obtain a spliced ordered space-time sequence corresponding to each subsequence;
determining the central point characteristic corresponding to each subsequence according to the spliced ordered space-time sequence corresponding to each subsequence;
And acquiring low-order space-time characteristics extracted from four-dimensional point cloud data to be analyzed, and acquiring an action recognition result based on the low-order space-time characteristics and the central point characteristics of each subsequence.
In one embodiment of the present application, determining the center point feature corresponding to each sub-sequence according to the spliced ordered spatio-temporal sequence corresponding to each sub-sequence includes:
constructing a space-time neighborhood graph of each central point on each spliced ordered space-time domain;
and carrying out normalization processing and feature fusion on the point features in each space-time neighborhood graph to obtain the center point feature corresponding to each subsequence.
In one embodiment of the present application, constructing a spatiotemporal neighborhood graph for each center point on each of the spliced ordered space-time sequences includes:
And constructing a space-time neighborhood graph of each central point on each spliced ordered space-time by using a K nearest neighbor method and a space-time embedding method.
In one embodiment of the present application, obtaining the motion recognition result based on the low-order spatiotemporal features and the center point features of each of the sub-sequences includes:
Inputting the low-order space-time features into a gesture encoder to obtain predicted skeleton key points;
inputting the skeleton key points into a gesture decoder to obtain high-dimensional geometric characteristics;
and fusing the high-dimensional geometric features with the central point features of the subsequences to obtain an action recognition result.
In one embodiment of the application, the training step of the state space model comprises:
acquiring a training data set, wherein the training data set comprises four-dimensional point cloud training data and corresponding action labels;
Inputting four-dimensional point cloud training data into an initial state space model, and extracting training subsequences of different time scales in the four-dimensional point cloud training data;
sequencing each frame of three-dimensional point cloud in each training subsequence to obtain ordered space sequence training data corresponding to each frame of three-dimensional point cloud;
Splicing the ordered space sequence training data corresponding to the three-dimensional point clouds of all frames in each training subsequence according to time sequence to obtain spliced ordered space-time sequence training data corresponding to each training subsequence;
determining central point characteristic training data corresponding to each training subsequence according to the spliced ordered space-time sequence training data corresponding to each training subsequence;
And acquiring low-order space-time feature training data extracted from the four-dimensional point cloud training data, and training based on the low-order space-time feature training data, the central point feature training data of each training subsequence and the action labels to obtain a trained state space model.
In one embodiment of the present application, determining center point feature training data corresponding to each training sub-sequence according to the concatenated ordered spatio-temporal sequence training data corresponding to each training sub-sequence includes:
constructing a space-time neighborhood training diagram of each central point on each spliced ordered space-time sequence training data;
And carrying out normalization processing and feature fusion on the point features in each time-space neighborhood training diagram to obtain the central point feature training data corresponding to each training subsequence.
In one embodiment of the application, the training data set further comprises real skeleton key points corresponding to four-dimensional point cloud training data;
Training based on the low-order space-time feature training data, the central point feature training data of each training sub-sequence and the action labels to obtain a trained state space model, wherein the training comprises the following steps:
inputting the low-order space-time feature training data into a gesture encoder to obtain predicted skeleton key points;
Calculating the mean square error loss between the predicted skeleton key points and the real skeleton key points corresponding to the four-dimensional point cloud training data so as to train the gesture encoder;
inputting the predicted skeleton key points into a gesture decoder to obtain high-dimensional geometric feature training data;
And fusing the high-dimensional geometric feature training data with the central point feature training data of each training subsequence, and training by taking the action label as a true value to obtain a trained state space model.
The application also provides a three-dimensional human behavior recognition device, wherein the device comprises:
the extraction module is used for inputting four-dimensional point cloud data to be analyzed into a trained state space model, and extracting subsequences of different time scales in the four-dimensional point cloud data;
the sequencing module is used for sequencing each frame of three-dimensional point cloud in each subsequence to obtain an ordered space sequence corresponding to each frame of three-dimensional point cloud;
The splicing module is used for splicing the ordered space sequences corresponding to the three-dimensional point clouds of all the frames in each subsequence according to the time sequence to obtain a spliced ordered space-time sequence corresponding to each subsequence;
The determining module is used for determining the center point characteristic corresponding to each subsequence according to the spliced ordered space-time sequence corresponding to each subsequence;
the recognition module is used for acquiring low-order space-time characteristics extracted from four-dimensional point cloud data to be analyzed, and obtaining action recognition results based on the low-order space-time characteristics and central point characteristics of each subsequence.
The application also provides a terminal which comprises a memory, a processor and a three-dimensional human body behavior recognition program stored on the memory and capable of running on the processor, wherein the three-dimensional human body behavior recognition program realizes the steps of the three-dimensional human body behavior recognition method when being executed by the processor.
The present application also provides a computer readable storage medium storing a computer program executable for implementing the steps of the three-dimensional human behavior recognition method as described above.
The method has the following advantages: the four-dimensional point cloud data to be analyzed is input into a trained state space model, and subsequences of different time scales in the four-dimensional point cloud data are extracted; each frame of three-dimensional point cloud in each subsequence is sorted to obtain an ordered space sequence corresponding to that frame; the ordered space sequences corresponding to the three-dimensional point clouds in each subsequence are spliced according to time order to obtain a spliced ordered space-time sequence corresponding to each subsequence; the center point features corresponding to each subsequence are determined according to the spliced ordered space-time sequence corresponding to each subsequence; the low-order space-time features extracted from the four-dimensional point cloud data to be analyzed are acquired; and the action recognition result is obtained based on the low-order space-time features and the center point features of each subsequence. Since both space and time information are taken into account, complex space-time dependencies can be captured, and the computational complexity is reduced through the trained state space model, thereby improving the accuracy and efficiency of the behavior recognition result.
Drawings
FIG. 1 is a flow chart of a three-dimensional human behavior recognition method according to a preferred embodiment of the present invention.
Fig. 2 is a schematic block diagram of the present invention from four-dimensional point cloud data input to motion recognition result output.
FIG. 3 is a test result of the state space model of the present invention on a test set.
Fig. 4 is a schematic diagram of the visual attention of three-dimensional human behavior recognition in the present invention.
Fig. 5 is a functional block diagram of a preferred embodiment of the three-dimensional human behavior recognition device of the present invention.
Fig. 6 is a functional block diagram of a preferred embodiment of the terminal of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more clear and clear, the present invention will be further described in detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
The prior art has the following key drawbacks and limitations in four-dimensional point cloud video analysis:
First, space-time dependencies cannot be captured efficiently. Existing methods based on convolutional neural networks (CNNs) cannot effectively process long-term space-time dependencies in four-dimensional point clouds. Although convolutional neural networks can extract local geometric features, they are weak at temporal modeling and are prone to losing timing information, especially when processing long time sequences.
Second, the consumption of computing resources is large. Although Transformer-based models can capture long-range temporal dependencies, their computational complexity and memory consumption increase sharply with the length of the input sequence, which limits their application to high-dimensional, long-sequence point cloud data. This leads to higher hardware and computational resource requirements and limits efficiency in practical applications.
Third, the problem of spatiotemporal disorder. Although the existing State Space Model (SSM) can effectively process space data, due to the lack of effective time sequence modeling, space-time correlation is not fully utilized, so that the performance in space-time joint modeling task of four-dimensional point cloud video is poor.
Fourth, lack of robustness against data imperfections and noise. Existing four-dimensional point cloud analysis methods generally assume that the input data is complete and noiseless. However, in practical applications, the point cloud data tends to be incomplete or contain noise due to hardware limitations. The prior art has a weak processing power for these incomplete and noisy data, resulting in a poor robustness of the model in the face of sparse and noisy data sets.
Aiming at the defects of the prior art, the method solves the problems that the prior method cannot effectively capture space-time dependence, has high computational complexity, is difficult to model in space-time disorder, lacks robustness and the like. Specifically, the space-time dependency in the four-dimensional point cloud video can be effectively processed through the combined space-time serialization and structural modeling, the calculation resource consumption is greatly reduced through the state space model, and the efficiency and accuracy of the model in the long-sequence modeling are improved.
The three-dimensional human behavior recognition method, device, terminal and medium of the embodiments of the application are described below with reference to the accompanying drawings. In the related art, the spatial point cloud data and the time information of four-dimensional point cloud data are often unordered, which leads to low accuracy of behavior recognition results. To address this problem, the application provides a three-dimensional human behavior recognition method. In the method, four-dimensional point cloud data to be analyzed is input into a trained state space model, and subsequences of different time scales in the four-dimensional point cloud data are extracted; each frame of three-dimensional point cloud in each subsequence is sorted to obtain an ordered space sequence corresponding to each frame; the ordered space sequences corresponding to the three-dimensional point clouds in each subsequence are spliced according to time order to obtain a spliced ordered space-time sequence corresponding to each subsequence; the center point features corresponding to each subsequence are determined according to the spliced ordered space-time sequence; the low-order space-time features extracted from the four-dimensional point cloud data to be analyzed are acquired; and an action recognition result is obtained based on the low-order space-time features and the center point features of each subsequence. By sorting the three-dimensional point cloud of each frame into an ordered space sequence, both space and time information are taken into account, complex space-time dependencies can be captured, and the computational complexity is reduced through the trained state space model, thereby improving the accuracy and efficiency of the behavior recognition result.
Referring to fig. 1, the three-dimensional human behavior recognition method according to the embodiment of the invention includes the following steps:
And step S100, inputting four-dimensional point cloud data to be analyzed into a trained state space model, and extracting subsequences of different time scales in the four-dimensional point cloud data.
The four-dimensional point cloud data can be a four-dimensional point cloud video, and the trained state space model provided by the application is used for efficient analysis of four-dimensional point cloud data and human action recognition. In one embodiment, the state space model comprises a hierarchical ordered sequencer, a cross-time-sequence serialization module, a spatio-temporal structure aggregation layer, and a gesture perception feature optimization module. The embodiment of the application converts the unordered four-dimensional point cloud into an ordered sequence through cross-time-sequence serialization, and then uses the state space model to efficiently capture space-time dependencies. At the same time, the spatio-temporal structure aggregation layer and the hierarchical ordered sequencer are introduced to further optimize feature extraction and multi-scale space-time modeling. In addition, the gesture perception feature optimization module enhances the robustness of the model on sparse, incomplete and highly noisy datasets by introducing a pose estimation branch. Therefore, the application can efficiently and accurately capture the complex space-time dependencies in four-dimensional point cloud data while maintaining linear computational complexity; it not only improves the accuracy and efficiency of motion recognition, but also outperforms traditional convolutional neural networks and Transformer architectures in running time and memory usage.
As shown in fig. 2, the input four-dimensional point cloud data comprises the three-dimensional point clouds corresponding to times t = 1, 2, …, T. The four-dimensional point cloud data is processed by point 4D convolution so that the hierarchical ordered sequencer and the gesture perception feature optimization module can process it, where X^(S) denotes the subsequences of different time scales.
Specifically, the layered ordered sequencer can balance the time sequence changes of high frequency and low frequency in the four-dimensional point cloud, and the receptive field of the model is expanded through multi-scale time downsampling.
Specifically, the four-dimensional point cloud data is represented as X ∈ R^(T×N×3), where X denotes the four-dimensional point cloud data, R denotes the set of real numbers, and T × N × 3 denotes a feature with T time steps, N points, and 3 channels per point. The low-order space-time features of the four-dimensional point cloud data are represented as F ∈ R^(T×N×C), where F denotes the low-order space-time features and T × N × C denotes a feature with T time steps, N points, and C channels per point.
The hierarchical ordered sequencer adopts a downsampling strategy with exponential step length to extract subsequences of different time scales, according to the formulas:

X^(S)[t, n, c] = X[tS, n, c]
F^(S)[t, n, f] = F[tS, n, f]

where X[·] denotes slicing of X and t traverses the subscripts of the sampled frames: for example, if the total number of frames is 24 and S = 2, then t runs from 1 to 12, so tS takes the values 2, 4, 6, 8, …, 24. n traverses the points from 1 to N, c traverses the coordinate dimensions from 1 to 3, and f traverses the feature dimensions from 1 to C. The step length S = 2^s, where s is the sampling level, represents the different time scales.
According to the embodiment of the application, the multi-scale downsampling is performed by using the layered ordered sequencer, so that the capturing of high-frequency details and low-frequency structures is balanced, the receptive field of the model is expanded, and the model performs better when long time sequence actions are processed.
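As an illustrative sketch of the exponential-stride downsampling described above (the tensor layout and helper name are assumptions for illustration, not the patent's reference implementation):

```python
import torch

def hierarchical_subsequences(points: torch.Tensor, num_levels: int = 3):
    """points: (T, N, C) four-dimensional point cloud video. Returns one
    subsequence per sampling level s, taken with stride S = 2**s, so that
    coarser levels cover a longer effective time horizon."""
    subsequences = []
    for s in range(num_levels):
        stride = 2 ** s                      # step length S = 2^s
        # keep frames tS for t = 1..floor(T/S); all N points are kept
        subsequences.append(points[stride - 1::stride])
    return subsequences

video = torch.randn(24, 1024, 3)             # e.g. T = 24 frames of 1024 points
for sub in hierarchical_subsequences(video):
    print(tuple(sub.shape))                  # (24, ...), (12, ...), (6, ...)
```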
As shown in fig. 1, the three-dimensional human behavior recognition method further includes the following steps:
Step S200, sequencing each frame of three-dimensional point cloud in each subsequence to obtain an ordered space sequence corresponding to each frame of three-dimensional point cloud.
And step S300, splicing the ordered space sequences corresponding to the three-dimensional point clouds of all frames in each subsequence according to the time sequence, so as to obtain a spliced ordered space-time sequence corresponding to each subsequence.
The embodiment of the application also provides a cross-time-sequence serialization module, which converts unordered four-dimensional point cloud data into an ordered sequence so as to fit the unidirectional modeling requirement of the state space model (SSM). The three-dimensional point cloud of each frame in a subsequence is sorted using a Hilbert curve, which maintains local continuity in space and reduces the distance between adjacent points in the sequence. The point clouds of the frames are then serialized in time order, ensuring the consistency of the timing information.
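A minimal sketch of this cross-time-sequence serialization is given below; it assumes the third-party hilbertcurve package for the space-filling-curve index and a simple min-max quantization of coordinates, both illustrative choices rather than the patented procedure:

```python
import numpy as np
from hilbertcurve.hilbertcurve import HilbertCurve

def serialize_frames(frames: np.ndarray, bits: int = 10) -> np.ndarray:
    """frames: (T, N, 3) per-frame point coordinates. Returns (T*N, 4) rows
    of (x, y, z, t): spatially ordered within each frame, frames in time
    order, i.e. the spliced ordered space-time sequence."""
    curve = HilbertCurve(p=bits, n=3)        # 3-D curve, 2**bits cells per axis
    lo, hi = frames.min(), frames.max()
    grid = ((frames - lo) / (hi - lo + 1e-9) * (2 ** bits - 1)).astype(int)
    ordered = []
    for t, (frame, cells) in enumerate(zip(frames, grid)):
        keys = curve.distances_from_points(cells.tolist())
        order = np.argsort(keys)             # Hilbert order keeps neighbors close
        xyzt = np.concatenate([frame[order], np.full((len(frame), 1), t)], axis=1)
        ordered.append(xyzt)
    return np.concatenate(ordered, axis=0)   # splice frames in time order
```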
As shown in fig. 1, the three-dimensional human behavior recognition method further includes the following steps:
Step S400, determining the central point characteristic corresponding to each subsequence according to the spliced ordered space-time sequence corresponding to each subsequence.
In the embodiment of the present application, the step S400 specifically includes:
step S410, constructing a space-time neighborhood graph of each central point on each spliced ordered space-time domain;
And step S420, carrying out normalization processing and feature fusion on the point features in each space-time neighborhood graph to obtain the center point features corresponding to each subsequence.
Specifically, the ordered space sequences of all frames are spliced in time order to form the overall spatio-temporal ordered sequences X' ∈ R^(L×4) and F' ∈ R^(L×C), where X' denotes the center point coordinates, L denotes the sequence length, and F' denotes the center point features. Since the input of the state space model is constrained to be three-dimensional (batch size, length, and feature dimension), T and N are multiplied and merged into one L dimension, which contains the N points within each of the T frames.
The application provides a space-time structure aggregation layer, which is used to construct a space-time neighborhood graph of each center point on each spliced ordered space-time sequence. A spliced ordered space-time sequence has a plurality of center points, which are obtained by farthest point sampling of the input points. The point features in each space-time neighborhood graph are normalized and then fused through a multi-layer perceptron (MLP) to generate updated center point features. Denoting the coordinates of a center point by p_c = (x_c, y_c, z_c, t_c) and its features by f_c, the computation is as follows:

N(p_c) = KNN(p_c, K)
(Δx_i, Δy_i, Δz_i, Δt_i) = space-time embedding of p_i − p_c, for each neighbor p_i ∈ N(p_c)
f̃_i = (f_i − f_c) / (σ + ε), where σ is computed from the differences f_j − f_c over j = 1, …, K
f'_i = rescaling of f̃_i using the natural logarithm
f'_c = MLP({(Δx_i, Δy_i, Δz_i, Δt_i, f'_i)}, i = 1, …, K)

where f_c denotes the center point features, p_c denotes the center point coordinates, and KNN denotes the K nearest neighbor method. The coordinates of the center point and of the points within its neighborhood (simply called neighbors) in fig. 2 are denoted (x, y, z, t). Δx_i, Δy_i, Δz_i and Δt_i denote the differences between neighbor i and the center point on the spatial x, y and z axes and at time t after space-time embedding; f_i denotes the neighbor feature; f̃_i denotes the intermediate result after normalizing the difference between the neighbor features and the center point features; ε denotes a very small constant, typically used to avoid division by zero and for numerical stability; f'_i denotes the updated neighbor feature; ln denotes the natural logarithm; f'_c denotes the updated center point feature obtained after feature fusion; K denotes the number of neighbors obtained by the KNN algorithm; i and j both traverse the K neighbors; MLP denotes the multi-layer perceptron; and f_i and f_j denote the features of the i-th and j-th traversed points.
The embodiment of the application can extract and integrate the local space-time characteristics of the point cloud, and realizes the update of the central point characteristics by constructing the space-time neighborhood graph.
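The following sketch illustrates one way such a spatio-temporal structure aggregation layer can be realized; the std-based normalization and max-pooling aggregation are assumptions consistent with the description above, not the exact patented formulas:

```python
import torch
import torch.nn as nn

class SpatioTemporalAggregation(nn.Module):
    def __init__(self, in_dim: int, out_dim: int, k: int = 16, eps: float = 1e-5):
        super().__init__()
        self.k, self.eps = k, eps
        self.mlp = nn.Sequential(nn.Linear(in_dim + 4, out_dim), nn.ReLU(),
                                 nn.Linear(out_dim, out_dim))

    def forward(self, xyzt, feats, centers_idx):
        # xyzt: (L, 4) spliced ordered sequence, feats: (L, C),
        # centers_idx: (M,) indices of center points chosen upstream by FPS
        centers = xyzt[centers_idx]                        # (M, 4)
        d = torch.cdist(centers, xyzt)                     # distances in (x, y, z, t)
        knn = d.topk(self.k, largest=False).indices        # (M, K) neighborhood graph
        delta = xyzt[knn] - centers[:, None, :]            # (M, K, 4) embedded offsets
        diff = feats[knn] - feats[centers_idx][:, None, :] # (M, K, C) f_i - f_c
        sigma = diff.std(dim=1, keepdim=True)              # stats over the K neighbors
        normed = diff / (sigma + self.eps)                 # normalized differences
        fused = self.mlp(torch.cat([delta, normed], -1))   # (M, K, out_dim)
        return fused.max(dim=1).values                     # updated center features
```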
In one embodiment of the present application, the step S410 is specifically implemented as constructing a space-time neighborhood graph of each center point on each of the spliced ordered space-time sequences by using a K nearest neighbor method and a space-time embedding method.
Specifically, the embodiment of the application uses an extended K Nearest Neighbor (KNN) method and combines a space-time embedding method to construct a space-time neighborhood graph of each central point.
As shown in fig. 1, the three-dimensional human behavior recognition method further includes the following steps:
And S500, acquiring low-order space-time features extracted from four-dimensional point cloud data to be analyzed, and obtaining an action recognition result based on the low-order space-time features and the central point features of each subsequence.
In the embodiment of the present application, the step S500 specifically includes:
step S510, inputting the low-order space-time features into a gesture encoder to obtain predicted skeleton key points;
step S520, inputting the skeleton key points into a gesture decoder to obtain high-dimensional geometric features;
And step S530, fusing the high-dimensional geometric features and the central point features of the subsequences to obtain an action recognition result.
Specifically, the embodiment of the application acquires low-order space-time features through shared point 4D convolution, inputs the low-order space-time features into a trained gesture encoder to obtain predicted skeleton key points, extracts high-dimensional geometric features by using a gesture decoder, and fuses the high-dimensional geometric features with the central point features of each subsequence after pooling to obtain a motion recognition result.
According to the embodiment of the application, the gesture perception feature optimization module is utilized to carry out auxiliary learning of gesture estimation, so that the perception capability of the model on the human body geometric structure and the motion mode is improved, and the recognition accuracy is improved.
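An illustrative sketch of this pose-aware recognition path is shown below; the module dimensions and the simple linear encoder/decoder are assumptions for clarity, standing in for the convolution-based modules of the embodiment:

```python
import torch
import torch.nn as nn

class PoseAwareHead(nn.Module):
    def __init__(self, feat_dim=128, num_kp=20, geo_dim=256, num_classes=20):
        super().__init__()
        self.num_kp = num_kp
        self.pose_encoder = nn.Linear(feat_dim, num_kp * 3)   # predicts skeleton key points
        self.pose_decoder = nn.Sequential(nn.Linear(num_kp * 3, geo_dim), nn.ReLU())
        self.classifier = nn.Linear(geo_dim + feat_dim, num_classes)

    def forward(self, low_order, center_feats):
        # low_order: (B, feat_dim) pooled low-order space-time features
        # center_feats: (B, feat_dim) pooled center point features from the SSM branch
        kp_flat = self.pose_encoder(low_order)                # predicted key points, flattened
        geometry = self.pose_decoder(kp_flat)                 # high-dimensional geometric features
        logits = self.classifier(torch.cat([geometry, center_feats], dim=-1))
        return logits, kp_flat.view(-1, self.num_kp, 3)       # action logits + key points
```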
In one embodiment of the application, the training step of the state space model comprises:
acquiring a training data set, wherein the training data set comprises four-dimensional point cloud training data and corresponding action labels;
Inputting four-dimensional point cloud training data into an initial state space model, and extracting training subsequences of different time scales in the four-dimensional point cloud training data;
sequencing each frame of three-dimensional point cloud in each training subsequence to obtain ordered space sequence training data corresponding to each frame of three-dimensional point cloud;
Splicing the ordered space sequence training data corresponding to the three-dimensional point clouds of all frames in each training subsequence according to time sequence to obtain spliced ordered space-time sequence training data corresponding to each training subsequence;
determining central point characteristic training data corresponding to each training subsequence according to the spliced ordered space-time sequence training data corresponding to each training subsequence;
And acquiring low-order space-time feature training data extracted from the four-dimensional point cloud training data, and training based on the low-order space-time feature training data, the central point feature training data of each training subsequence and the action labels to obtain a trained state space model.
According to the embodiment of the application, through cross-time-sequence serialization, the four-dimensional point cloud data is effectively serialized with both space and time information taken into account, so that the state space model can capture complex space-time dependencies under a unidirectional modeling framework, and the unified modeling improves the model's understanding and recognition of dynamic actions. Compared with traditional convolutional neural networks and Transformer architectures, the method reduces running time and memory usage, and in particular improves computational efficiency when processing long-sequence four-dimensional point clouds. The method is suitable for resource-constrained application scenarios, is particularly suited to human action recognition, and can be extended to fields such as robot navigation, autonomous driving and intelligent monitoring, improving the system's action recognition and response capability in complex environments.
The state space model provided by the application effectively solves the problems of high calculation complexity and difficult space-time dependency capture in four-dimensional point cloud data analysis. The state space model is superior to the traditional method in calculation efficiency and memory use, and robustness and recognition accuracy in a complex environment are improved through a gesture sensing mechanism. As shown in fig. 3 and fig. 4, fig. 3 is a test result of the state space model of the present application on the test set, and fig. 4 is a schematic view of attention visualization of three-dimensional human behavior recognition in the present application.
In one embodiment of the present application, determining center point feature training data corresponding to each training sub-sequence according to the concatenated ordered spatio-temporal sequence training data corresponding to each training sub-sequence includes:
constructing a space-time neighborhood training diagram of each central point on each spliced ordered space-time sequence training data;
And carrying out normalization processing and feature fusion on the point features in each time-space neighborhood training diagram to obtain the central point feature training data corresponding to each training subsequence.
According to the embodiment of the application, the space-time neighborhood training diagram is constructed to realize efficient aggregation of local features, meanwhile, the state space model is responsible for capturing long-range dependence, so that the understanding of global space-time relationship is not sacrificed while the linear complexity of the model is maintained, and efficient local feature extraction and global dependence capture are realized.
In one embodiment of the application, the training data set further comprises real skeleton key points corresponding to four-dimensional point cloud training data;
Training based on the low-order space-time feature training data, the central point feature training data of each training sub-sequence and the action labels to obtain a trained state space model, wherein the training comprises the following steps:
inputting the low-order space-time feature training data into a gesture encoder to obtain predicted skeleton key points;
Calculating the mean square error loss between the predicted skeleton key points and the real skeleton key points corresponding to the four-dimensional point cloud training data so as to train the gesture encoder;
inputting the predicted skeleton key points into a gesture decoder to obtain high-dimensional geometric feature training data;
And fusing the high-dimensional geometric feature training data with the central point feature training data of each training subsequence, and training by taking the action label as a true value to obtain a trained state space model.
In particular, during the training phase, the embodiment of the application uses a convolution-based pose encoder to map the low-order space-time feature training data to predicted skeleton key points K_p ∈ R^(T×kp×3), where kp is the number of skeleton key points; this number is information carried by the training dataset, such as 20 key points in the MSR Action3D dataset. The mean square error (MSE) loss between the predicted skeleton key points and the real skeleton key points is calculated to guide model learning.
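A minimal sketch of one training step with this auxiliary pose-estimation loss follows; the model interface, loss weighting and optimizer settings are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, clouds, gt_keypoints, labels, kp_weight=0.5):
    """clouds: (B, T, N, 3); gt_keypoints: (B, kp, 3); labels: (B,).
    Assumes model(clouds) returns (action logits, predicted key points)."""
    logits, pred_keypoints = model(clouds)
    loss_kp = F.mse_loss(pred_keypoints, gt_keypoints)   # guides the pose encoder
    loss_cls = F.cross_entropy(logits, labels)           # action label as ground truth
    loss = loss_cls + kp_weight * loss_kp
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```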
According to the embodiment of the application, by introducing the gesture estimation task, the learning ability of the model to the human skeleton structure and the motion mode is enhanced, and the robustness on sparse, incomplete and high-noise data sets is improved.
According to the embodiment of the application, the gesture perception feature optimization module is utilized to carry out auxiliary learning of gesture estimation, so that the perception capability of the model to the human body geometric structure and the motion mode is improved, and the robustness and the recognition accuracy on a sparse, incomplete and high-noise data set are enhanced.
In addition, in terms of hardware, to improve the calculation efficiency of the model, the present application can be implemented on a GPU (graphics processing unit), and utilize parallel computing power to accelerate the operations of CTS (cross-time sequential), STSAL (spatiotemporal structure aggregation layer) and SSM (state space model). In terms of software, the application can be implemented in a mainstream deep learning framework (such as PyTorch), and model development and training processes are simplified by using rich APIs (application programming interfaces) and optimization tools. And the optimization technologies such as mixed precision training (Mixed Precision Training) and gradient accumulation (Gradient Accumulation) can be adopted, so that the efficiency and stability of model training are improved. In the aspect of data processing, the method can also carry out filtering processing on the four-dimensional point cloud data, remove noise points and outliers, and improve the input quality and recognition accuracy of the model. Besides attitude estimation, other auxiliary tasks (such as point cloud classification and segmentation) can be introduced, and the feature expression capability of the model is further improved through multi-task learning. The model provided by the application also has compatibility and expansibility, can be compatible with the existing four-dimensional point cloud processing architecture, can be integrated into the existing system as a plug-in module, and improves the time-space modeling capability of the plug-in module. Therefore, the application can flexibly adapt to different application requirements and technical environments, and realizes high-efficiency and accurate four-dimensional point cloud data analysis and human action recognition.
In an embodiment, as shown in fig. 5, based on the three-dimensional human behavior recognition method, the invention further provides a three-dimensional human behavior recognition device, which includes:
The extraction module 100 is configured to input four-dimensional point cloud data to be analyzed into a trained state space model, and extract subsequences of different time scales in the four-dimensional point cloud data;
The ordering module 200 is configured to order each frame of three-dimensional point cloud in each sub-sequence to obtain an ordered spatial sequence corresponding to each frame of three-dimensional point cloud;
The splicing module 300 is configured to splice the ordered spatial sequences corresponding to the three-dimensional point clouds of all the frames in each sub-sequence according to time sequence, so as to obtain a spliced ordered space-time sequence corresponding to each sub-sequence;
A determining module 400, configured to determine a center point feature corresponding to each sub-sequence according to the spliced ordered space-time sequence corresponding to each sub-sequence;
The recognition module 500 is configured to obtain low-order space-time features extracted from four-dimensional point cloud data to be analyzed, and obtain an action recognition result based on the low-order space-time features and central point features of each subsequence.
It should be noted that the foregoing explanation of the embodiment of the three-dimensional human behavior recognition method is also applicable to the three-dimensional human behavior recognition device of this embodiment, and will not be repeated here.
The application further discloses a three-dimensional human behavior recognition device. With the device, four-dimensional point cloud data to be analyzed is input into a trained state space model, and subsequences of different time scales in the four-dimensional point cloud data are extracted; each frame of three-dimensional point cloud in each subsequence is sorted to obtain an ordered space sequence corresponding to each frame; the ordered space sequences corresponding to the three-dimensional point clouds in each subsequence are spliced according to time order to obtain a spliced ordered space-time sequence corresponding to each subsequence; the center point features corresponding to each subsequence are determined according to the spliced ordered space-time sequence; the low-order space-time features extracted from the four-dimensional point cloud data to be analyzed are acquired; and a motion recognition result is obtained based on the low-order space-time features and the center point features of each subsequence. By sorting the three-dimensional point cloud of each frame into an ordered space sequence, both space and time information are taken into account, complex space-time dependencies can be captured, and the computational complexity is reduced through the trained state space model, thereby improving the accuracy and efficiency of the behavior recognition result.
Fig. 6 is a schematic structural diagram of a terminal according to an embodiment of the present application. The terminal may include:
Memory 501, processor 502, and a computer program stored on memory 501 and executable on processor 502.
The three-dimensional human behavior recognition method provided in the above-described embodiment is implemented when the processor 502 executes a program.
Further, the terminal further includes:
a communication interface 503 for communication in the memory 501 and the processor 502.
Memory 501 for storing a computer program executable on processor 502.
The memory 501 may include high-speed RAM memory and may also include non-volatile memory (non-volatile memory), such as at least one disk memory.
If the memory 501, the processor 502, and the communication interface 503 are implemented independently, the communication interface 503, the memory 501, and the processor 502 may be connected to each other via a bus and communicate with each other. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, or an Extended Industry Standard Architecture (EISA) bus, among others. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, the figures show only one line, but this does not mean there is only one bus or one type of bus.
Alternatively, in a specific implementation, if the memory 501, the processor 502, and the communication interface 503 are integrated on a chip, the memory 501, the processor 502, and the communication interface 503 may perform communication with each other through internal interfaces.
The processor 502 may be a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the application.
The present embodiment also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the three-dimensional human behavior recognition method as above.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or N embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present application, "N" means at least two, for example, two, three, etc., unless specifically defined otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and additional implementations are included within the scope of the preferred embodiment of the present application in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order from that shown or discussed, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the embodiments of the present application.
Logic and/or steps represented in the flowcharts or otherwise described herein, e.g., a ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can read instructions from and execute instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include an electrical connection (an electronic device) having one or more wires, a portable computer diskette (a magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer readable medium may even be paper or other suitable medium upon which the program is printed, as the program may be electronically captured, via optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It is to be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the N steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. If implemented in hardware as in another embodiment, it may be implemented using any one or combination of techniques known in the art, discrete logic circuits with logic gates for implementing logic functions on data signals, application specific integrated circuits with appropriate combinational logic gates, programmable Gate Arrays (PGAs), field Programmable Gate Arrays (FPGAs), and the like.
Those of ordinary skill in the art will appreciate that all or part of the steps carried out in the method of the above-described embodiments may be implemented by a program to instruct related hardware, and the program may be stored in a computer readable storage medium, where the program when executed includes one or a combination of the steps of the method embodiments.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing module, or each unit may exist alone physically, or two or more units may be integrated in one module. The integrated modules may be implemented in hardware or in software functional modules. The integrated modules may also be stored in a computer readable storage medium if implemented as software functional modules and sold or used as a stand-alone product.
The above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, or the like. While embodiments of the present application have been shown and described above, it will be understood that the above embodiments are illustrative and not to be construed as limiting the application, and that variations, modifications, alternatives and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the application.
Claims (7)
1. A method for three-dimensional human behavior recognition, the method comprising:
inputting four-dimensional point cloud data to be analyzed into a trained state space model, and extracting subsequences of different time scales in the four-dimensional point cloud data;
sequencing each frame of three-dimensional point cloud in each subsequence to obtain an ordered space sequence corresponding to each frame of three-dimensional point cloud;
splicing the ordered space sequences corresponding to the three-dimensional point clouds of all frames in each subsequence according to time sequence to obtain a spliced ordered space-time sequence corresponding to each subsequence;
determining the central point characteristic corresponding to each subsequence according to the spliced ordered space-time sequence corresponding to each subsequence;
Acquiring low-order space-time characteristics extracted from four-dimensional point cloud data to be analyzed, and acquiring an action recognition result based on the low-order space-time characteristics and the central point characteristics of each subsequence;
determining the central point characteristic corresponding to each subsequence according to the spliced ordered space-time sequence corresponding to each subsequence, wherein the central point characteristic comprises:
constructing a space-time neighborhood graph of each central point on each spliced ordered space-time domain;
carrying out normalization processing and feature fusion on the point features in each space-time neighborhood graph to obtain the center point feature corresponding to each subsequence;
Constructing a space-time neighborhood graph of each central point on each spliced ordered space-time, comprising:
constructing a space-time neighborhood graph of each central point on each spliced ordered space-time by using a K nearest neighbor method and a space-time embedding method;
obtaining an action recognition result based on the low-order space-time features and the central point features of the sub-sequences, wherein the action recognition result comprises the following steps:
Inputting the low-order space-time features into a gesture encoder to obtain predicted skeleton key points;
inputting the skeleton key points into a gesture decoder to obtain high-dimensional geometric characteristics;
and fusing the high-dimensional geometric features with the central point features of the subsequences to obtain an action recognition result.
2. The three-dimensional human behavior recognition method according to claim 1, wherein the training step of the state space model comprises:
acquiring a training data set, wherein the training data set comprises four-dimensional point cloud training data and corresponding action labels;
Inputting four-dimensional point cloud training data into an initial state space model, and extracting training subsequences of different time scales in the four-dimensional point cloud training data;
sequencing each frame of three-dimensional point cloud in each training subsequence to obtain ordered space sequence training data corresponding to each frame of three-dimensional point cloud;
Splicing the ordered space sequence training data corresponding to the three-dimensional point clouds of all frames in each training subsequence according to time sequence to obtain spliced ordered space-time sequence training data corresponding to each training subsequence;
determining central point characteristic training data corresponding to each training subsequence according to the spliced ordered space-time sequence training data corresponding to each training subsequence;
And acquiring low-order space-time feature training data extracted from the four-dimensional point cloud training data, and training based on the low-order space-time feature training data, the central point feature training data of each training subsequence and the action labels to obtain a trained state space model.
3. The method of claim 2, wherein determining the center point feature training data corresponding to each training sub-sequence based on the concatenated ordered spatio-temporal sequence training data corresponding to each training sub-sequence comprises:
constructing a space-time neighborhood training diagram of each central point on each spliced ordered space-time sequence training data;
And carrying out normalization processing and feature fusion on the point features in each time-space neighborhood training diagram to obtain the central point feature training data corresponding to each training subsequence.
4. The three-dimensional human behavior recognition method according to claim 2, wherein the training data set further comprises real skeleton key points corresponding to the four-dimensional point cloud training data;
and training based on the low-order space-time feature training data, the central point feature training data of each training subsequence and the action labels to obtain the trained state space model comprises the following steps:
inputting the low-order space-time feature training data into a pose encoder to obtain predicted skeleton key points;
calculating the mean square error loss between the predicted skeleton key points and the real skeleton key points corresponding to the four-dimensional point cloud training data to train the pose encoder;
inputting the predicted skeleton key points into a pose decoder to obtain high-dimensional geometric feature training data;
and fusing the high-dimensional geometric feature training data with the central point feature training data of each training subsequence, and training with the action labels as ground truth to obtain the trained state space model.
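The encoder supervision in claim 4 reduces to a standard mean-square-error regression onto the ground-truth skeleton key points. A minimal PyTorch sketch, assuming a placeholder MLP encoder, 256-dimensional low-order features and a 25-joint skeleton; none of these shapes or architectures are specified by the patent.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in module: the patent does not disclose the encoder
# architecture, so a plain MLP is used purely for illustration.
pose_encoder = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 25 * 3))
mse = nn.MSELoss()
optimizer = torch.optim.Adam(pose_encoder.parameters(), lr=1e-3)

def encoder_step(low_order_feats, real_keypoints):
    """One supervision step: predict skeleton key points from low-order
    space-time features (B, 256) and regress them onto the real skeleton
    (B, 25, 3) with a mean-square-error loss, as in claim 4."""
    pred = pose_encoder(low_order_feats).view(-1, 25, 3)
    loss = mse(pred, real_keypoints)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return pred.detach(), loss.item()
```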
5. A three-dimensional human behavior recognition device, the device comprising:
an extraction module for inputting four-dimensional point cloud data to be analyzed into a trained state space model and extracting subsequences of different time scales from the four-dimensional point cloud data;
a sorting module for sorting the three-dimensional point cloud of each frame in each subsequence to obtain an ordered space sequence corresponding to each frame of the three-dimensional point cloud;
a splicing module for splicing the ordered space sequences corresponding to the three-dimensional point clouds of all frames in each subsequence in temporal order to obtain a spliced ordered space-time sequence corresponding to each subsequence;
a determining module for determining the central point feature corresponding to each subsequence according to the spliced ordered space-time sequence corresponding to each subsequence;
and an identification module for acquiring low-order space-time features extracted from the four-dimensional point cloud data to be analyzed and obtaining an action recognition result based on the low-order space-time features and the central point features of each subsequence;
wherein determining the central point feature corresponding to each subsequence according to the spliced ordered space-time sequence corresponding to each subsequence comprises:
constructing a space-time neighborhood graph for each central point on each spliced ordered space-time sequence;
performing normalization and feature fusion on the point features in each space-time neighborhood graph to obtain the central point feature corresponding to each subsequence;
wherein constructing the space-time neighborhood graph for each central point on each spliced ordered space-time sequence comprises:
constructing the space-time neighborhood graph for each central point on each spliced ordered space-time sequence by using a K-nearest-neighbor method and a space-time embedding method;
and wherein obtaining the action recognition result based on the low-order space-time features and the central point features of each subsequence comprises the following steps:
inputting the low-order space-time features into a pose encoder to obtain predicted skeleton key points;
inputting the predicted skeleton key points into a pose decoder to obtain high-dimensional geometric features;
and fusing the high-dimensional geometric features with the central point features of each subsequence to obtain the action recognition result.
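Neither the method nor the device claims state how each frame's points are sorted into an ordered space sequence. A minimal sketch, assuming a Morton (Z-order) space-filling-curve ordering as a stand-in serialization — a common way to linearize point clouds for sequence models, but an assumption here — followed by the temporal splicing that the claims do specify.

```python
import numpy as np

def morton_order(points, bits=10):
    """Order one frame's (N, 3) point cloud along a Morton (Z-order) curve.
    The patent only requires *some* deterministic spatial ordering; the
    Morton curve is an assumed example, not the disclosed method."""
    p = points - points.min(axis=0)
    p = (p / (p.max() + 1e-9) * (2**bits - 1)).astype(np.uint64)
    codes = np.zeros(len(points), dtype=np.uint64)
    for b in range(bits):                      # interleave x, y, z bits
        for axis in range(3):
            codes |= ((p[:, axis] >> np.uint64(b)) & np.uint64(1)) << np.uint64(3 * b + axis)
    return points[np.argsort(codes)]

def splice_subsequence(frames):
    """Splice the per-frame ordered spatial sequences in temporal order
    into one ordered space-time sequence for a subsequence."""
    return np.concatenate([morton_order(f) for f in frames], axis=0)
```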
6. A terminal, characterized in that it comprises a memory, a processor and a three-dimensional human behavior recognition program stored on the memory and executable on the processor, wherein the three-dimensional human behavior recognition program, when executed by the processor, implements the steps of the three-dimensional human behavior recognition method according to any one of claims 1 to 4.
7. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed, implements the steps of the three-dimensional human behavior recognition method according to any one of claims 1 to 4.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202510588159.8A CN120088870B (en) | 2025-05-08 | 2025-05-08 | Three-dimensional human behavior recognition method, device, terminal and medium |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN120088870A CN120088870A (en) | 2025-06-03 |
| CN120088870B (en) | 2025-07-11 |
Family
ID=95853303
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202510588159.8A Active CN120088870B (en) | 2025-05-08 | 2025-05-08 | Three-dimensional human behavior recognition method, device, terminal and medium |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN120088870B (en) |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN120388411A (en) * | 2025-06-27 | 2025-07-29 | 湖南超能机器人技术有限公司 | Personnel identification and behavior detection method, device and security robot |
Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN116721207A (*) | 2023-05-30 | 2023-09-08 | 中国科学院深圳先进技术研究院 | Three-dimensional reconstruction method, device, equipment and storage medium based on Transformer model |
Also Published As
| Publication number | Publication date |
|---|---|
| CN120088870A (en) | 2025-06-03 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| Panek et al. | Meshloc: Mesh-based visual localization | |
| Zhang et al. | A review of deep learning-based semantic segmentation for point cloud | |
| CN111401406B (en) | A neural network training method, video frame processing method and related equipment | |
| US12051221B2 (en) | Method and apparatus for pose identification | |
| Cui et al. | 3D semantic map construction using improved ORB-SLAM2 for mobile robot in edge computing environment | |
| Bujanca et al. | Slambench 3.0: Systematic automated reproducible evaluation of slam systems for robot vision challenges and scene understanding | |
| CN114445549B (en) | Three-dimensional dense surface element mapping method based on SLAM and its system and electronic equipment | |
| CN110837811A (en) | Method, device and equipment for generating semantic segmentation network structure and storage medium | |
| US12277717B2 (en) | Object detection method and system, and non-transitory computer-readable medium | |
| WO2021203865A1 (en) | Molecular binding site detection method and apparatus, electronic device and storage medium | |
| Zhao et al. | A directional-edge-based real-time object tracking system employing multiple candidate-location generation | |
| Ma et al. | Loop-closure detection using local relative orientation matching | |
| CN112199994B (en) | A method and device for real-time detection of 3D hands interacting with unknown objects in RGB video | |
| CN120088870B (en) | Three-dimensional human behavior recognition method, device, terminal and medium | |
| US20230229916A1 (en) | Scalable tensor network contraction using reinforcement learning | |
| Chen et al. | Underwater target detection and embedded deployment based on lightweight YOLO_GN | |
| CN113610856B (en) | Method and device for training image segmentation model and image segmentation | |
| CN111797862A (en) | Task processing method, device, storage medium and electronic device | |
| US20250232451A1 (en) | Detecting moving objects | |
| CN118674752A (en) | Real-time target tracking method based on twin network and embedded device | |
| CN119249872A (en) | Point cloud-based robot environment simulation reconstruction method, device, equipment and medium | |
| CN115131871B (en) | A gesture recognition system, method and computing device | |
| US20240296650A1 (en) | Sample-adaptive 3d feature calibration and association agent | |
| Lu et al. | Camera Absolute Pose Estimation Using Hierarchical Attention in Multi-Scene | |
| Zhan et al. | Accelerate point cloud structuring for deep neural networks via fast spatial-searching tree |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |