
AGAR - Attention Graph-RNN for Adaptative Motion Prediction of Point Clouds of Deformable Objects

Published: 13 June 2024

Abstract

This article focuses on motion prediction for point cloud sequences in the challenging case of deformable 3D objects, such as human body motion. First, we investigate the challenges caused by deformable shapes and complex motions present in this type of representation, with the ultimate goal of understanding the technical limitations of state-of-the-art models. From this understanding, we propose an improved architecture for point cloud prediction of deformable 3D objects. Specifically, to handle deformable shapes, we propose a graph-based approach that learns and exploits the spatial structure of point clouds to extract more representative features. Then, we propose a module able to combine the learned features in an adaptative manner according to the point cloud movements. The proposed adaptative module controls the composition of local and global motions for each point, enabling the network to model complex motions in deformable 3D objects more effectively. We tested the proposed method on the following datasets: MNIST moving digits, the Mixamo human body motions [15], and the JPEG [5] and CWIPC-SXR [32] real-world dynamic bodies. Simulation results demonstrate that our method outperforms the current baseline methods given its improved ability to model complex movements as well as preserve point cloud shape. Furthermore, we demonstrate the generalizability of the proposed framework for dynamic feature learning by testing the framework for action recognition on the MSRAction3D dataset [19] and achieving results on par with state-of-the-art methods.

1 Introduction

Point cloud sequences are a flexible and rich geometric representation of volumetric content used in a wide range of applications, from autonomous driving [26, 30] and robotics [24, 50] to virtual/mixed-reality services [4, 34]. Such sequences consist of consecutive point clouds, each composed of an unordered collection of 3D points representing 3D scenes or 3D objects. Although the point cloud is a highly appealing representation impacting multiple sectors, how to properly process it is still an open challenge. One of the most successful methodologies was the development of neural networks able to learn directly from unstructured point cloud data. This approach was pioneered by the PointNet [6] architecture, which learns features by processing each point independently. However, in such an architecture, the local structures that contain key semantic information about the 3D geometry are not captured. PointNet++ [31] addresses this issue by considering neighbourhoods of points instead of acting on each one independently. To this end, PointNet++ employs a hierarchical architecture that processes point neighbourhoods at increasingly larger scales along a multi-resolution hierarchy, as shown on the left side of Figure 1. This approach groups the local features learned from small neighbourhoods into larger neighbourhoods and processes them to learn higher-level features. By learning hierarchical features, the network can abstract the multiple local-to-global structures within the data. Although the PointNet++ hierarchical architecture was initially designed for static point clouds, it has since been extended to the case of dynamic point clouds [22, 23]. In these cases, instead of extracting features from neighbourhoods in a single point cloud, the network extracts dynamic features from a hierarchy of spatio-temporal neighbourhoods across time. The learned dynamic features can be applied to a wide range of downstream tasks, such as action classification, motion prediction and segmentation. In this article, we focus on the point cloud prediction task. Specifically, given a point cloud sequence \(\mathcal {P}=\lbrace P_{1}, \ldots , P_{T}\rbrace\), composed of T frames with \(p_{i,t} \in \mathbf {R}^3\) being the Euclidean coordinates of point i in point cloud \(P_t \in \mathbf {R}^{N \times 3}\), our goal is to predict the coordinates of future point clouds (\(\hat{P}_{T+1},\ldots ,\hat{P}_{T+Q}\)), where Q is the prediction horizon.
Fig. 1.
Fig. 1. Proposed deep learning method for motion prediction of dynamic point clouds. Left: By sequentially processing the point cloud at multiple resolutions, the neural network learns hierarchical features. Middle: The hierarchical features correspond to hierarchical motions (from local to global). Right: To predict complex movements, hierarchical motions/features are then combined in an adaptative fashion via learnable weights.
At the moment, point-based hierarchical methods can be considered the de-facto state-of-the-art approach for point cloud prediction. However, while these methodologies have shown good performance when predicting simple and rigid movements, such as translations in automobile scenes [3], they are often limited when predicting the motion of 3D deformable objects. Addressing this limitation is the main goal of this article. Predicting deformable objects is challenging since the point cloud shape changes over time and the object performs highly complex motions. For example, in a 3D representation of a football player running or a dancer performing during a music event, their point cloud representations change over time following different postures. Moreover, the performed movements are not rigid transformations but rather a combination of multiple and diverse local motions. For instance, if we imagine the player raising a hand while running, their arm and hand will be characterized by a combination of movements (i.e., the local rising movement and the global forward translation). Given these characteristics, processing 3D deformable objects presents two major challenges: (i) establishing point correspondence across time and preserving the shape of the predicted point cloud; (ii) generating accurate motion predictions that are a composition of multiple movements at different levels of resolution.
To address the above challenges, we must first understand whether the current state-of-the-art models are able to do so. Within this context, we first demonstrate these models' inability to establish precise temporal correlations and preserve the predicted point cloud shape. This is because they fail to consider the structural relationships between the points during the learning process. Then, to investigate the challenge of predicting complex motions, we employ the explainability techniques introduced in our previous work [12]. These techniques demonstrated that the hierarchy of dynamic features corresponds to learning from local to global motions (in the centre of Figure 1). In this article, we build upon this interpretation to identify the technical limitations of the current framework. Specifically, we show that most methodologies [7, 11, 22, 29] generate predictions of future motions by combining the hierarchical features via learnable weights. Most critically, to preserve permutation invariance, when combining hierarchical features, the same learned weights are applied to all points across frames. However, in deformable objects, not all points benefit from the same combination of hierarchical features. For example, some points can be described entirely by global motions, while other points are better described by a combination of global and local motions. We show that this fixed combination of hierarchical features is a key limitation to the network’s ability to predict complex motions.
Based on the limitations identified above, we propose AGAR: an attention-based hierarchical graph-recurrent neural network (RNN) for point cloud prediction of deformable objects. Our proposed architecture includes an initial graph-based module that extracts the underlying geometric structure of the input point cloud as spatial features. From the learned spatial features, we construct a spatio-temporal graph that forms more representative neighbourhoods than current methods that neglect the point cloud structure. The graph is then processed by sequential graph-RNN cells that take structural relations between points into account to learn dynamic features. To address the limitation of the fixed combination of hierarchical features, we propose a novel module denoted as Adaptative feature combination. The proposed module employs an attention mechanism to dynamically assign different degrees of importance to each level of hierarchical features. As such, for each point, the network can control the composition of the local and global motions that best describe the point behaviour. This concept is illustrated in the right part of Figure 1, where the network selects the regions that benefit from particular motions (i.e., local, semi-local, global) instead of blindly combining all the motions learned in the multiple hierarchical levels. Besides improving the prediction of complex motions, the Adaptative feature combination module is also an explainability tool. The module allows us to visualize the influence of each learned feature on the predicted motion, providing a deeper understanding of the network’s internal workings.
The proposed method is trained in a self-supervised fashion and tested on several datasets, such as the Mixamo synthetic human body activities dataset [15] and the JPEG [5] and CWIPC-SXR [32] real-world human body datasets, and compared against state-of-the-art methods. To extend this comparison, we also tested on a dataset of rigid objects (the moving MNIST point cloud dataset [7]) and a dataset of automobile scenes (the Argoverse dataset [3]). A key strength of our framework is its ability to extract the general dynamic behaviour of the point cloud as dynamic features. Since such features are useful for downstream tasks, we also tested the proposed architecture for the action recognition task on human bodies (MSRAction3D dataset [19]). The proposed method outperforms the state-of-the-art methods in human body prediction and achieves on-par results for rigid object and automobile scene prediction as well as for the action recognition task. The results demonstrate that our proposed method can leverage the structural relations between points to learn more accurate representations and preserve the point cloud shape during prediction. The results further show that the proposed adaptative feature combination module predicts complex motions in human bodies with more accuracy than the current state-of-the-art approaches. Lastly, the code and datasets required to reproduce the work are made publicly available.1
In summary, the key contributions of our work are:
Understanding of the current state-of-the-art frameworks' key limitation in generating motion flow predictions. We show how the current approach is equivalent to combining learned local and global motions without regard to the point's position in space and time, and how this strategy fails to model the complex motions present in deformable objects.
A novel module that combines hierarchical features in an adaptative manner according to the scene context. The proposed module dynamically controls the composition of local and global motions for each point, allowing the network to predict complex motions with higher accuracy and flexibility. This also offers an explainability tool.
A graph-based module that exploits the point cloud geometric structure to form spatio-temporal neighbourhoods from where the meaningful dynamic features can be extracted. The structural information is further included in the learned dynamic features, reducing the deformation of the predicted point cloud shape.
The remainder of this article is organized as follows: In Section 2, we provide a review of the state of the art in point cloud prediction. In Section 3, we study the hierarchical component and identify the limitations of the state-of-the-art prediction framework. Based on these limitations, in Section 4 we propose AGAR, an improved architecture with graph-RNN cells and a novel Adaptative feature combination module. Section 5 describes implementation details. Finally, the experimental results and conclusion are presented in Sections 6 and 7, respectively.

2 Background

This section provides an overview of the research in dynamic point cloud processing (Section 2.1), followed by a detailed description of the current state-of-the-art point cloud prediction framework and the notation used throughout this article (Section 2.2).

2.1 Related Works

In the current literature, dynamic point cloud processing has been approached from multiple overlapping directions related to motion prediction (e.g., segmentation and action recognition). These high-level tasks share a common challenge: the extraction of temporal correlations between sequential point cloud frames, which is hindered by the irregular structure of the data and by the lack of explicit point-to-point correspondence across time. In the following, we summarize the approaches proposed in the literature to overcome such challenges and how they led to the development of the current state-of-the-art framework for point cloud prediction.
An initial approach to learn from irregularly structured data such as point clouds was to convert them into a regular representation, such as 2D multi-view [27, 51] or 3D voxels [39, 43], and then process the converted data with traditional neural networks. This approach, however, suffered from high memory consumption and quantization errors. Within this context, the hierarchical architecture proposed in PointNet++ [31], able to process raw point cloud data directly, has become a pillar of work for learning-based point cloud processing. The PointNet++ hierarchical architecture has been extended to dynamic point clouds by introducing spatio-temporal neighbourhoods to extract dynamic features. The spatio-temporal neighbourhoods still lack explicit point-to-point correspondence over time. However, by processing the neighbourhoods at multiple scales, the network can capture temporal correlations that would otherwise be hidden. This hierarchical learning strategy has proved to be highly successful at learning from point cloud sequences and has been widely adopted throughout the literature [9, 10, 17, 22, 23, 28, 38, 45]. In PSTNet [9], a hierarchical architecture is used for the action classification of point cloud sequences. In PointPWC-Net [45], a hierarchical architecture learns motion in a coarse-to-fine fashion by learning a motion flow and a cost function between two adjacent frames at each hierarchical level. More recently, attention-based mechanisms have been incorporated into hierarchical architectures [8, 37, 40, 42]. The use of attention allows the network to selectively focus on the most important parts of the point cloud. Although attention mechanisms do not fully address point-to-point correspondence, they allow for a more flexible construction of hierarchical neighbourhoods by enabling selective aggregation within the network. For example, in [37], an attention mechanism is used to sample the most critical points, enabling the network to better correspond points over time. In [8], attention is incorporated into the spatio-temporal point aggregation, assigning greater weight to points that are more similar to the target point during the feature aggregation. It is worth noting that these attention-based works learn the attention of a point relative to the features of other points, with the goal of improving the extracted features. We, on the other hand, propose to learn the attention of a point relative to the features of each hierarchical level, with the goal of refining the predicted motion.
Although the methods presented above have demonstrated their ability to extract features from point cloud sequences, they suffer from several drawbacks when specifically applied to the point cloud prediction task, which is the focus of this article. For instance, methods such as PointPWC-Net [45] learn a motion flow between two adjacent frames instead of learning a future motion to predict the next frames, preventing the model from capturing long-term movements. Other methods, such as PSTNet [9], are able to capture long-term correlations by processing all the sequence frames simultaneously. While this is an effective approach for classification or segmentation tasks, the memory required to process all the frames simultaneously prevents this approach from being scaled to long sequences or applied to iterative prediction tasks. These drawbacks led to the incorporation of point-based hierarchical architectures into RNNs or their variants, e.g., Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU). These types of models are designed to model sequential data, taking only one frame as input at each iteration. The key characteristic of RNNs is their hidden states, which act as a memory. The states store information from prior inputs and are continuously updated. As a result, the output of RNNs depends not only on the current input but also on the prior elements within the sequence. A pioneer of this framework, PointRNN [7] learns dynamic features from spatio-temporal neighbourhoods between two adjacent frames. The learned dynamic features are then used as states storing the points' history of movements, both to learn the features at the next iteration and to predict future movements. This methodology inherits the ability of RNN models to capture the long-term dynamic behaviour of sequential data while having low memory requirements. Following this approach, several works [7, 25, 44, 48, 49] combined RNN cells or their variants into hierarchical architectures to model point cloud sequences. These point-based hierarchical RNN architectures have been highly successful in modelling all types of irregular data sequences, from traffic networks [33] to points-of-interest in images [13], and are currently the state-of-the-art approach for iterative point cloud prediction. However, the majority of current point-based methods proposed in the literature focus on predicting the motion of point clouds of rigid objects, leaving the unique challenges associated with predicting point clouds of deformable objects overlooked.
It is worth noting that while deformable objects have been overlooked in point-based processing, they have been widely studied in robotics [16, 21, 47] and physics simulations [20]. In these fields, a point cloud representation is commonly used in conjunction with prior knowledge of the object's physics or structure. In [16, 21, 47], human body motions are modelled using a skeleton-based representation. In [20], fluid and elastic object dynamics are modelled by imposing physical behaviour. In our work, no prior knowledge is given: we are given neither a skeleton representation of the human body nor physical laws governing the points' behaviour. Our work aims at addressing the gap in point cloud processing methods when handling deformable objects. To this end, we first identify the current challenges caused by such objects and then develop models specifically designed to handle them.

2.2 Hierarchical Point-based RNN Architecture for Point Cloud Prediction

In this section, we present an architecture that characterizes the state-of-the-art hierarchical RNN framework used for point cloud prediction. We will use this model to identify the key challenges of the current state of the art (Section 3) and to highlight the novelty of the solutions proposed in this article (Section 4). Table 1 summarizes the main notation used throughout the article. Without loss of generality, we describe the iterative prediction framework depicted in Figure 2. Given a point cloud sequence \(\mathcal {P}\), at each iteration the network processes one input point cloud \(P_t\in \mathbf {R}^{N \times 3}\) and outputs the prediction of the point cloud at the next time step \(\hat{P}_{t+1}\). The framework can be described by three main phases:
Table 1.
Terminology: Description
level: network layer extracting dynamic features at a specific resolution.
spatial features: vectors describing the point's local geometric structure.
dynamic features: vectors describing the point's dynamic behaviour.
Parameter: Description
\(\mathcal {P}, T\): sequence of point clouds, and number of point clouds (frames) in the sequence.
\(l, L, k\): level, total number of levels, and number of points per neighbourhood.
\(N, N^l\): original number of points and number of points at level l.
\(P^l_t,\; p_{i,t} \in P_t\): point cloud and Cartesian coordinates of point i.
\(\hat{P}^l_t,\; \hat{p}_{i,t} \in \hat{P}_t\): predicted point cloud and Cartesian coordinates of predicted point i.
\(S^l_t,\; s^l_{i,t} \in S^l_t\): point cloud spatial features and spatial feature of point i.
\(D^l_t,\; d^l_{i,t} \in D^l_t\): point cloud dynamic features and dynamic feature of point i.
\(M_t,\; m_{i,t} \in M_t\): point cloud motion vectors and motion vector of point i.
\(D^{\text{Final}}_t,\; d^{\text{Final}}_{i,t} \in D^{\text{Final}}_t\): point cloud final dynamic features and final dynamic feature of point i.
\(\Theta _{\text{FP}}, \Theta _\text{S}, \Theta _\text{D}, \Theta _\text{R}, \Theta _\alpha\): learnable network weights.
\(G^\text{C}_t, G^{\text{ST},l}_t\): coordinate graph and spatio-temporal graph.
\(\alpha ^{l}_{i}\): attention value of point i for the feature of level l.
Table 1. Terminology and Notation
Fig. 2.
Fig. 2. Generic state-of-the-art framework for point cloud prediction for the iteration at time t. The architecture is composed of a Dynamic Extraction (DE) phase, a Feature Propagation (FP) phase and a prediction phase.
(1) Dynamic Extraction (DE) phase: the network processes the input point cloud \(P_t\) and extracts the point cloud dynamics as L levels of hierarchical features (\(D^1_t, \ldots , D^L_t\)).
(2) Feature Propagation (FP) phase: combines the learned features from the multiple levels into a single final dynamic feature \(D^{\text{Final}}_t\).
(3) Prediction phase: the final features are converted via a fully-connected layer into motion vectors \(M_t\) and added to the input point cloud \(P_t\) to predict the next point cloud \(\hat{P}_{t+1}\).
We now describe the DE and FP phases in more detail. Being straightforward, the prediction phase is not described further.
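As a reference for the remainder of this section, the following is a minimal sketch of one iteration of this generic framework; the module names (sg_modules, rnn_cells, fp_module, fc_head) and their interfaces are illustrative assumptions rather than the interface of any specific implementation.

```python
def predict_next_frame(P_t, sg_modules, rnn_cells, fp_module, fc_head):
    """One iteration of the generic framework: DE -> FP -> prediction."""
    coords, feats = P_t, None
    level_features = []
    for sg, cell in zip(sg_modules, rnn_cells):   # DE phase, L levels
        coords, feats = sg(coords, feats)         # downsample and group (SG module)
        feats = cell(coords, feats)               # dynamic features D^l_t at level l
        level_features.append(feats)
    D_final = fp_module(level_features)           # FP phase: combine hierarchical features
    M_t = fc_head(D_final)                        # per-point motion vectors M_t
    return P_t + M_t                              # \hat{P}_{t+1} = P_t + M_t
```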

2.2.1 Dynamic Extraction (DE) Phase.

Depicted on the left part of Figure 2, the DE phase consists of multiple sequential RNN cells, for a total of L levels (in the figure \(L= 3\)). Before being processed by each RNN cell, the point cloud is downsampled by a Sampling and Grouping (SG) module, as described in [31]. At each RNN cell, a dynamic feature is extracted for each point by aggregating information from the point's spatio-temporal neighbourhood. In the majority of methods [22, 25, 29], the neighbourhood of each point is defined as the k nearest neighbour (k-nn) points in the previous frame, where proximity is measured using the Euclidean distance between the points' 3D coordinates. The RNN cells are stacked sequentially so that the dynamic features learned by one RNN cell are the input of the next RNN cell. It is worth noting that the successive sampling, which results in a sparser point cloud at later levels/RNN cells, is responsible for the creation of hierarchical neighbourhoods with a progressively larger geometric distance between points. Thus, the first level (\(l=1\)) learns local dynamic features \(D^1_t\) from small-scale neighbourhoods, whereas the last level \(l=L\) learns global dynamic features \(D^L_t\) observing large-scale neighbourhoods.
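The downsampling step of the SG module is commonly implemented with farthest point sampling (FPS), as in [31]. The following is a minimal numpy sketch of FPS, included only to make the sampling step concrete; it is not the authors' implementation.

```python
import numpy as np

def farthest_point_sampling(points, n_samples):
    """Select n_samples points that are maximally spread out over the cloud.

    points: (N, 3) array of coordinates. Returns the indices of the sampled
    subset, which the SG module then uses as neighbourhood centroids.
    """
    N = points.shape[0]
    selected = np.zeros(n_samples, dtype=int)
    dist = np.full(N, np.inf)
    selected[0] = np.random.randint(N)               # arbitrary first centroid
    for s in range(1, n_samples):
        diff = points - points[selected[s - 1]]
        dist = np.minimum(dist, (diff ** 2).sum(axis=1))
        selected[s] = int(dist.argmax())             # farthest point from the current set
    return selected
```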

2.2.2 Feature Propagation (FP) Phase.

Once the DE phase has learned the features from all the levels (\(D_t^{1}, \ldots , D_t^{L}\)), the FP phase combines them into a single final feature (\(D_t^{Final}\)). Currently, the most popular architecture for feature combination is the original architecture proposed in PointNet++ [31], which is also found in most state-of-the-art methods without significant differences. We will refer to this architecture as the state-of-the-art Classic-FP (depicted in the green side of Figure 2). In the Classic-FP, the feature combination is performed by hierarchically propagating the features from the higher levels to the lower levels using several FP modules [31]. At each module, the sub-sampled features from the higher level are first interpolated to the same number of points as the lower level. The interpolation is done by a weighted aggregation of the features of the three closest points j in the sub-sampled point cloud, as follows:
\begin{equation} \tilde{d}^{l}_{i,t} = \frac{\sum _{j=1}^3 dist_{ij,t} \times d_{j,t}^{l+1} }{ \sum _{j=1}^3 dist_{ij,t} }, \quad \; dist_{ij,t} = \frac{1}{||p^l_{i,t}- p^{l+1}_{j,t}||^2 } , \end{equation}
(1)
where \(\tilde{d}^{l}_{i,t} \in \tilde{D}_t\) are the features interpolated from the number of points at level \(l+1\) to the number of points at level l. The interpolated high-level features are then concatenated, via a skip connection, with the lower-level features at the same number of points. The concatenation is processed by a point-based network that processes each point independently via shared weights \(\Theta ^{l}_{FP}\) as \(D^{l^{\text{FP}}}_{t} = \text{ReLU}\,(\Theta ^{l}_{FP}\,\lbrace D_{t}^{l} ; \tilde{D}_{t}^{l} \rbrace)\). The process is repeated in a hierarchical manner until the features from all the levels have been combined into the final features (\(D_t^{\text{Final}})\).
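For concreteness, the inverse-distance interpolation of Equation (1) can be sketched in a few lines of numpy; this is a reference sketch only, with an added epsilon to guard against coincident points (an assumption not stated in the equation).

```python
import numpy as np

def interpolate_features(p_low, p_high, d_high, eps=1e-8):
    """Inverse-distance interpolation of Equation (1).

    p_low:  (N_l, 3)     point coordinates at level l (denser).
    p_high: (N_{l+1}, 3) point coordinates at level l+1 (sparser).
    d_high: (N_{l+1}, C) dynamic features learned at level l+1.
    Returns (N_l, C) features upsampled to the level-l points.
    """
    # squared distances from every level-l point to every level-(l+1) point
    sq = ((p_low[:, None, :] - p_high[None, :, :]) ** 2).sum(-1)
    idx = np.argsort(sq, axis=1)[:, :3]                      # three closest points j
    w = 1.0 / (np.take_along_axis(sq, idx, axis=1) + eps)    # dist_{ij,t}
    w = w / w.sum(axis=1, keepdims=True)                     # normalized weights
    return (w[..., None] * d_high[idx]).sum(axis=1)          # weighted feature average
```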

3 Challenges and Limitations

The hierarchical point-based RNN framework presented in the previous section suffers from several limitations when facing the challenge of processing deformable objects such as human-body-like sequences. In this article, we explain why those challenges arise and how to overcome them. In the following, we disentangle the challenges of current models into (i) challenges in processing/predicting objects with deformable shapes (Section 3.1); (ii) challenges in predicting complex motions (Section 3.2). Taking advantage of the understanding built in this section, in Section 4 we introduce our proposed method, built to overcome the main limitations identified here.

3.1 Challenges in Processing Deformable Shapes

The main challenges encountered in processing and predicting objects with deformable shapes, such as clothing, food, or human bodies, are (i) having a semantically-meaningful point-to-point correspondence (used to learn dynamic features); (ii) avoiding shape distortion (which is highly noticeable in 3D objects and therefore has a strong negative impact on point cloud prediction quality).
The challenge of establishing point-to-point correspondence is present in any point cloud processing task, but it is clearly exacerbated in the case of deformable 3D objects. The majority of current works follow the same strategy as PointRNN [7] and assume that the points in the current frame are matched with points in close proximity in the previous frame. This proximity is defined in the 3D Euclidean space. However, in 3D deformable objects, points that are geometrically close in space are not necessarily semantically correlated and do not necessarily belong to the same segment of the object. Figure 3 shows three examples of how matching based on geometric proximity can lead to the creation of misleading neighbourhoods. This means that point correspondence across time is challenged by the mismatch between Euclidean proximity and semantically-meaningful proximity.
Fig. 3.
Fig. 3. Example of matching points across time using geometric coordinates for three sequences: Running, Diving, Jumping from [15] (these sequences are examples of particularly large motions, chosen for visualization purposes). The dashed circles show a zoom-in of the regions where grouping using coordinates would create incorrect neighbourhoods. For example, in the Running sequence, the points in the foot at time t are incorrectly matched with the points in the lower leg at \(t-1\).
On the other hand, current methods often struggle to preserve the predicted point cloud shape. This is mainly due to the fact that a separate motion vector is learned for every point with no clear semantic constraints. If these motion vectors vary significantly among neighbouring points, the result is a prediction with a deformed shape. This issue can be tackled by imposing hard shape constraints, such as learning a single motion vector for all the points in a region. However, this strategy can only be applied to rigid objects. In deformable objects, the object shape changes according to different postures, meaning points must be allowed to have separate motions. Thus, it is important to strike a balance between preserving the shape and having enough per-point motion flexibility to predict possible shape variations. The key to achieving this balance is to capture the underlying semantic structure and take it into account as a soft shape constraint during the learning process.
Both challenges of point correspondence and shape deformation can be summarized in the following limitation: Lack of structural relationship between points in point cloud prediction (Limitation 1). Learning and exploiting this prior in the learning process is one of the novelties of our proposed model and it will be specifically addressed by learning a semantically-meaningful graph and exploiting this graph when extracting features (via graph-RNN cell).

3.2 Challenges in Processing Complex Motions

A second key challenge present in processing 3D dynamic objects such as the human body is that the movement of such objects is usually a complex motion. Complex motions refer to movements that involve a combination of multiple degrees of freedom, such as translation, rotation, and deformation, which are applied to different parts of the object independently. This is typical of deformable objects or any 3D objects with disjoint components, each of them with its own movement. As an example, consider a point cloud representing a human body running forward (Figure 4(a)-Man-Running). While the full body moves forward (translation), the person swings their arms (rotation), and their hand bends from an open to a closed position (shape change). The complex nature of such movements makes them challenging to accurately capture and predict. Based on a novel visualization technique that we introduced in our previous work [12] on explainability, we now highlight key limitations of the current architectures. Specifically, we show how complex motions can be seen as a sum of low-, medium- and high-level motions, leading to the understanding that the current model suffers from the following main limitation: the fixed combination of hierarchical features in the prediction phase (Limitation 2). We now explain this limitation in more detail.
Fig. 4.
Fig. 4. Hierarchy of dynamic features shown as motion vectors for two input sequences (Man-Running and Woman-Running). For each sequence, the figure shows the input dynamic point cloud, the multi-scale neighbourhoods at different levels, and the motion vectors learned at each level of the network.
In our explainability work [12], we have demonstrated that motion vectors inferred by hierarchical architectures (Figure 2) can be disentangled into individual motion vectors produced at each hierarchical level, as follows:
\begin{equation} M_t =\sum _{l=1}^{L} M_t^l, \quad \text{where} \: M^l_t = Classic^{l}_{FP} \left(D^l_t \right). \end{equation}
(2)
where \(Classic^{l}_{FP}\) is the function that replicates the operation of the Classic-FP in a disentangled manner, converting the feature learned at each level l into an individual motion vector \(M_t^l\), and \(M_t\) is the final predicted motion vector output by the network. This leads to the interpretation that current approaches in the literature model complex motions as a combination of local and global motions, which are learned as hierarchical dynamic features. This is illustrated in Figure 4, which depicts the dynamic features as motion vectors and the hierarchical neighbourhoods given two point cloud sequences as input to a state-of-the-art prediction architecture (presented in Figure 2) with three levels (\(L=3\)) [12]. In both sequences, it can be seen that the lower level learns features only by looking at points in a small area (top gold squares in the figure). In contrast, the higher level learns features by considering a sparser set of points in a large area (bottom blue squares in the figure). In the example in Figure 4(a), in which the runner’s foot performs a complex motion, it can be observed that the lowest level captures small and diverse motions (e.g., rotation of the heel) \(M_t^1\), while the highest level learns the forward motion of the entire body \(M_t^3\).
This interpretation of features as motion vectors can be generalized to the majority of current methods because, while they differ in the feature extraction process, they all share the Classic-FP strategy to perform the motion reconstruction process. As such, we build on this explainability technique to identify the limitations of the current state-of-the-art framework in predicting complex motions. Namely, the motion vector prediction is obtained by combining the dynamic features from the different levels via a learned weighted combination. However, each point's motion is obtained using the same set of combination weights \([\Theta ^{1}_{\text{FP}}, \ldots , \Theta ^{L}_{\text{FP}}]\) for all points, frames, and sequences. As a result, for every point, regardless of its position in space and time, the predicted motion is obtained by the same fixed combination of local, medium and global motions. Based on this technique, we can understand that (i) different features can be associated with the different levels of motion forming the complex resultant motion, and (ii) knowing that different parts of the object might be subject to different types of movements highlights the strong limitation of having the same combination of motion levels. Specifically, while a set of weights might lead to the appropriate combination of the motion vectors in Figure 4(a), in which a local movement is analysed (foot), it does not hold in the case of the “Woman-Running” sequence in Figure 4(b), in which a more global movement is highlighted (torso). The points in the lower torso perform a rigid forward movement corresponding to the global motion of the body, while the lower part of the body performs a quite dynamic rotation of the foot. This means that the global motion vector (pointing forward) alone would be sufficient to describe the movement of the torso. However, local features (hence local motions) cannot be neglected, since this would mean neglecting the local motions in parts with strong local movement, such as the foot. As a result, in Figure 4(b) the local motion vectors (\(M_t^1\)) clearly lose any motion interpretation and become instead random vectors mainly used to compensate for the erroneous addition of multiple motion vectors in this part of the body.
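The disentanglement of Equation (2) can be probed, for instance, by propagating one level's features at a time through the trained Classic-FP and prediction layers while zeroing out the others. The sketch below is only one possible way to perform such a probe; the exact procedure used in [12] may differ, and `classic_fp` and `motion_head` are hypothetical stand-ins for the trained modules.

```python
import torch

def per_level_motions(level_features, classic_fp, motion_head):
    """Probe the motion contribution of each hierarchical level (cf. Eq. (2)).

    level_features: list [D^1_t, ..., D^L_t] of per-level dynamic features.
    classic_fp:     callable replicating the Classic-FP propagation.
    motion_head:    callable mapping final features to motion vectors.
    """
    motions = []
    for l, _ in enumerate(level_features):
        masked = [f if i == l else torch.zeros_like(f)      # keep only level l
                  for i, f in enumerate(level_features)]
        motions.append(motion_head(classic_fp(masked)))     # per-level motion M^l_t
    return motions
```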
It is worth mentioning that while this understanding might appear straightforward, to the best of our knowledge, this is the first work explaining PointRNN and similar hierarchical architectures when processing 3D deformable objects, showing the limitation in adopting a fixed combination of hierarchical features in the prediction phase. In the next section, we propose an architecture that overcomes this limitation by introducing an attention-based mechanism in the prediction phase.

4 Proposed AGAR Method

To address the limitations identified in the previous section, we now propose an improved architecture for point cloud prediction, depicted in Figure 5. The proposed architecture preserves the state-of-the-art global framework composed of a DE, FP and prediction phase. However, we propose to replace current state-of-the-art modules with improved versions to leverage the point cloud semantic structure during the DE phase and to perform an adaptative combination of dynamic features in the FP phase.
Fig. 5.
Fig. 5. Proposed AGAR prediction architecture composed of DE, FP and prediction phase. In the DE phase, the architecture consists of an SS-GNN module followed by graph-RNN cells. The SS-GNN module extracts spatial features from the point cloud, which are then utilized by the graph-RNN cells to learn dynamic features. In the FP phase, the state-of-the-art FP modules are replaced by a novel Adaptative feature combination module able to dynamically combine hierarchical features according to the scene.

4.1 Addressing Limitation 1: Inclusion of Structural Relationships between Points

To overcome the lack of a geometrical prior with meaningful spatial/semantic information, we propose an initial graph neural network, denoted Spatial-Structure GNN (SS-GNN), that processes each frame to extract, for each point, spatial features that carry local topological information. From the learned spatial features, we then construct a spatio-temporal graph that incorporates the points' structural/semantic information and uses that information to build representative neighbourhoods of points. The spatio-temporal graph is processed by the proposed graph-RNN cells, which extract the point cloud behaviour as dynamic features. Below, we present each of the proposed modules in detail.

4.1.1 Spatial-Structure GNN (SS-GNN).

Given an input point cloud \(P_t\), for each point i the SS-GNN learns a spatial feature \(s_{i,t}\) describing the point's local geometric structure. To learn these features, the SS-GNN starts by constructing a coordinate graph \(\mathcal {G}^{C}_t=(P_t,\mathcal {E}^C_t)\), taking the points \(P_t\) as vertices and building directed edges \(\mathcal {E}^C_t \in \mathbf {R}^{N \times k}\) between each point and its k-nearest neighbours based on the Euclidean distance. The SS-GNN is composed of three layers, each of which performs a graph message-passing convolution [46]. At the hth layer, for a target point i, all its neighbouring points \(j \in \mathcal {E}_i^C\) exchange a message along the edge connecting the two points. The message between points is obtained by processing the concatenation of the target point's spatial feature at the previous layer \(s_{i,t}^{h-1}\); the target point coordinates \(p_{i,t}\); and the geometric displacement between the target point i and its neighbours j (\(\Delta p_{ij}\)). A symmetric function is then applied to aggregate all the messages into an updated feature for the target node. More formally, the message between two nodes (\(m^{h}_{ij, t}\)) and the output spatial features (\(s^{h}_{i,t}\)) are obtained as follows:
\begin{align} m^{h}_{ij,t} &= \Theta _S^h\left(s_{i,t}^{h-1} \; ; p_{i,t} \; ; \Delta p_{ij} \right), \end{align}
(3)
\begin{align} s^{h}_{i,t} &= \bigoplus _{ j \in \mathcal {E}_i^C} \left\lbrace m^{h}_{ij,t} \right\rbrace , \end{align}
(4)
where \(\Theta ^h_S\) is a set of learnable parameters at layer h and “\(;\)” identifies the concatenation operation. The \(\bigoplus\) represents an element-wise max pooling function that acts as an activation function by introducing non-linearity. It is important to note that the above operation does not involve spatio-temporal aggregation. The SS-GNN processes geometric relations between a point and its neighbourhood at the same time step to learn the point’s local topology. This information is used to build the spatio-temporal graph processed by subsequent modules.
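A minimal PyTorch sketch of one such message-passing layer is given below; the tensor shapes and the use of a single linear layer for \(\Theta_S^h\) are simplifying assumptions made for illustration.

```python
import torch
import torch.nn as nn

class SSGNNLayer(nn.Module):
    """One SS-GNN message-passing layer, cf. Equations (3)-(4)."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        # Theta_S^h applied to the concatenation [s_i ; p_i ; delta p_ij]
        self.theta = nn.Linear(in_dim + 3 + 3, out_dim)

    def forward(self, p, s, nbr_idx):
        # p: (N, 3) coordinates, s: (N, in_dim) features, nbr_idx: (N, k) neighbour indices
        k = nbr_idx.shape[1]
        p_j = p[nbr_idx]                              # (N, k, 3) neighbour coordinates
        delta_p = p_j - p[:, None, :]                 # geometric displacement to neighbours
        s_i = s[:, None, :].expand(-1, k, -1)         # target feature, repeated per edge
        p_i = p[:, None, :].expand(-1, k, -1)         # target coordinates, repeated per edge
        msg = self.theta(torch.cat([s_i, p_i, delta_p], dim=-1))   # messages, Eq. (3)
        return msg.max(dim=1).values                  # element-wise max pooling, Eq. (4)
```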

4.1.2 Graph-RNN.

Each graph-RNN cell, at level l, takes as input the point coordinates, spatial features and dynamic features (\(P^{l}_t\), \(S^{l}_t\), \(D^{l}_t\)) and learns updated dynamic features \(D^{l+1}_t\) describing the point's dynamic behaviour. To this end, using the output from its previous iteration at time \(t-1\) in a recurrent manner, the graph-RNN cell builds a spatio-temporal graph \(\mathcal {G}^{\text{ST},\,l}_t =(P^l_{t},\mathcal {E}_t^{\text{ST}})\) between the point clouds \(P^l_t\) and \(P^l_{t-1}\). Unlike the coordinate graph, which is built on geometric distances, the spatio-temporal graph is built based on the spatial feature distance. Specifically, for each point i at time t, we calculate the distance between the point's spatial feature \(s_{i,t}\) and the spatial features of the other points in the present frame \(s_{j,t}\) and in the past frame \(s_{j,t-1}\). Each point i is connected to its k closest points in the present time t and its k closest points in the past time \(t-1\). By connecting points that share a common local structure, we are able to establish correspondence between points that, despite not being close in the Euclidean space, share semantic similarities and therefore will most likely share motion vectors. Figure 6 depicts an example of a spatio-temporal graph constructed between two frames in a fast-moving sequence of a person running (some edges are hidden for image clarity). The dashed boxes in Figure 6 show the edges built for the points in the foot when using the spatial feature distance (our approach; upper box, in red) and the edges built if we had used the coordinate distance (state-of-the-art approach; lower box, in blue). The edges built on spatial feature similarity (in red) can correctly match points across time, while edges based on geometric proximity would lead to incorrect grouping. As a result, the network learns dynamic features from neighbourhoods of points that share similar semantic/structural properties.
Fig. 6.
Fig. 6. Spatio-Temporal graph \(G_{st}\), with some temporal edges coloured in red; Dashed box depicts the difference between building the \(G_{st}\) using spatial features or using point coordinates.
Similarly to the SS-GNN, the graph-RNN extracts dynamic features by performing a message-passing convolution between a point and its neighbours in the spatio-temporal graph. For each target point, we learn a message for each edge by processing the concatenation of the target point's dynamic feature (\(d_{i,t}^{l}\)); the neighbour point's dynamic feature (\(d_{j,t^{\prime }}^{l}\)), where \(t^{\prime }\) can be either t or \(t-1\); and the coordinate difference (\(\Delta p_{ij}\)), spatial feature difference (\(\Delta s_{ij}\)) and temporal difference (\(\Delta t_{ij}\)) between the target and neighbour point. All the messages are aggregated into a single representation to update the target point's dynamic features \(d^{l+1}_{i,t}\). The operation can be formalized as
\begin{align} m^{l}_{ij,t} &= \Theta _D^{l}\left(d_{i,t}^{l} ; \;d_{j,t^{\prime }}^{l}; \; \Delta p_{ij}; \; \Delta s_{ij}; \; \Delta t_{ij}\right), \end{align}
(5)
\begin{align} d^{l+1}_{i,t} &= \bigoplus _{ j \in \mathcal {E}_i^{ \text{ST} } } \left\lbrace m^{l}_{ij,t}\right\rbrace . \end{align}
(6)
The learned spatial features are used not only to connect points with similar spatial characteristics in both the present and past frames but are also directly incorporated into the graph-RNN convolution. As a result, the graph-RNN learns a point's dynamic behaviour taking into account its structural relations to neighbouring points. This inclusion of point spatial features in the graph-RNN cell convolution allows the network to learn more representative dynamic features and helps to preserve the predicted point cloud shape.
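The key difference from coordinate-based grouping is that the spatio-temporal neighbourhoods are formed in feature space. The following PyTorch sketch illustrates this construction; function and variable names are illustrative and not taken from the released code.

```python
import torch

def spatio_temporal_graph(s_t, s_prev, k):
    """Build spatio-temporal neighbourhoods from spatial-feature distances.

    s_t:    (N, C) spatial features of the points in frame t.
    s_prev: (N, C) spatial features of the points in frame t-1.
    Returns, for each point in frame t, the indices of its k closest points
    in frame t (excluding itself) and its k closest points in frame t-1,
    where closeness is measured in feature space rather than 3D space.
    """
    d_now = torch.cdist(s_t, s_t)                               # within-frame feature distances
    d_past = torch.cdist(s_t, s_prev)                           # cross-frame feature distances
    idx_now = d_now.topk(k + 1, largest=False).indices[:, 1:]   # drop the trivial self edge
    idx_past = d_past.topk(k, largest=False).indices
    return idx_now, idx_past
```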

4.2 Addressing Limitation 2: Adaptative Feature Combination

We now address the current framework's limitation in generating complex motions, caused by the fixed combination of dynamic features in the FP phase. To overcome this issue, we propose to replace the FP modules with an attention-based module, denoted Adaptative feature combination, represented in detail in Figure 7. Instead of using a fixed combination, the proposed module dynamically assigns an attention value to each level based on the learned features. This attention value determines the amount of influence each level has on the predicted motion of the point.
Fig. 7.
Fig. 7. Adaptative feature combination module. Given a framework with three hierarchical levels, the module takes as input dynamic features \(D_t^1, D_t^2, D^3_t\) and outputs a single final dynamic feature \(D^{Final}\).
In detail, given an architecture with L hierarchical levels (\(L=3\) in the example in Figure 7), the proposed Adaptative feature combination module takes as input the dynamic features (\(D^1_t, D^2_t, \ldots , D^L_t\)) learned in the DE phase and combines them into a single final dynamic feature (\(D^{\text{Final} }_t\)). However, we recall that each RNN cell is preceded by a downsampling module, hence each feature needs to be up-sampled before being combined. To do this, the proposed module first interpolates the dynamic features to the same number of points as the first level and processes each one independently through a refinement layer \(\Theta ^l_{R}\) to ensure the features are on a similar scale, as follows:
\begin{equation} \psi \left(d^l_{\tilde{i},t}\right)= \sigma \left(\Theta ^l_{R} \: \left\lbrace d^{l}_{\tilde{i},t} \right\rbrace \right), \end{equation}
(7)
where \(d^{l}_{\tilde{i},t}\) are the features interpolated to the original number of points, \(\psi (d^l_{\tilde{i},t})\) are the output refined features and \(\sigma\) is the activation function. To learn the scalar attention values \(\alpha ^l_{i,t}\), the network concatenates the refined features from all levels and processes them through learnable parameters \(\Theta ^{l}_{\alpha }\) as follows:
\begin{equation} \alpha ^{l}_{i,t} =\sigma \left(\Theta ^{l}_{\alpha } \left\lbrace \psi \left(d^1_{i,t}\right); \psi \left(d^{2}_{\tilde{i},t}\right); \psi \left(d^{3}_{\tilde{i},t}\right)\right\rbrace \right). \end{equation}
(8)
The refined dynamic features \(\psi (d^{l}_{i,t})\) are then multiplied by their respective attention value. Hence, the \(\alpha\) value reflects the influence that the learned feature has on the predicted motion, allowing the network to adjust the contribution of each level to the predicted motion.
\begin{equation} \Psi \left(d^l_{i,t}\right)= \psi \left(d^{l}_{i,t}\right) \times \alpha ^{l}_{i,t}. \end{equation}
(9)
Lastly, the post-attention dynamic features \(\Psi (d^l_{i,t})\) are combined by a single learnable layer (\(\Theta _{FC}\)) into the final dynamic features \(d^{\text{Final}}_{i,t} \in D_t^{\text{Final} }\).
\begin{equation} d^{Final}_{i,t} = \sigma \left(\Theta _{FC} \left\lbrace \Psi \left(d^1_{i,t}\right); \cdots ; \Psi \left(d^L_{\tilde{i},t}\right)\right\rbrace \right). \end{equation}
(10)
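Equations (7)-(10) can be summarized in the following PyTorch sketch. The choice of ReLU and sigmoid for the generic activations \(\sigma\), and the use of single linear layers for \(\Theta_R^l\), \(\Theta_\alpha^l\) and \(\Theta_{FC}\), are assumptions made for illustration; the per-level features are assumed to be already interpolated to a common number of points.

```python
import torch
import torch.nn as nn

class AdaptativeFeatureCombination(nn.Module):
    """Sketch of the Adaptative feature combination module, cf. Eqs. (7)-(10)."""

    def __init__(self, num_levels, feat_dim):
        super().__init__()
        self.refine = nn.ModuleList(nn.Linear(feat_dim, feat_dim)
                                    for _ in range(num_levels))      # Theta_R^l
        self.attn = nn.ModuleList(nn.Linear(num_levels * feat_dim, 1)
                                  for _ in range(num_levels))        # Theta_alpha^l
        self.fc = nn.Linear(num_levels * feat_dim, feat_dim)         # Theta_FC

    def forward(self, level_features):
        # level_features: list of (N, feat_dim) tensors, one per hierarchical level
        refined = [torch.relu(layer(d))                              # Eq. (7)
                   for layer, d in zip(self.refine, level_features)]
        stacked = torch.cat(refined, dim=-1)                         # all levels, per point
        alphas = [torch.sigmoid(layer(stacked))                      # Eq. (8): one scalar per point
                  for layer in self.attn]
        weighted = [a * r for a, r in zip(alphas, refined)]          # Eq. (9)
        return torch.relu(self.fc(torch.cat(weighted, dim=-1)))      # Eq. (10): D^Final_t
```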

4.2.1 Explainability of the Adaptative Feature Combination Module.

A key benefit of the Adaptative feature combination module is that its underlying mechanism can be visualized and explained. This can be seen in Figure 8, which illustrates how the proposed module combines dynamic features to produce motion vectors given two point cloud sequences (Man-Running and Woman-Dancing). For each sequence, Figure 8 depicts the PCA of the dynamic features learned at the DE phase; the learned attention values per point; the individual motion vectors2 produced at each level in the proposed Adaptative architecture and in the Classic-FP architecture (presented in Section 2.2 and Figure 2).
Fig. 8.
Fig. 8. Adaptative Features Combination operation. Example of how the proposed module adaptatively combines local and global motion for different points, and comparison with Classic-FP.
In the Man-Running sequence depicted in Figure 8(a), at the first level the network assigns high attention values (\(\alpha _t^1\)) to the arms and low attention values to the points in the rest of the body. As a result, the predicted motion of the points in the arms is heavily influenced by local motions, while in the rest of the body the local motions have a very small influence on the prediction. The network exhibits similar selective behaviour at the second level, assigning higher attention to the points in the left foot and thereby increasing the influence that the dynamic features \(D^2_t\) have on the motion of the foot. At the third and final level, the network learns non-zero attention values \(\alpha ^3_t\) for the majority of the body. As a result, in the Man-Running sequence, the global motion is the primary contributor to the predicted motion of the points, with the exception of the arm and foot regions, where the prediction is given by a combination of motions from multiple levels.
Similar considerations can be derived from the second example, Woman-Dancing, in which the learned global motions are an accurate descriptor for the majority of the points, except for certain regions with more local movements. The Adaptative feature combination module is able to distinguish between regions and properly combine the different levels of motion based on this distinction. It is worth noting that different attention values are learned for the Man-Running and Woman-Dancing sequences, demonstrating the network’s ability to adapt the attention according to the characteristics of the input data. In summary, the proposed Adaptative feature combination module combines features in an adaptative manner, allowing it to control the composition of global and local motions that best describes the motion of each point. This adaptative operation can be understood and explained through visualization, which may be beneficial for future research on developing more expressive architectures.

5 Implementation

In this section, we describe the datasets and implementation details of our proposed method.

5.1 Datasets

In our experiments, we considered the following datasets:
Mixamo Human Bodies Activities: Synthetic human motions generated following [36], using the online service Mixamo [15] and the Blender software [1]. Despite being synthetic, the dataset provides an accurate representation of real-world movements. We create 152 test sequences and 9,375 training sequences (further augmented by randomly changing the movement direction, speed, and the body's starting position during training). Each training sequence consists of approximately 50 frames, and for each sequence we sample \(T=12\) consecutive frames as inputs to the model during training. Similarly, the testing sequences are composed of 12 frames. Each frame in the dataset contains a point cloud consisting of 1,000 points, which we found sufficient for capturing a rich and detailed representation of the human body.
CWIPC-SXR Human motions [32]: Real-world human motions in social settings. The dataset consists of 21 dynamic sequences. The first 60 frames of each sequence are temporally downsampled to 10 fps and spatially downsampled to 1,000 points, resulting in 21 sequences of 15 frames each. Given its reduced size, this dataset is not used for training but only for testing.
JPEG Pleno Voxelized Full Bodies [5]: Real-world human bodies. The dataset is composed of four sequences known as longdress, loot, redandblack, and soldier. Each sequence is downsampled to 12 frames and 1,000 points. This dataset is used only for testing.
Moving MNIST Point Cloud: Created by converting the MNIST dataset of handwritten digits into moving point clouds, as in previous works [7]. The sequences are generated by applying random rigid motions to each digit. Each sequence contains 20 frames (\(T = 20\)) with either 128 points (1 digit) or 256 points (2 digits).
Argoverse [3]: Large-scale automotive dataset. We use the same train and test data as in PointRNN [7]. The dataset contains 910 training sequences and 209 test sequences. Each sequence contains 20 frames (\(T = 20\)), and each frame is downsampled to 1,024 points.
MSRAction3D [19]: Real-world human motions performing annotated actions. The dataset consists of 567 Kinect depth videos with 20 action categories. We sampled each point cloud to 1,024 points and used the same training and test conditions as in [9, 23].

5.2 Benchmarking

This subsection outlines the tasks, as well as the state-of-the-art benchmark methods used for comparison.

5.2.1 Prediction Task.

In the prediction task, we consider both short-term and long-term predictions. In short-term prediction, at each iteration, the network takes as input the ground-truth frame \(P_t\) to predict the next frame \(\hat{P}_{t+1}\). At the following prediction step, the network predicts \(\hat{P}_{t+2}\), having as input the ground truth \(P_{t+1}\). This is repeated until the end of the sequence. In long-term prediction, the predicted frame from the previous iteration \(\hat{P}_t\) is used as input to predict the next frame \(\hat{P}_{t+1}\); only the latter half (\(T/2\)) of the sequence is predicted using this strategy. For benchmarking, since point cloud prediction of human bodies is a mostly unexplored topic, the range of possible baseline methods against which to compare our work is limited. Moreover, many of the existing point-based RNN point cloud prediction methods designed for automobile scenes do not provide the necessary materials to be replicated. Therefore, besides selecting the most related works available, we adapted several methods that, while not originally designed for point cloud prediction, are well-recognized in the field of point cloud sequence processing. For the point cloud prediction task, we consider the following baseline models: (i) Copy-Last-Input, which simply copies the past point cloud frame instead of predicting it; (ii) PointPWC-Net [45], a hierarchical point-based architecture that extracts the motion flow between two frames; (iii) FlowStep3D [17], a hierarchical point-based architecture that extracts a learned motion flow between two frames via RNN cells; (iv) PSTNet [9], a hierarchical point-based architecture for action classification of human body sequences; (v) MoNet [25], an LSTM point-based approach with an attention mechanism for feature extraction; (vi) PointRNN [7] (k-nn), the point-based RNN architecture presented in Section 2.2. Both PointPWC-Net and FlowStep3D were originally designed to learn the motion flow between two frames. To extend these two models to the task of predicting future frames, we incorporate a prediction phase into their architectures. This prediction phase refines the extracted motion flow via fully connected layers and calculates a predicted point cloud at the next time step. Similarly, the PSTNet architecture, designed for action classification, is adapted for the prediction task by adding an FP phase (with Classic-FP) to propagate the learned features to the original number of points, followed by a prediction phase to generate a prediction of the point cloud at the next time step given the propagated features. To differentiate the adapted models from their original counterparts, we denote the adapted models for the prediction task as PointPWC-Net-pred, FlowStep3D-pred and PSTNet-pred, respectively.
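The two evaluation protocols differ only in which frame is fed back to the network. A minimal sketch follows, assuming the model exposes a reset_states() method and a forward pass returning per-point motion vectors (both assumed interfaces, not part of the released code).

```python
def evaluate_short_term(model, frames):
    """Short-term protocol: every prediction is made from a ground-truth frame."""
    model.reset_states()
    return [P_t + model(P_t) for P_t in frames[:-1]]     # compared against frames[1:]

def evaluate_long_term(model, frames):
    """Long-term protocol: the latter half of the sequence is predicted auto-regressively."""
    model.reset_states()
    half = len(frames) // 2
    current = None
    for P_t in frames[:half]:                            # warm-up on ground-truth inputs
        current = P_t + model(P_t)
    preds = []
    for _ in frames[half:]:                              # feed previous predictions back
        preds.append(current)
        current = current + model(current)
    return preds                                         # compared against frames[half:]
```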

5.2.2 Action Classification Task.

To study the generalizability of the proposed AGAR framework for dynamic feature learning, we extended its application to the classification task. In this task, AGAR takes a point cloud sequence as input and outputs a classification score. To adapt AGAR for the classification task, we discarded the FP phase and the prediction phase. Instead, the dynamic features from the last level are max-pooled to form a global feature, which is used to generate the classification score. We denote this architecture adapted for the classification task as AGAR-cls. Given that human action classification from point cloud sequences is a well-studied problem, we compare AGAR-cls to well-established methods such as MeteorNet [23], PSTNet [9] and P4Transformer [8] without adaptations.

5.3 AGAR Architecture Details

For the prediction and classification tasks, we implemented the AGAR and AGAR-cls (adapted for classification) architectures, each with three hierarchical levels \((L=3)\). In both cases, the SS-GNN in the first level consists of three layers with 64, 128, and 128 dimensions, respectively. Before each level, the input point cloud is downsampled to 250, 64, and 16 points, respectively. The number of points at each level was selected through experimentation. The pre-defined number of points means AGAR can effectively process high-resolution point clouds, since only the first farthest point sampling (FPS) operation is affected by the input size. Each level contains a graph-RNN cell that learns dynamic features with 128 dimensions. The number of nearest neighbours \((k)\) is 8 for all graph-RNN cells. All the models are trained using the Adam optimizer with a learning rate of \(10^{-4}\) for 500,000 iterations. In the training phase, we use a batch size of 16 for the Mixamo Human Bodies dataset, 32 for the MNIST and MSRAction3D datasets, and 4 for the Argoverse dataset.
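For reference, these hyperparameters can be collected in a single configuration; the dictionary below simply restates the values above (the key names are our own, not those of the released code).

```python
# Hyperparameters reported in Section 5.3; batch size depends on the dataset.
AGAR_CONFIG = {
    "levels": 3,                               # hierarchical levels (L = 3)
    "ss_gnn_dims": [64, 128, 128],             # dimensions of the three SS-GNN layers
    "points_per_level": [250, 64, 16],         # downsampling before each level
    "dynamic_feature_dim": 128,                # graph-RNN cell output dimension
    "k_neighbours": 8,                         # k for all graph-RNN cells
    "optimizer": "Adam",
    "learning_rate": 1e-4,
    "training_iterations": 500_000,
    "batch_size": {"Mixamo": 16, "MNIST": 32, "MSRAction3D": 32, "Argoverse": 4},
}
```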

5.4 Training and Metrics

The parameters of the AGAR architecture are trained end-to-end in a self-supervised fashion by comparing the predicted point cloud \(\hat{P}_{t+1}\) with the target point cloud \(P_{t+1}\). Unlike supervised methods [17, 23, 41, 45], which require the ground-truth motion flow to train the network, in a self-supervised setting the ground-truth data can be obtained from the input data itself. This technique allows us to train on datasets of deformable dynamic point clouds, such as the human body datasets [5, 15, 32], where annotated ground-truth motion vectors are not available.

5.4.1 Training Metrics.

To measure the difference between the predicted point cloud and the ground-truth point cloud during training, we employ the commonly used chamfer distance (CD) [14] and earth mover's distance (EMD) [2]. These metrics are defined as follows:
Chamfer distance (CD): The CD measures the distance between each point in the predicted point cloud and its closest point in the reference point cloud, and vice-versa:
\begin{align} d_{CD}(P, \hat{P}) &= \frac{1}{N}\sum _{p \in P} \min _{ \hat{p} \in \hat{P}} || p - \hat{p}||^2 + \frac{1}{N} \sum _{\hat{p} \in \hat{P}} \min _{ p \in P} || \hat{p} - p||^2. \end{align}
(11)
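A brute-force sketch of Equation (11) in PyTorch is shown below; this is our illustration, and acceleration details such as nearest-neighbour search structures are not specified in the text.

```python
import torch

def chamfer_distance(P, P_hat):
    """Symmetric chamfer distance between two point clouds, Eq. (11).

    P: (N, 3) reference points, P_hat: (M, 3) predicted points.
    """
    d = torch.cdist(P, P_hat) ** 2           # (N, M) squared pairwise distances
    forward = d.min(dim=1).values.mean()     # each point in P to its nearest in P_hat
    backward = d.min(dim=0).values.mean()    # each point in P_hat to its nearest in P
    return forward + backward
```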
Earth mover's distance (EMD): The EMD solves an optimization problem that finds the optimal point-wise bijection \(\theta : P \rightarrow \hat{P}\) between the two point clouds. The distance is then given by the distances between the points matched by this mapping, as follows:
\begin{align} d_{EMD}(P, \hat{P}) &= \min _{\theta :P \rightarrow \hat{P} } \sum _{p \in P} || p - \theta (p) ||^2. \end{align}
(12)
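For illustration, the exact EMD of Equation (12) can be computed with the Hungarian algorithm for equal-sized clouds, as sketched below. The use of SciPy here is our assumption; the exact assignment is \(O(N^3)\), and training pipelines typically rely on approximate, differentiable EMD implementations instead.

```python
import torch
from scipy.optimize import linear_sum_assignment

def earth_movers_distance(P, P_hat):
    """Exact EMD via an optimal point-wise bijection, Eq. (12).

    P, P_hat: (N, 3) tensors with the same number of points.
    """
    cost = (torch.cdist(P, P_hat) ** 2).detach().cpu().numpy()  # (N, N) squared distances
    rows, cols = linear_sum_assignment(cost)                     # optimal bijection theta
    return float(cost[rows, cols].sum())
```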
Although the EMD and CD metrics are commonly used in point cloud analysis, they may not always provide an accurate measure of similarity. The CD only considers the nearest neighbour of each point and does not take the global distribution of points into account. The EMD, on the other hand, tries to find a unique mapping between the two point clouds; however, in most cases such a one-to-one correspondence is realistically impossible, resulting in a measurement that is rarely correct for all points. Since CD and EMD measure different notions of similarity with different shortcomings, we use a combination of both metrics as the loss function to make it more robust.
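A sketch of the combined loss, using the two functions sketched above, is given below. The relative weights are not stated in the text and the equal weighting here is an assumption; note also that training requires a differentiable EMD approximation rather than the exact assignment sketched earlier.

```python
def prediction_loss(P_target, P_pred, w_cd=1.0, w_emd=1.0):
    """Combined CD + EMD training loss (weights are illustrative)."""
    return w_cd * chamfer_distance(P_target, P_pred) + \
           w_emd * earth_movers_distance(P_target, P_pred)
```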

5.4.2 Evaluation Metrics.

To evaluate our model, we use the CD and EMD metrics also used for training. However, since CD and EMD measure the similarity between two point clouds by averaging the distance across all points, they tend to flatten the distance scores towards zero. This is because the majority of points in a point cloud are predicted almost perfectly (no motion or little motion), while most of the high prediction errors are concentrated in small areas of high or complex motion. Therefore, to better evaluate the model's ability to predict complex motions, besides the CD and EMD we also consider the following additional evaluation metric.
Chamfer distance of the top 5% worst points (CD Top 5%): This metric returns the average CD of the 5% of points with the worst predictions (i.e., the points with the farthest distance to their closest point). We found that the CD Top 5% focuses on the regions where the body performs complex motions and correlates best with visual quality. To the best of our knowledge, we are the first work to present results using the CD Top 5% metric.
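A sketch of how this metric can be computed is given below; whether the nearest-neighbour distances of both directions are pooled before ranking is our assumption.

```python
import torch

def chamfer_top5(P, P_hat, frac=0.05):
    """Average nearest-neighbour error over the worst `frac` of points, so that
    errors concentrated in small regions of complex motion are not averaged
    away by the many well-predicted points."""
    d = torch.cdist(P, P_hat) ** 2
    per_point = torch.cat([d.min(dim=1).values, d.min(dim=0).values])  # both directions
    k = max(1, int(frac * per_point.numel()))
    return per_point.topk(k).values.mean()
```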

6 Experimental Results

In this section, we present and discuss the results of our proposed AGAR method, described in Section 4 for each task and dataset. We begin by presenting and discussing the results of point cloud prediction of human body motions, which is the main goal of this article. Next, to study the robustness of the proposed method, we present the experimental results for the prediction of rigid point clouds (i.e., moving digits and automobile scenes). This is followed by the results for action classification on human body motions. Lastly, we present an ablation study on the prediction of human body motions.

6.1 Prediction of Synthetic Human Body Motions - Mixamo Human Bodies Dataset

The short-term prediction results on the Mixamo dataset of human body activities are reported in Table 2, and Figure 9 depicts prediction examples for two sequences. In addition to evaluating the AGAR architecture with the Adaptative feature combination described in Section 4, we also evaluate a modified AGAR architecture where the Adaptative feature combination is replaced by Classic-FP. The results in Table 2 show that PointRNN and both variations of the AGAR architecture outperform the remaining methods by a large margin, demonstrating the superiority of the RNN architecture for iterative prediction. Furthermore, both AGAR architectures consistently outperform PointRNN, achieving lower prediction errors in all three metrics (CD, EMD, CD Top 5%). Notably, the AGAR with Adaptative feature combination achieves an EMD error of 58.2, surpassing PointRNN's EMD error of 68.0 by 9.8. This gain is especially significant for deformable objects, since shape distortion has a high visual impact. This is particularly noticeable in the last frame (\(t =10\)) of the “Woman-Turning” sequence (Figure 9), where the AGAR prediction suffers less deformation than the PointRNN prediction. In the following, we analyze the improvement provided by each component of the proposed AGAR method to better understand the impact of each limitation on the prediction task.
Mixamo (synthetic human bodies dataset)
Model | CD | EMD | CD Top 5%
Copy-last-input | 0.01056 | 123.4 | 0.2691
PointPWC-Net-pred [45] | 0.09358 | 118.5 | 0.2601
FlowStep3D-pred [17] | 0.09153 | 115.6 | 0.2575
PSTNet-pred [9] | 0.08984 | 114.1 | 0.2556
MoNet [25] | 0.06486 | 75.7 | 0.1897
PointRNN [7] | 0.00351 | 68.0 | 0.1593
AGAR (Classic-FP) | 0.00262 | 59.6 | 0.1412
AGAR (Adaptative) | 0.00254 | 58.2 | 0.1346
Table 2. Point Cloud Prediction Results on the Mixamo Dataset
Fig. 9. Example of prediction of human body activities on the Mixamo dataset.
To understand the impact of combining features in an adaptative manner, we compare the AGAR with Adaptative feature combination and the AGAR with Classic-FP. Table 2 shows that the AGAR with Adaptative feature combination achieves a lower prediction error than the AGAR with Classic-FP. While the improvement in terms of CD and EMD is relatively small, the CD Top 5% metric, which is more sensitive to local distortion, shows a clear improvement for the AGAR with Adaptative feature combination. The superior performance of adaptatively combining dynamic features can also be seen in the visual results in Figure 9: the AGAR with Adaptative feature combination better predicts specific regions, such as the hands and the legs, which involve complex motions. This improvement is due to the module's ability to generate the refined motion predictions required in these regions. These results show the clear advantage of adaptatively combining dynamic features to predict complex motions.
To understand the advantages of incorporating the structural relations between points when learning dynamic features, in Table 3 we compare: (i) the AGAR architecture; (ii) an AGAR model that does not learn spatial features (i.e., without the SS-GNN module) and hence does not take the structural relations between points into account when learning dynamic features; (iii) an AGAR model that learns spatial features but builds only a temporal graph, i.e., a k-NN graph connecting each point of frame t only with points in frame \(t-1\) (the total number of neighbours \(k=8\) remains the same for fairness). All three model variations have a Classic-FP phase. The results show a relatively small gain from building a complete spatio-temporal graph but a significant improvement from learning spatial features. It is worth noting that the CD Top 5% (the metric most sensitive to local shape distortion) is significantly lower for the model that learns spatial features than for the model that does not. This demonstrates that, while both models are able to capture the overall motion, the inclusion of spatial features in the DE phase significantly improves the accuracy and shape preservation of the predicted point cloud.
Mixamo (synthetic human bodies dataset)
Model | Type of graph | Spatial features | CD | EMD | CD Top 5%
(i) AGAR (Classic-FP) | spatio-temporal | yes | 0.00262 | 59.6 | 0.1410
(ii) AGAR without SS-GNN | spatio-temporal | no | 0.00341 | 67.0 | 0.1602
(iii) AGAR with temporal graph only | only temporal | yes | 0.00266 | 60.0 | 0.1417
Table 3. Comparison of Three Variations of the AGAR Framework, Demonstrating the Gain from Including Structural Relations between Points in the Spatio-temporal Graph

6.2 Prediction of Real Human Body Motions - JPEG and CWIPC-SXR Datasets

We now turn our focus to real-world human body datasets: the JPEG and CWIPC-SXR datasets. Since both datasets are too small to train models on, they are used only to evaluate models trained on the Mixamo dataset. Table 4 reports the short-term prediction results on the JPEG and CWIPC-SXR datasets. It can be noted that the Copy-last-input model has a significantly lower prediction error on the real-world datasets (EMD of 42) than on the Mixamo dataset (EMD of 123). In the JPEG and CWIPC-SXR datasets, the point clouds were acquired from real test subjects who were only allowed to move in a small area, resulting in motion of lower magnitude than in the Mixamo dataset. As such, the real-world datasets are significantly easier to predict, and all tested models are able to make accurate predictions. Among them, the AGAR model achieves the smallest prediction error across all metrics; however, given the low magnitude of motion, the differences between models are minimal. Importantly, these results demonstrate that the AGAR model trained on a synthetic human motion dataset can be effectively applied to real-world human motion data, despite the large disparity in motion magnitude between the two.
JPEG and CWIPC-SXR (real-world human body datasets)
Method | JPEG CD | JPEG EMD | JPEG CD Top 5% | CWIPC-SXR CD | CWIPC-SXR EMD | CWIPC-SXR CD Top 5%
Copy-last-input | 0.00118 | 42.0 | 0.09001 | 0.00295 | 43.2 | 0.12915
PointRNN | 0.00109 | 41.3 | 0.083461 | 0.00157 | 43.4 | 0.10973
AGAR (Classic-FP) | 0.00101 | 38.6 | 0.08172 | 0.00150 | 40.8 | 0.10655
AGAR (Adaptative) | 0.00095 | 37.4 | 0.07754 | 0.00155 | 39.8 | 0.10760
Table 4. Prediction Error for the JPEG and CWIPC-SXR Datasets

6.3 Prediction of Rigid Objects - MNIST Moving Digits Dataset

The simplicity of the shapes and movements in the MNIST moving digits dataset makes it the ideal dataset to test the long-term prediction ability of the proposed AGAR method. In long-term prediction, the network uses its output prediction at one time step as input for the subsequent time step. We present the prediction results for the MNIST dataset in Table 6 and prediction examples in Figure 10. Table 6 shows that the AGAR model has superior prediction performance compared to the PointRNN model. This performance gap is particularly large for point clouds containing two digits: the PointRNN CD prediction error is 14.54, whereas the AGAR (Classic-FP) CD error is 1.67. This large gain is due to AGAR's ability to learn spatial features, which allows it to understand the structure and discern the two distinct shapes. This improvement can be seen in Figure 10, where all the evaluated models exhibit a progressive loss of shape; however, the AGAR model suffers significantly less deformation than PointRNN. This visualization demonstrates that AGAR is better at preserving the spatial structure over time, a direct effect of learning the point cloud's spatial structure. Lastly, it can be noted that the AGAR models with Adaptative feature combination and with Classic-FP have similar prediction errors, as also seen in the example in Figure 10. The reason is that the moving digits dataset contains no complex motions (the digits perform simple rigid translations); as such, the control over motion composition provided by the Adaptative feature combination module amounts to unnecessary parameterization and does not translate into more accurate predictions.
Argoverse (automobile scenes dataset, long-term prediction)
Method | CD | EMD
Copy-last-input | 0.5812 | 1092.3
PointRNN | 0.2541 | 895.28
AGAR (Classic-FP) | 0.2680 | 875.22
AGAR (Adaptative) | 0.2839 | 893.24
Table 5. Prediction Error for the Argoverse Dataset
MNIST dataset (long-term prediction)
Method | CD (1 digit) | EMD (1 digit) | CD (2 digits) | EMD (2 digits)
Copy-last-input | 262.46 | 15.94 | 140.14 | 15.8
PointRNN | 2.25 | 2.52 | 14.54 | 6.42
AGAR (Classic-FP) | 0.88 | 1.52 | 1.67 | 2.60
AGAR (Adaptative) | 0.96 | 1.60 | 1.75 | 2.62
Table 6. Prediction Error on the MNIST Dataset
Fig. 10. Long-term MNIST prediction examples.

6.4 Prediction of Automobile Scenes - Argoverse Dataset

Table 5 shows the results of training and evaluating the AGAR model and the PointRNN baseline on the Argoverse automobile dataset. Both methods achieve similar prediction errors. This was an expected result, as the characteristics of deformable bodies on which AGAR relies are not present in the automobile dataset. More specifically, the structural information in the data is not informative and reliable enough for the SS-GNN module to leverage when learning features. Similarly, the data does not exhibit complex motions that would require the Adaptative feature combination module. Hence, the inclusion of both modules does not translate into a meaningful gain. Nevertheless, despite being designed for deformable objects, the proposed AGAR is still able to process point clouds of automobile scenes and capture their overall movement correctly.

6.5 Action Recognition of Human Motions - MSRAction3D Dataset

Table 7 presents the results of the action recognition task on the MSRAction3D dataset. As described in Section 5.2, here we compare AGAR-cls with multiple well-known methodologies optimized for action classification. In the table, we report the accuracy of the different methods given input point cloud sequences of 4, 8, 12, 16, and 24 frames. For shorter sequences (fewer than 12 frames), the proposed AGAR-cls outperforms the state-of-the-art methods. Notably, for sequences of 8 frames, AGAR-cls achieves 87.2% accuracy, approximately 4 percentage points higher than PSTNet and P4Transformer. However, this relative gain is lost for sequences longer than 12 frames, where both PSTNet and P4Transformer are slightly better than AGAR-cls. The reason for this performance decline relative to the state of the art can be attributed to the RNN architecture of the AGAR-cls framework. Accurate action recognition requires the model to retain information about early movements throughout the entire sequence. In AGAR-cls, this information is retained in the RNN hidden states; however, these states are continuously updated at each iteration, so older information is not retained as efficiently as in PSTNet, which processes all frames simultaneously. Despite this limitation, the results demonstrate the ability of AGAR-cls to capture complex motions from human body point clouds, making it a promising model for action recognition, especially for shorter sequences. Furthermore, the understanding of the dynamic features' role in capturing complex human motions, presented in Section 3, can also provide valuable insights for action recognition: understanding how the composition of local and global motions leads to a class prediction can help explain why certain actions are misclassified, leading to the design of more accurate architectures.
MSR-Action3D (accuracy % by number of input frames)
Method | Input | 1 | 4 | 8 | 12 | 16 | 18 | 20 | 24
Vieira et al. [35] | depth | - | - | - | - | - | - | 78.20 | -
Klaser et al. [18] | depth | - | - | - | - | - | 81.43 | - | -
PointNet++ [31] | point | 61.61 | - | - | - | - | - | - | -
MeteorNet [23] | point | - | 78.11 | 81.14 | 86.53 | 88.21 | - | - | 88.50
PSTNet [9] | point | - | 81.14 | 83.50 | 87.88 | 89.90 | - | - | 91.20
P4Transformer [8] | point | - | 80.13 | 83.17 | 87.54 | 89.56 | - | - | 90.94
AGAR-cls | point | - | 81.48 | 87.20 | 88.21 | 88.55 | - | - | 90.09
Table 7. Action Recognition Accuracy (%) on the MSR-Action3D Dataset for Different Numbers of Input Frames

6.6 Ablation Study

To gain a deeper understanding of our proposed architecture, an ablation study is conducted on the Mixamo dataset for short-term prediction.
The number of levels (Table 8): The best results were achieved with an architecture with three hierarchical levels (\(L=3\)), showing that increasing the number of levels does not necessarily lead to superior performance. However, using more than a single level does positively impact accuracy, confirming the importance of hierarchical learning.
Mixamo (synthetic human bodies dataset)
Number of levels | CD | EMD | CD Top 5%
1 | 0.00296 | 65.4 | 0.166
2 | 0.00276 | 61.2 | 0.1461
3 | 0.00262 | 59.6 | 0.1412
4 | 0.00290 | 62.0 | 0.14745
Table 8. Effect of the Number of Levels
Neighbourhood size (Table 9): The results show that increasing the number of neighbour points (k) improves the model performance. However, increasing the number of neighbours also significantly increases the memory required to train the model. This illustrates one of the main limitations of current deep learning frameworks, namely their high GPU memory requirements; this limitation is not addressed in this article.
Mixamo (synthetic human bodies dataset)
Number of neighbours (k) | CD | EMD | CD Top 5%
4 | 0.00290 | 62.8 | 0.1489
8 | 0.00262 | 59.6 | 0.1412
12 | 0.00264 | 58.9 | 0.1407
Table 9. Effect of the Number of Neighbours
The downsampling factor (Table 10): Given a point cloud with 1,000 points, downsampling by a factor of 2 at each level leads to the best results. Using a downsampling factor of 1 (i.e., no sampling between levels) resulted in the worst performance, similar to that obtained using a single level (\(L=1\)). This demonstrates that the improvement gained from the hierarchical architecture comes from learning features from neighbourhoods at different scales.
Mixamo (synthetic human bodies dataset)
Down-sampling factor | CD | EMD | CD Top 5%
1 | 0.00314 | 65.0 | 0.160
2 | 0.00259 | 58.7 | 0.138
4 | 0.00262 | 59.6 | 0.1412
Table 10. Effect of the Downsampling Factor

7 Conclusion

The goal of this article is to improve current prediction frameworks for point clouds representing deformable 3D objects, with a focus on human body motions. To reach this goal, we investigated the current state-of-the-art point-based RNN prediction framework and identified its limitations when processing the deformable shapes and complex motions present in deformable objects. To overcome these limitations, we proposed an improved architecture for dynamic point cloud processing. This architecture includes an initial graph-based module that learns the structural relations of point clouds as spatial features, from which spatio-temporal graphs are constructed. This module is followed by a hierarchy of graph-RNN cells that extract dynamic features from the spatio-temporal graphs, taking the learned structural relations between points into account. Lastly, as a key novelty, we proposed a module able to combine the dynamic features learned by the graph-RNN cells in an adaptative manner. The proposed module assigns a level of attention to each hierarchical feature in order to control the composition of local and global motion that best describes each point's motion. Notably, the inner workings of the adaptative combination module can be visualized and understood, opening the door for future research to gain insights and develop more expressive architectures. Our experimental results demonstrate the superiority of the proposed architecture in motion prediction and in action classification of deformable objects. We also showed that this improvement is due to the method's ability to exploit the spatial structure of the point cloud to extract more representative dynamic features, as well as to the adaptative combination of the dynamic features to predict complex motions.

Footnotes

2
For the sake of image clarity, the motion vectors were uniformly sampled.

References

[1]
Blender. 1994. Blender - a 3D Modelling and Rendering Package. Retrieved June 29, 2022 from http://www.blender.org
[2]
Sergio Cabello, Panos Giannopoulos, Christian Knauer, and Günter Rote. 2008. Matching point sets with respect to the Earth Mover’s Distance. Computational Geometry 39, 2 (2008), 118–133.
[3]
Ming-Fang Chang, John Lambert, Patsorn Sangkloy, Jagjeet Singh, Slawomir Bak, Andrew Hartnett, De Wang, Peter Carr, Simon Lucey, Deva Ramanan, and James Hays. 2019. Argoverse: 3d tracking and forecasting with rich maps. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[4]
Girum G. Demisse, Djamila Aouada, and Björn Ottersten. 2018. Deformation-based 3D facial expression representation. ACM Transactions on Multimedia Computing, Communications, and Applications 14, 1s (2018), 1–22.
[5]
Eugene d’Eon, Bob Harrison, Taos Myers, and Philip A. Chou. 2017. 8i voxelized full bodies a voxelized point cloud dataset. ISO/IEC JTC1/SC29 Joint WG11/WG1 (MPEG/JPEG) Input Document M38673 7, 8 (2017), 11.
[6]
Haoqiang Fan, Hao Su, and Leonidas J. Guibas. 2017. A point set generation network for 3d object reconstruction from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[7]
Hehe Fan and Yi Yang. 2019. PointRNN: Point recurrent neural network for moving point cloud processing. arXiv preprint arXiv:1910.08287 (2019). Retrieved from https://arxiv.org/abs/1910.08287
[8]
Hehe Fan, Yi Yang, and Mohan Kankanhalli. 2022. Point spatio-temporal transformer networks for point cloud video modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence 45, 2 (2022), 2181–2192.
[9]
Hehe Fan, Xin Yu, Yuhang Ding, Yi Yang, and Mohan Kankanhalli. 2020. PSTNet: Point spatio-temporal convolution on point cloud sequences. In Proceedings of the International Conference on Learning Representations.
[10]
Pedro Gomes. 2021. Graph-based network for dynamic point cloud prediction. In Proceedings of the 12th ACM Multimedia Systems Conference.
[11]
Pedro Gomes, Silvia Rossi, and Laura Toni. 2021. Spatio-temporal Graph-RNN for point cloud prediction. In Proceedings of the International Conference on Image Processing.
[12]
Pedro Gomes, Silvia Rossi, and Laura Toni. 2022. Explaining hierarchical features in dynamic point cloud processing. In Proceedings of the IEEE Picture Coding Symposium.
[13]
Kun Hu, Zhiyong Wang, Wei Wang, Kaylena A. Ehgoetz Martens, Liang Wang, Tieniu Tan, Simon J. G. Lewis, and David Dagan Feng. 2019. Graph sequence recurrent neural network for vision-based freezing of gait detection. IEEE Transactions on Image Processing 29, (2019), 1890–1901.
[14]
Tianxin Huang and Yong Liu. 2019. 3D point cloud geometry compression on deep learning. In Proceedings of the 27th ACM International Conference on Multimedia.
[15]
Adobe Inc. 2008. Mixamo: Animated 3D Characters for Games, Film, and More. Retrieved from https://www.mixamo.com/
[16]
Ashesh Jain, Amir R. Zamir, Silvio Savarese, and Ashutosh Saxena. 2016. Structural-rnn: Deep learning on spatio-temporal graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[17]
Yair Kittenplon, Yonina C. Eldar, and Dan Raviv. 2021. Flowstep3D: Model unrolling for self-supervised scene flow estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[18]
Alexander Klaser, Marcin Marszałek, and Cordelia Schmid. 2008. A spatio-temporal descriptor based on 3d-gradients. In Proceedings of the British Machine Vision Conference.
[19]
Wanqing Li, Zhengyou Zhang, and Zicheng Liu. 2010. Action recognition based on a bag of 3d points. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[20]
Yunzhu Li, Jiajun Wu, Russ Tedrake, Joshua B. Tenenbaum, and Antonio Torralba. 2018. Learning particle dynamics for manipulating rigid bodies, deformable objects, and fluids. In Proceedings of the International Conference on Learning Representations.
[21]
Ruixuan Liu and Changliu Liu. 2020. Human motion prediction using adaptable recurrent neural networks and inverse kinematics. Control Systems Letters 5, 5 (2020), 1651–1656.
[22]
Xingyu Liu, Charles R. Qi, and Leonidas J. Guibas. 2019. Flownet3d: Learning scene flow in 3d point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[23]
Xingyu Liu, Mengyuan Yan, and Jeannette Bohg. 2019. MeteorNet: Deep learning on dynamic 3d point cloud sequences. In Proceedings of the IEEE International Conference on Computer Vision.
[24]
Chris Xiaoxuan Lu, Muhamad Risqi U. Saputra, Peijun Zhao, Yasin Almalioglu, Pedro P. B. De Gusmao, Changhao Chen, Ke Sun, Niki Trigoni, and Andrew Markham. 2020. MilliEgo: Single-Chip MmWave radar aided egomotion estimation via deep sensor fusion. In Proceedings of the 18th Conference on Embedded Networked Sensor Systems.
[25]
Fan Lu, Guang Chen, Zhijun Li, Lijun Zhang, Yinlong Liu, Sanqing Qu, and Alois Knoll. 2021. MoNet: Motion-based point cloud prediction network. IEEE Transactions on Intelligent Transportation Systems 23, 8 (2021), 13794–13804.
[26]
Moritz Menze and Andreas Geiger. 2015. Object scene flow for autonomous vehicles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[27]
Benedikt Mersch, Xieyuanli Chen, Jens Behley, and Cyrill Stachniss. 2022. Self-supervised point cloud prediction using 3D spatio-temporal convolutional networks. In Proceedings of the Conference on Robot Learning.
[28]
Yuecong Min, Yanxiao Zhang, Xiujuan Chai, and Xilin Chen. 2020. An efficient pointlstm for point clouds based gesture recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[29]
Himangi Mittal, Brian Okorn, and David Held. 2020. Just go with the flow: Self-supervised scene flow estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[30]
Liangliang Nan, Andrei Sharf, Hao Zhang, Daniel Cohen-Or, and Baoquan Chen. 2010. SmartBoxes for interactive urban reconstruction. ACM Transactions on Graphics 29, 4 (2010), 10.
[31]
Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J. Guibas. 2017. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Proceedings of the Conference on Neural Information Processing Systems.
[32]
Ignacio Reimat, Evangelos Alexiou, Jack Jansen, Irene Viola, Shishir Subramanyam, and Pablo Cesar. 2021. CWIPC-SXR: Point cloud dynamic human dataset for social XR. In Proceedings of the 12th ACM Multimedia Systems Conference.
[33]
Luana Ruiz, Fernando Gama, and Alejandro Ribeiro. 2020. Gated graph recurrent neural networks. IEEE Transactions on Signal Processing 68 (2020), 6303–6318.
[34]
Yan Tian, Yujie Zhang, Wei-Gang Chen, Dongsheng Liu, Huiyan Wang, Huayi Xu, Jianfeng Han, and Yiwen Ge. 2022. 3D tooth instance segmentation learning objectness and affinity in point cloud. ACM Transactions on Multimedia Computing, Communications, and Applications 18, 4 (2022), 1–16.
[35]
Antonio W. Vieira, Erickson R. Nascimento, Gabriel L. Oliveira, Zicheng Liu, and Mario F. M. Campos. 2012. Stop: Space-time occupancy patterns for 3d action recognition from depth map sequences. In Proceedings of the Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications Congress.
[36]
Irene Viola, Jelmer Mulder, Francesca De Simone, and Pablo Cesar. 2019. Temporal interpolation of dynamic digital humans using convolutional neural networks. In Proceedings of the IEEE International Conference on Artificial Intelligence and Virtual Reality.
[37]
Haiyan Wang, Jiahao Pang, Muhammad A Lodhi, Yingli Tian, and Dong Tian. 2021. FESTA: Flow estimation via spatial-temporal attention for scene point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[38]
Lintao Wang, Kun Hu, Lei Bai, Yu Ding, Wanli Ouyang, and Zhiyong Wang. 2023. Multi-scale control signal-aware transformer for motion synthesis without phase. arXiv preprint arXiv:2303.01685 (2023). Retrieved from https://arxiv.org/abs/2303.01685
[39]
Peng-Shuai Wang, Yang Liu, Yu-Xiao Guo, Chun-Yu Sun, and Xin Tong. 2017. O-CNN: Octree-based convolutional neural networks for 3D shape analysis. ACM Transactions on Graphics 36, 4 (2017), 1–11.
[40]
Xu Wang, Yi Jin, Yigang Cen, Tao Wang, and Yidong Li. 2021. Attention models for point clouds in deep learning: A survey. arXiv preprint arXiv:2102.10788 (2021). Retrieved from https://arxiv.org/abs/2102.10788
[41]
Zirui Wang, Shuda Li, Henry Howard-Jenkins, Victor Prisacariu, and Min Chen. 2020. Flownet3d++: Geometric losses for deep scene flow estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[42]
Mingqiang Wei, Zeyong Wei, Haoran Zhou, Fei Hu, Huajian Si, Zhilei Chen, Zhe Zhu, Jingbo Qiu, Xuefeng Yan, Yanwen Guo, Jun Wang, and Jing Qin. 2023. Agconv: Adaptive graph convolution on 3D point clouds. IEEE Transactions on Pattern Analysis and Machine Intelligence 45, 8 (2023), 9374–9392.
[43]
Yi Wei, Ziyi Wang, Yongming Rao, Jiwen Lu, and Jie Zhou. 2021. PV-RAFT: Point-voxel correlation fields for scene flow estimation of point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[44]
Cheng Wencan and Jong Hwan Ko. 2021. Segmentation of points in the future: Joint segmentation and prediction of a point cloud. IEEE Access 9 (2021), 52977–52986.
[45]
Wenxuan Wu, Zhi Yuan Wang, Zhuwen Li, Wei Liu, and Li Fuxin. 2020. Pointpwc-net: Cost volume on point clouds for (self-) supervised scene flow estimation. In Proceedings of the European Conference on Computer Vision.
[46]
Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and S. Yu Philip. 2020. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems 32, 1 (2020), 4–24.
[47]
Mayssa Zaier, Hazem Wannous, Hassen Drira, and Jacques Boonaert. 2023. A dual perspective of human motion analysis-3D pose estimation and 2D trajectory prediction. In Proceedings of the IEEE International Conference on Computer Vision.
[48]
Chaoyun Zhang, Marco Fiore, Iain Murray, and Paul Patras. 2021. Cloudlstm: A recurrent neural model for spatiotemporal point cloud stream forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence.
[49]
Yi Zhang, Yuwen Ye, Zhiyu Xiang, and Jiaqi Gu. 2020. SDP-Net: Scene flow based real-time object detection and prediction from sequential 3D point clouds. In Proceedings of the Asian Conference on Computer Vision.
[50]
Guanyu Zhu, Yong Zhou, Rui Yao, Hancheng Zhu, and Jiaqi Zhao. 2022. Cyclic self-attention for point cloud recognition. ACM Transactions on Multimedia Computing, Communications, and Applications 19, 1s (2022), 1–19.
[51]
Wenjie Zhu, Zhan Ma, Yiling Xu, Li Li, and Zhu Li. 2020. View-dependent dynamic point cloud compression. IEEE Transactions on Circuits and Systems for Video Technology 31, 2 (2020), 765–781.
