1 Introduction
Point cloud sequences are a flexible and rich geometric representation of volumetric content used in a wide range of applications, from autonomous driving [26, 30] and robotics [24, 50] to virtual/mixed-reality services [4, 34]. Such sequences consist of consecutive point clouds, each composed of an unordered collection of 3D points representing 3D scenes or 3D objects. Although the point cloud is a highly appealing representation impacting multiple sectors, how to properly process it is still an open challenge. One of the most successful methodologies has been the development of neural networks able to learn directly from unstructured point cloud data. This approach was pioneered by the PointNet [6] architecture, which learns features by processing each point independently. However, such an architecture does not capture the local structures that contain key semantic information of the 3D geometry. PointNet++ [31] addresses this issue by considering neighbourhoods of points instead of acting on each point independently. To this end, PointNet++ employs a hierarchical architecture that processes point neighbourhoods at increasingly larger scales along a multi-resolution hierarchy, as shown on the left side of Figure 1. This approach groups the local features learned from small neighbourhoods into larger neighbourhoods and processes them to learn higher-level features. By learning hierarchical features, the network can abstract the multiple local-to-global structures within the data. Although the PointNet++ hierarchical architecture was initially designed for static point clouds, it has since been extended to the case of dynamic point clouds [22, 23]. In these cases, instead of extracting features from neighbourhoods in a single point cloud, the network extracts dynamic features from a hierarchy of spatio-temporal neighbourhoods across time. The learned dynamic features can be applied to a wide range of downstream tasks, such as action classification, motion prediction and segmentation. In this article, we focus on the point cloud prediction task. Specifically, given a point cloud sequence \(\mathcal {P}=\lbrace P_{1}, \ldots , P_{T}\rbrace\) composed of \(T\) frames, with \(p_{i,t} \in \mathbf {R}^3\) being the Euclidean coordinates of point \(i\) in point cloud \(P_t \in \mathbf {R}^{N \times 3}\), our goal is to predict the coordinates of future point clouds (\(\hat{P}_{T+1},\ldots ,\hat{P}_{T+Q}\)), where \(Q\) is the prediction horizon.
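To make the hierarchical processing concrete, the sketch below implements one PointNet++-style set-abstraction step in plain NumPy: farthest point sampling selects centroids, k-nearest-neighbour grouping forms neighbourhoods, and max-pooling aggregates each neighbourhood into a single feature. This is a minimal illustration of the local-to-global hierarchy under our own simplifications (the function names, toy shapes, and the max-pool standing in for a shared MLP are ours), not the implementation used in this article.

```python
import numpy as np

def farthest_point_sampling(points, n_samples):
    """Iteratively pick the point farthest from the already-selected centroids."""
    selected = [0]
    dist = np.full(points.shape[0], np.inf)
    for _ in range(n_samples - 1):
        dist = np.minimum(dist, np.linalg.norm(points - points[selected[-1]], axis=1))
        selected.append(int(np.argmax(dist)))
    return np.asarray(selected)

def set_abstraction(points, feats, n_centroids, k):
    """One hierarchical level: sample centroids, group their k nearest
    neighbours, and max-pool the grouped features (stand-in for a shared MLP)."""
    idx = farthest_point_sampling(points, n_centroids)
    centroids = points[idx]                                    # (n_centroids, 3)
    d = np.linalg.norm(centroids[:, None, :] - points[None, :, :], axis=-1)
    knn = np.argsort(d, axis=1)[:, :k]                         # (n_centroids, k)
    pooled = feats[knn].max(axis=1)                            # (n_centroids, C)
    return centroids, pooled

# Toy two-level hierarchy: small neighbourhoods first, then larger-scale ones.
pts, f = np.random.rand(1024, 3), np.random.rand(1024, 8)
p1, f1 = set_abstraction(pts, f, n_centroids=256, k=16)        # local features
p2, f2 = set_abstraction(p1, f1, n_centroids=64, k=16)         # more global features
```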
At the moment, point-based hierarchical methods can be considered the de-facto state-of-the-art approach for point cloud prediction. However, while these methodologies have shown good performance when predicting simple and rigid movements, such as translations in automobile scenes [3], they are often limited when predicting the motion of 3D deformable objects. Addressing this limitation is the main goal of this article. Predicting deformable objects is challenging since the point cloud shape changes over time and performs highly complex motions. For example, in a 3D representation of a football player running or a dancer performing during a music event, their point cloud representations change over time following different postures. Moreover, the performed movements are not rigid transformations but rather a combination of multiple and diverse local motions. For instance, if we imagine the player raising a hand while running, their arm and hand will be characterized by a combination of movements (i.e., a local rising movement and the global forward translation). Given these characteristics, processing 3D deformable objects presents two major challenges: (i) establishing point correspondence across time and preserving the shape of the predicted point cloud; (ii) generating accurate motion predictions that are a composition of multiple movements at different levels of resolution.
To address the above challenges, we must first understand whether current state-of-the-art models are equipped to meet them. Within this context, we first demonstrate these models' inability to establish precise temporal correlations and to preserve the predicted point cloud shape. This is because they fail to consider the structural relationships between the points during the learning process. Then, to investigate the challenge of predicting complex motions, we employ the explainability techniques introduced in our previous work [12]. These techniques demonstrated that the hierarchy of dynamic features corresponds to learning from local to global motions (centre of Figure 1). In this article, we build upon this interpretation to identify the technical limitations of the current framework. Specifically, we show that most methodologies [7, 11, 22, 29] generate predictions of future motions by combining the hierarchical features via learnable weights. Most critically, to preserve permutation invariance, the same learned weights are applied to all points across frames when combining hierarchical features. However, in deformable objects, not all points benefit from the same combination of hierarchical features. For example, some points can be described entirely by global motions, while other points are better described by a combination of global and local motions. We show that this fixed combination of hierarchical features is a key limitation to the network's ability to predict complex motions.
Based on the limitations identified above, we propose AGAR: an attention-based hierarchical graph-recurrent neural network (RNN) for point cloud prediction of deformable objects. Our proposed architecture includes an initial graph-based module that extracts the underlying geometric structure of the input point cloud as spatial features. From the learned spatial features, we construct a spatio-temporal graph that forms more representative neighbourhoods than current methods, which neglect the point cloud structure. The graph is then processed by sequential graph-RNN cells that take the structural relations between points into account to learn dynamic features. To address the limitation of the fixed combination of hierarchical features, we propose a novel module denoted Adaptative feature combination. The proposed module employs an attention mechanism to dynamically assign different degrees of importance to each level of hierarchical features. As such, for each point, the network can control the composition of the local and global motions that best describes the point's behaviour. This concept is illustrated in the right part of Figure 1, where the network selects the regions that benefit from particular motions (i.e., local, semi-local, global) instead of blindly combining all the motions learned at the multiple hierarchical levels. Besides improving the prediction of complex motions, the Adaptative feature combination module also serves as an explainability tool: it allows us to visualize the influence of each learned feature on the predicted motion, providing a deeper understanding of the network's internal workings.
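As a rough sketch of the idea (not the exact module, whose details are given in Section 4), the snippet below computes a per-point softmax attention over the \(L\) levels of hierarchical features and uses it to form a point-specific weighted combination; the attention parameter `W_att` is a hypothetical stand-in for the module's learned weights.

```python
import numpy as np

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def adaptative_combination(level_feats, W_att):
    """level_feats: (L, N, C) hierarchical features propagated to all N points.
    W_att: (C,) toy attention parameters (learned in the real module).
    Each point gets its own mixing weights over the L levels."""
    scores = level_feats @ W_att             # (L, N): one score per level and point
    alpha = softmax(scores, axis=0)          # per-point weights, sum to 1 over levels
    return (alpha[..., None] * level_feats).sum(axis=0), alpha   # (N, C), (L, N)

# Unlike a fixed combination, alpha differs from point to point:
feats = np.random.rand(3, 1024, 8)           # L=3 levels, N=1024 points, C=8
combined, alpha = adaptative_combination(feats, np.random.rand(8))
```

The returned `alpha` is also what makes the module an explainability tool: visualizing it per point shows which motion level dominates each region of the object.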
The proposed method is trained in a self-supervised fashion and tested on several datasets, namely the Mixamo synthetic human-body activities dataset [15] and the JPEG [5] and CWIPC-SXR [32] real-world human-body datasets, and compared against state-of-the-art methods. To extend this comparison, we also test on a dataset of rigid objects (moving MNIST point cloud dataset [7]) and a dataset of automobile scenes (Argoverse dataset [3]). A key strength of our framework is its ability to extract the general dynamic behaviour of the point cloud as dynamic features. Since such features are useful for downstream tasks, we also test the proposed architecture on the action recognition task for human bodies (MSRAction3D dataset [19]). The proposed method outperforms the state-of-the-art methods in human-body prediction and achieves on-par results for rigid objects and automobile scene prediction, as well as for the action recognition task. The results demonstrate that our proposed method can leverage the structural relations between points to learn more accurate representations and preserve the point cloud shape during prediction. The results further show that the proposed Adaptative feature combination module predicts complex motions in human bodies with higher accuracy than the current state-of-the-art approaches. Lastly, the code and datasets required to reproduce the work are made publicly available.
In summary, the key contributions of our work are:
— Understanding of the key limitation of current state-of-the-art frameworks in generating motion flow predictions. We show how the current approach is equivalent to combining learned local and global motions without regard to the point's position in space and time, and how this strategy fails to model the complex motions present in deformable objects.
— A novel module that combines hierarchical features in an adaptative manner according to the scene context. The proposed module dynamically controls the composition of local and global motions for each point, allowing the network to predict complex motions with higher accuracy and flexibility. It also offers an explainability tool.
— A graph-based module that exploits the point cloud's geometric structure to form spatio-temporal neighbourhoods from which meaningful dynamic features can be extracted. The structural information is further included in the learned dynamic features, reducing the deformation of the predicted point cloud shape.
The remainder of this article is organized as follows: In Section 2, we review the state of the art in point cloud prediction. In Section 3, we study the hierarchical component and identify the limitations of the state-of-the-art prediction framework. Based on the identified limitations, in Section 4 we propose AGAR, an improved architecture with graph-RNN cells and a novel Adaptative feature combination module. Section 5 describes implementation details. Finally, the experimental results and conclusion are presented in Sections 6 and 7, respectively.
3 Challenges and Limitations
The hierarchical point-based RNN framework presented in the previous section suffers from several limitations when facing the challenge of processing deformable objects such as human-body-like sequences. In this article, we explain why those challenges arise and how to overcome them. In the following, we disentangle the challenges of current models into (i) challenges in processing/predicting objects with deformable shapes (Section 3.1); and (ii) challenges in predicting complex motions (Section 3.2). Taking advantage of the understanding built in this section, in Section 4 we introduce our proposed method, built to overcome the main limitations identified here.
3.1 Challenges in Processing Deformable Shapes
The main challenges encountered in processing and predicting objects with deformable shapes, such as clothing, food, or human bodies, are (i) establishing a semantically meaningful point-to-point correspondence (used to learn dynamic features); and (ii) avoiding shape distortion (which is highly noticeable in 3D objects and therefore has a strong negative impact on point cloud prediction quality).
The challenge of establishing point-to-point correspondence is present in any point cloud processing, but it is clearly exacerbated in the case of 3D deformable objects. The majority of current works follow the same strategy as PointRNN [7] and assume that the points in the current frame are matched with points in close proximity in the previous frame, where proximity is measured in 3D Euclidean space. However, in 3D deformable objects, points that are geometrically close in space are not necessarily semantically correlated and do not necessarily belong to the same segment of the object. Figure 3 shows three examples of how matching based on geometric proximity can lead to the creation of misleading neighbourhoods. This means that point correspondence across time is challenged by the mismatch between Euclidean proximity and semantically meaningful proximity.
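The proximity assumption can be written in a few lines; the sketch below (our own simplification, not PointRNN's exact code) matches each point in the current frame with its \(k\) nearest points in the previous frame purely by Euclidean distance. This is exactly where semantically unrelated points, e.g., a hand passing close to the torso, can end up in the same neighbourhood.

```python
import numpy as np

def match_previous_frame(curr, prev, k=8):
    """For each point in the current frame, return the indices of its k
    nearest points in the previous frame (Euclidean proximity only)."""
    d = np.linalg.norm(curr[:, None, :] - prev[None, :, :], axis=-1)  # (N, M)
    return np.argsort(d, axis=1)[:, :k]                               # (N, k)

# Toy usage on two consecutive frames of N = M = 1024 points:
P_prev, P_curr = np.random.rand(1024, 3), np.random.rand(1024, 3)
neighbours = match_previous_frame(P_curr, P_prev, k=8)
```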
On the other hand, current methods often struggle to preserve the predicted point cloud shape. This is mainly because a separate motion vector is learned for every point with no clear semantic constraints. If these motion vectors vary significantly among neighbouring points, the result is a prediction with a deformed shape. This issue can be tackled by imposing hard shape constraints, such as learning a single motion vector for all the points in a region. However, this strategy can only be applied to rigid objects. In deformable objects, the object shape changes according to different postures, meaning points must be allowed to have separate motions. Thus, it is important to strike a balance between preserving the shape and having enough per-point motion flexibility to predict possible shape variations. The key to achieving this balance is to capture the underlying semantic structure and take it into account as a soft shape constraint during the learning process.
Both the point correspondence and shape deformation challenges can be summarized in the following limitation: lack of structural relationships between points in point cloud prediction (Limitation 1). Learning and exploiting this prior during the learning process is one of the novelties of our proposed model, and it is specifically addressed by learning a semantically meaningful graph and exploiting this graph when extracting features (via the graph-RNN cell).
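To give an intuition of this alternative, the sketch below builds a k-NN graph from pairwise distances in an arbitrary representation space; running it on learned spatial features rather than raw 3D coordinates yields neighbourhoods that follow the object's structure. The variable `spatial_feats` is a hypothetical placeholder for the output of the graph-based module described in Section 4.

```python
import numpy as np

def knn_graph(x, k):
    """k-NN adjacency from pairwise distances in any representation space."""
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                  # exclude self-loops
    return np.argsort(d, axis=1)[:, :k]          # (N, k) neighbour indices

points = np.random.rand(512, 3)                  # raw coordinates
spatial_feats = np.random.rand(512, 16)          # placeholder learned features
neigh_euclidean = knn_graph(points, k=8)         # proximity-only neighbourhoods
neigh_semantic = knn_graph(spatial_feats, k=8)   # structure-aware neighbourhoods
```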
3.2 Challenges in Processing Complex Motions
A second key challenge in processing 3D dynamic objects such as the human body is that the movement of such objects is usually a complex motion. Complex motions refer to movements that involve a combination of multiple degrees of freedom, such as translation, rotation, and deformation, applied to different parts of the object independently. This is typical of deformable objects or any 3D objects with disjoint components, each with its own movement. As an example, consider a point cloud representing a human body running forward (Figure 4(a), Man-Running). While the full body moves forward (translation), the person swings their arms (rotation), and their hand bends from an open to a closed position (shape change). The complex nature of such movements makes them challenging to capture and predict accurately. Based on a novel visualization technique that we introduced in our previous work on explainability [12], we now highlight key limitations of the current architectures. Specifically, we show how complex motions can be seen as a sum of low-, medium-, and high-level motions, leading to the understanding that the current model suffers from the following main limitation: the fixed combination of hierarchical features in the prediction phase (Limitation 2). We now explain this limitation in more detail.
In our explainability work [12], we demonstrated that the motion vectors inferred by hierarchical architectures (Figure 2) can be disentangled into individual motion vectors produced at each hierarchical level, as follows:
\[
M_t = \sum_{l=1}^{L} M_t^{l}, \qquad M_t^{l} = Classic^{l}_{FP}(\cdot),
\]
where \(Classic^{l}_{FP}\) is the function that replicates the operation of the Classic-FP in a disentangled manner, converting the learned feature at each level \(l\) into an individual motion vector \(M_t^l\), and \(M_t\) is the final predicted motion vector output by the network. This leads to the interpretation that current approaches in the literature model complex motions as a combination of local and global motions, which are learned as hierarchical dynamic features. This is illustrated in Figure 4, which depicts the dynamic features as motion vectors and the hierarchical neighbourhoods for two point cloud sequences given as input to a state-of-the-art prediction architecture (presented in Figure 2) with three levels (\(L=3\)) [12]. In both sequences, it can be seen that the lower level learns features only by looking at points in a small area (top gold squares in the figure). In contrast, the higher level learns features by considering a sparser set of points in a large area (bottom blue squares in the figure). In the example in Figure 4(a), in which the runner's foot performs a complex motion, it can be observed that the lowest level captures small and diverse motions (e.g., rotation of the heel) \(M_t^1\), while the highest level learns the forward motion of the entire body \(M_t^3\).
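In code, the disentangled view of Classic-FP amounts to the following sketch (our own illustration under the notation above): each level's features are mapped to a per-level motion \(M_t^l\) by a weight matrix \(\Theta^l_{FP}\) that is shared by every point and frame, and the final motion is their sum.

```python
import numpy as np

def classic_fp_disentangled(level_feats, thetas):
    """level_feats: (L, N, C) per-level dynamic features for N points.
    thetas: (L, C, 3) weight matrices, identical for ALL points and frames.
    Returns per-level motions M_t^l of shape (L, N, 3) and their sum M_t."""
    M_levels = np.einsum('lnc,lco->lno', level_feats, thetas)  # M_t^l per level
    return M_levels, M_levels.sum(axis=0)                      # M_t = sum_l M_t^l

# Because thetas do not depend on the point, every point receives the same
# fixed blend of local, medium, and global motion:
feats = np.random.rand(3, 1024, 8)                             # L=3, N=1024, C=8
M_levels, M_t = classic_fp_disentangled(feats, np.random.rand(3, 8, 3))
```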
This interpretation of features as motion vectors can be generalized to the majority of current methods: while they differ in the feature extraction process, they all share the Classic-FP strategy for the motion reconstruction process. As such, we build on this explainability technique to identify the limitations of the current state-of-the-art framework in predicting complex motions. Namely, the motion vector prediction is obtained by combining the dynamic features from the different levels via a learned weighted combination. However, each point's motion is obtained using the same set of combination weights \([\Theta ^{1}_{\text{FP}}, \ldots , \Theta ^{L}_{\text{FP}}]\) for all points, frames, and sequences. As a result, for every point, regardless of its position in space and time, the predicted motion is obtained by the same fixed combination of local, medium, and global motions. Based on this technique, we can understand that (i) different features can be associated with the different levels of motion forming the complex resultant motion, and (ii) knowing that different parts of the object might be subject to different types of movements highlights the strong limitation of applying the same combination of motion levels everywhere. Specifically, while a given set of weights might lead to the appropriate combination of the motion vectors in Figure 4(a), in which a local movement is analysed (foot), it does not hold in the case of the "Woman-Running" sequence in Figure 4(b), in which a more global movement is highlighted (torso). The points in the torso perform a rigid forward movement corresponding to the global motion of the body, while the lower part of the body performs a quite dynamic rotation of the foot. This means that the global motion vector (pointing forward) alone would be sufficient to describe the movement of the torso. However, local features (hence local motions) cannot simply be neglected, since this would discard the local motions in parts with strong local movement, such as the foot. As a result, in Figure 4(b) the local motion vectors (\(M_t^1\)) clearly lose any motion interpretation and become instead random vectors mainly used to compensate for the erroneous addition of multiple motion vectors in this part of the body.
It is worth mentioning that, while this understanding might appear straightforward, to the best of our knowledge this is the first work explaining PointRNN and similar hierarchical architectures when processing 3D deformable objects, showing the limitation of adopting a fixed combination of hierarchical features in the prediction phase. In the next section, we propose an architecture that overcomes this limitation by introducing an attention-based mechanism in the prediction phase.