Robust Motion In-betweening
FÉLIX G. HARVEY, Polytechnique Montreal, Canada, Mila, Canada, and Ubisoft Montreal, Canada
MIKE YURICK, Ubisoft Montreal, Canada
DEREK NOWROUZEZAHRAI, McGill University, Canada and Mila, Canada
CHRISTOPHER PAL, CIFAR AI Chair, Canada, Polytechnique Montreal, Canada, Mila, Canada, and Element AI, Canada
Fig. 1. Transitions automatically generated by our system between target keyframes (in blue). For clarity, only one in four generated frames is shown. Our
tool allows for generating transitions of variable lengths and for sampling different variations of motion given fixed keyframes.
In this work we present a novel, robust transition generation technique that can serve as a new tool for 3D animators, based on adversarial recurrent neural networks. The system synthesises high-quality motions that use temporally-sparse keyframes as animation constraints. This is reminiscent of the job of in-betweening in traditional animation pipelines, in which an animator draws motion frames between provided keyframes. We first show that a state-of-the-art motion prediction model cannot be easily converted into a robust transition generator when only adding conditioning information about future keyframes. To solve this problem, we then propose two novel additive embedding modifiers that are applied at each timestep to latent representations encoded inside the network's architecture. One modifier is a time-to-arrival embedding that allows variations of the transition length with a single model. The other is a scheduled target noise vector that allows the system to be robust to target distortions and to sample different transitions given fixed keyframes. To qualitatively evaluate our method, we present a custom MotionBuilder plugin that uses our trained model to perform in-betweening in production scenarios. To quantitatively evaluate performance on transitions and generalizations to longer time horizons, we present well-defined in-betweening benchmarks on a subset of the widely used Human3.6M dataset and on LaFAN1, a novel high quality motion capture dataset that is more appropriate for transition generation. We are releasing this new dataset along with this work, with accompanying code for reproducing our baseline results.

CCS Concepts: • Computing methodologies → Motion capture; Neural networks.

Additional Key Words and Phrases: animation, locomotion, transition generation, in-betweening, deep learning, LSTM
ACM Reference Format:
Félix G. Harvey, Mike Yurick, Derek Nowrouzezahrai, and Christopher Pal. 2020. Robust Motion In-betweening. ACM Trans. Graph. 39, 4, Article 60 (July 2020), 12 pages. https://doi.org/10.1145/3386569.3392480

1 INTRODUCTION
Human motion is inherently complex and stochastic for long-term horizons. This is why Motion Capture (MOCAP) technologies still often surpass generative modeling or traditional animation techniques for 3D characters with many degrees of freedom. However, in modern video games, the number of motion clips needed to properly animate a complex character with rich behaviors is often very large and manually authoring animation sequences with keyframes or using a MOCAP pipeline are highly time-consuming processes. Some methods to improve curve fitting between keyframes [Ciccone et al. 2019] or to accelerate the MOCAP workflow [Holden 2018] have been proposed to improve these processes. On another
front, many auto-regressive deep learning methods that leverage high quality MOCAP for motion prediction have recently been proposed [Barsoum et al. 2018; Chiu et al. 2019; Fragkiadaki et al. 2015; Gopalakrishnan et al. 2019; Jain et al. 2016; Martinez et al. 2017; Pavllo et al. 2019]. Inspired by these achievements, we build in this work a transition generation tool that leverages the power of Recurrent Neural Networks (RNN) as powerful motion predictors to go beyond keyframe interpolation techniques, which have limited expressiveness and applicability.

We start by building a state-of-the-art motion predictor based on several recent advances on modeling human motion with RNNs [Chiu et al. 2019; Fragkiadaki et al. 2015; Pavllo et al. 2019]. Using a recently proposed target-conditioning strategy [Harvey and Pal 2018], we convert this unconstrained predictor into a transition generator, and expose the limitations of such a conditioning strategy. These limitations include poor handling of transitions of different lengths for a single model, and the inherent determinism of the architectures. The goal of this work is to tackle such problems in order to present a new architecture that is usable in a production environment.

To do so, we propose two different additive modifiers applied to some of the latent representations encoded by the network. The first one is a time-to-arrival embedding applied on the hidden representation of all inputs. This temporal embedding is similar to the positional encoding used in transformer networks [Vaswani et al. 2017] in natural language modeling, but serves here a different role. In our case, these embeddings evolve backwards in time from the target frame in order to allow the recurrent layer to have a continuous, dense representation of the number of timesteps remaining before the target keyframe must be reached. This proves to be essential to remove artifacts such as gaps or stalling at the end of transitions. The second embedding modifier is an additive scheduled target noise vector that forces the recurrent layer to receive distorted target embeddings at the beginning of long transitions. The scheduled scaling reduces the norm of the noise during the synthesis in order to reach the correct keyframe. This forces the generator to be robust to noisy target embeddings. We show that it can also be used to enforce stochasticity in the generated transitions more efficiently than another noise-based method. We then further increase the quality of the generated transitions by operating in the Generative Adversarial Network (GAN) framework with two simple discriminators applied on different timescales.

This results in a temporally-aware, stochastic, adversarial architecture able to generate missing motions of variable length between sparse keyframes of animation. The network takes 10 frames of past context and a single target keyframe as inputs and produces a smooth motion that leads to the target, on time. It allows for cyclic and acyclic motions alike and can therefore help generate high-quality animations from sparser keyframes than what is usually allowed by curve-fitting techniques. Our model can fill gaps of an arbitrary number of frames under a soft upper-bound, and we show that the particular form of temporal awareness we use is key to achieving this without needing any smoothing post-process. The resulting system allows us to perform robust, automatic in-betweening, or can be used to stitch different pieces of existing motions when blending is impossible or yields poor quality motion.

Our system is tested in production scenarios by integrating a trained network in a custom plugin for Autodesk's MotionBuilder, a popular animation software, where it is used to greatly accelerate prototyping and authoring new animations. In order to also quantitatively assess the performance of different methods on the transition generation task, we present the LaFAN1 dataset, a novel collection of high quality MOCAP sequences that is well-suited for transition generation. We define in-betweening benchmarks on this new dataset as well as on a subset of Human3.6M, commonly used in the motion prediction literature. Our procedure stays close to the common evaluation scheme used in many prediction papers and defined by Jain et al. [2016], but differs on some important aspects. First, we provide error metrics that take into consideration the global root transformation of the skeleton, which provides a better assessment of the absolute motion of the character in the world. This is mandatory in order to produce and evaluate valid transitions. Second, we train and evaluate the models in an action-agnostic fashion and report average errors on a large evaluation set, as opposed to the commonly used 8 sequences per action. We further report generalization results for transitions that are longer than those seen during training. Finally, we also report the Normalized Power Spectrum Similarity (NPSS) measure for all evaluations, as suggested by Gopalakrishnan et al. [2019], which reportedly correlates better with human perception of quality.

Our main contributions can thus be summarized as follows:
• Latent additive modifiers to convert state-of-the-art motion predictors into robust transition generators:
  – A time-to-arrival embedding allowing robustness to varying transition lengths,
  – A scheduled target-noise vector allowing variations in generated transitions,
• New in-betweening benchmarks that take into account global displacements and generalization to longer sequences,
• LaFAN1, a novel high quality motion dataset well-suited for motion prediction that we make publicly available with accompanying code for reproducing our baseline results¹.

¹ https://github.com/ubisoftinc/Ubisoft-LaForge-Animation-Dataset

2 RELATED WORK
2.1 Motion Control
We refer to motion control here as scenarios in which temporally-dense external signals, usually user-defined, are used to drive the generation of an animation. Even if the main application of the present work is not focused on online control, many works on motion control stay relevant to this research. Motion graphs [Arikan and Forsyth 2002; Beaudoin et al. 2008; Kovar et al. 2008; Lee et al. 2002] allow one to produce motions by traversing nodes and edges that map to character states or motion segments from a dataset. Safonova and Hodgins [2007] combine an interpolated motion graph with an anytime A* search algorithm in order to produce transitions that respect some constraints. Motion matching [Büttner and Clavet 2015] is another search-driven motion control technique, where the current character pose and trajectory are matched to segments of animation in a large dataset. Chai & Hodgins, and Tautges et al. [2005; 2011] rely on learning
local PCA models on pose candidates from a motion dataset given low-dimensional control signals and previously synthesized poses in order to generate the next motion frame. All these techniques require a motion database to be loaded in memory or, in the latter cases, to perform searches and learning at run-time, limiting their scalability compared to generative models.

Many machine learning techniques can mitigate these requirements. Important work has used the Maximum A Posteriori (MAP) framework where a motion prior is used to regularize constraint(s)-related objectives to generate motion. Chai and Hodgins [2007] use a statistical dynamics model as a motion prior and user constraints, such as keyframes, to generate motion. Min et al. [2009] use deformable motion models and optimize the deformable parameters at run-time given the MAP framework. Other statistical models, such as Gaussian Processes [Min and Chai 2012] and Gaussian Process Latent Variable Models [Grochow et al. 2004; Levine et al. 2012; Wang et al. 2008; Ye and Liu 2010] have been applied to the constrained motion control task, but are often limited by heavy run-time computations and memory requirements that still scale with the size of the motion database. As a result, these are often applied to separate types of motions and combined together with some post-process, limiting the expressiveness of the systems.

Deep neural networks can circumvent these limitations by allowing huge, heterogeneous datasets to be used for training, while having a fixed computation budget at run-time. Holden et al. [2016; 2015] use feed-forward convolutional neural networks to build a constrained animation synthesis framework that uses root trajectory or end-effectors' positions as control signals. Online control from a gamepad has also been tackled with phase-aware [Holden et al. 2017], mode-aware [Zhang et al. 2018] and action-aware [Starke et al. 2019] neural networks that can automatically choose a mixture of network weights at run-time to disambiguate possible motions. Recurrent Neural Networks (RNNs), on the other hand, keep an internal memory state at each timestep that allows them to perform such disambiguation naturally, and are very well suited for modeling time series. Lee et al. [2018] train an RNN for interactive control using multiple control signals. These approaches [Holden et al. 2017, 2016; Lee et al. 2018; Zhang et al. 2018] rely on spatially or temporally dense signals to constrain the motion and thus reduce ambiguity. In our system, a character might have to precisely reach a temporally distant keyframe without any dense spatial or temporal information provided by the user during the transition. The spatial ambiguity is mostly alleviated by the RNN's memory and the target-conditioning, while the timing ambiguity is resolved in our case by time-to-arrival embeddings added to the RNN inputs. Remaining ambiguity can be alleviated with generative adversarial training [Goodfellow et al. 2014], in which the motion generator learns to fool an additional discriminator network that tries to differentiate generated sequences from real sequences. Barsoum et al. [2018] and Gui et al. [2018] both design new loss functions for human motion prediction, while also using adversarial losses with different types of discriminators. These losses help reduce artifacts that may be produced by generators that average different modes of the plausible motions' distribution.

Motion control has also been addressed with Reinforcement Learning (RL) approaches, in which the problem is framed as a Markov Decision Process where actions can correspond to actual motion clips [Lee and Lee 2006; Treuille et al. 2007] or character states [Lee et al. 2010], but again requiring the motion dataset to be loaded in memory at run-time. Physically-based control gets rid of this limitation by having the output of the system operate on a physically-driven character. Coros et al. [2009] employ fitted value iteration with actions corresponding to optimized Proportional-Derivative (PD) controllers proposed by Yin et al. [2007]. These RL methods operate on value functions that have discrete domains, which do not represent the continuous nature of motion and impose run-time estimations through interpolation.

Deep RL methods, which use neural networks as powerful continuous function approximators, have recently started being used to address these limitations. Peng et al. [2017] apply a hierarchical actor-critic algorithm that outputs desired joint angles for PD-controllers. Their approach is applied on a simplified skeleton and does not express human-like quality of movement despite their style constraints. Imitation-learning based RL approaches [Baram et al. 2016; Ho and Ermon 2016] try to address this with adversarial learning, while others tackle the problem by penalizing the distance of a generated state from a reference state [Bergamin et al. 2019; Peng et al. 2018]. Actions as animation clips, or control fragments [Liu and Hodgins 2017], can also be used in a deep-RL framework with Q-learning to drive physically-based characters. These methods show impressive results for characters having physical interactions with the world, while still being limited to specific skills or short cyclic motions. We operate in our case in the kinematics domain and train on significantly more heterogeneous motions.

2.2 Motion Prediction
We limit here the definition of motion prediction to generating unconstrained motion continuation given single or multiple frames of animation as context. This task implies learning a powerful motion dynamics model, which is useful for transition generation. Neural networks have shown over the years to excel in such representation learning. Early work from Taylor et al. [2007] using Conditional Restricted Boltzmann Machines showed promising results on motion generation by sampling at each timestep the next frame of motion conditioned on the current hidden state and n previous frames. More recently, many RNN-based approaches have been proposed for motion prediction from a past context of several frames, motivated by the representational power of RNNs for temporal dynamics. Fragkiadaki et al. [2015] propose to separate spatial encoding and decoding from the temporal dependencies modeling with the Encoder-Recurrent-Decoder (ERD) networks, while Jain et al. [2016] apply structural RNNs to model human motion sequences represented as spatio-temporal graphs. Other recent approaches [Chiu et al. 2019; Gopalakrishnan et al. 2019; Liu et al. 2019; Martinez et al. 2017; Pavllo et al. 2019; Tang et al. 2018] investigate new architectures and loss functions to further improve short-term and long-term prediction of human motion. Others [Ghosh et al. 2017; Li et al. 2017] investigate ways to prevent divergence or collapsing to the average pose for long-term predictions with RNNs. In this work, we start by building a powerful motion predictor based on the state-of-the-art recurrent architecture for long-term prediction
proposed by Chiu et al. [2019]. We combine this architecture with the feed-forward encoders of Harvey et al. [2018] applied to different parts of the input to allow our embedding modifiers to be applied on distinct parts of the inputs. In our case, we operate on joint-local quaternions for all bones, except for the root, for which we use quaternions and translations local to the last seed frame.

2.3 Transition generation
We define transition generation as a type of control with temporally sparse spatial constraints, i.e. where large gaps of motion must be filled without explicit conditioning during the missing frames, such as trajectory or contact information. This is related to keyframe or motion interpolation (e.g. [Ciccone et al. 2019]), but our work extends interpolation in that the system allows for generating whole cycles of motion, which cannot be done by most key-based interpolation techniques, such as spline fitting. Pioneering approaches [Cohen et al. 1996; Witkin and Kass 1988] on transition generation and interpolation used spacetime constraints and inverse kinematics to produce physically-plausible motion between keyframes. Work with probabilistic models of human motion has also been used for filling gaps of animation. These include the MAP optimizers of Chai et al. [2007] and Min et al. [2009], the Gaussian process dynamical models from Wang et al. [2008] and Markov models with dynamic auto-regressive forests from Lehrmann et al. [2014]. All of these present specific models for given actions and actors. This can make combinations of actions look scripted and sequential. The scalability and expressiveness of deep neural networks has been applied to keyframe animation by Zhang et al. [2018], who use an RNN conditioned on keyframes to produce jumping motions for a simple 2D model. Harvey et al. [2018] present Recurrent Transition Networks (RTN) that operate on a more complex human character, but work on fixed-length transitions with positional data only and are deterministic. We use the core architecture of the RTN as we make use of the separately encoded inputs to apply our latent modifiers. Hernandez et al. [2019] recently applied convolutional adversarial networks to pose the problem of prediction or transition generation as an in-painting one, given the success of convolutional generative adversarial networks on such tasks. They also propose frequency-based losses to assess motion quality, but do not provide a detailed evaluation for the task of in-betweening.

3 METHODS
3.1 Data formatting
We use a humanoid skeleton that has j = 28 joints when using the Human3.6M dataset and j = 22 in the case of the LaFAN1 dataset. We use a local quaternion vector q_t of j ∗ 4 dimensions as our main data representation along with a 3-dimensional global root velocity vector ṙ_t at each timestep t. We also extract from the data, based on toes and feet velocities, contact information as a binary vector c_t of 4 dimensions that we use when working with the LaFAN1 dataset. The offset vectors o_t^r and o_t^q contain respectively the global root position's offset and the local-quaternions' offsets from the target keyframe at time t. Even though the quaternion offset could be expressed as a valid, normalized quaternion, we found that using simpler element-wise linear differences simplifies learning and yields better performance. When computing our positional loss, we reformat the predicted state into a global positions vector p_{t+1} using q_{t+1}, r_{t+1} and the stored, constant local bone translations b by performing Forward Kinematics (FK). The resulting vector p_{t+1} has j ∗ 3 dimensions. We also retrieve through FK the global quaternions vector g_{t+1}, which we use for quantitatively evaluating transitions. The discriminators use as input sequences of 3-dimensional global root velocity vectors ṙ, concatenated with x and ẋ, the root-relative positions and velocities of all other bones respectively. The vectors x and ẋ both have (j − 1) ∗ 3 dimensions.

To simplify the learning process, we rotate each input sequence seen by the network around the Y axis (up) so that the root of the skeleton points towards the X+ axis on the last frame of past context. Each transition thus starts with the same global horizontal facing. We refer to rotations and positions relative to this frame as global in the rest of this work. When using the network inside a content creation software, we store the applied rotation in order to rotate back the generated motion to fit the context. Note however that this has no effect on the public Human3.6M dataset since root transformations are set to the identity on the first frame of any sequence, regardless of the actual world orientation. We also augment the data by mirroring the sequences over the X+ axis with a probability of 0.5 during training.
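To make the facing normalization above concrete, the following is a minimal NumPy sketch of re-orienting a window so that the root faces X+ on the last seed frame. The helper names, the (w, x, y, z) quaternion ordering, the +Z local forward axis and the choice of pivot are our assumptions; the released dataset code may implement this differently.

```python
import numpy as np

def quat_mul(a, b):
    """Hamilton product of quaternion arrays shaped (..., 4), ordered (w, x, y, z)."""
    aw, ax, ay, az = np.moveaxis(a, -1, 0)
    bw, bx, by, bz = np.moveaxis(b, -1, 0)
    return np.stack([aw*bw - ax*bx - ay*by - az*bz,
                     aw*bx + ax*bw + ay*bz - az*by,
                     aw*by - ax*bz + ay*bw + az*bx,
                     aw*bz + ax*by - ay*bx + az*bw], axis=-1)

def quat_rotate(q, v):
    """Rotate 3-vectors v (..., 3) by unit quaternions q (..., 4); shapes must broadcast."""
    qv = np.concatenate([np.zeros_like(v[..., :1]), v], axis=-1)
    q_conj = q * np.array([1.0, -1.0, -1.0, -1.0])
    return quat_mul(quat_mul(q, qv), q_conj)[..., 1:]

def face_x_plus(root_pos, root_quat, last_ctx=9, fwd_local=np.array([0.0, 0.0, 1.0])):
    """Rotate a motion window (T frames) about the world Y (up) axis so the root
    faces +X at the last frame of past context (index `last_ctx`, 10 seed frames)."""
    fwd = quat_rotate(root_quat[last_ctx], fwd_local)                 # world-space forward
    phi = np.arctan2(fwd[2], fwd[0])                                  # horizontal angle away from +X
    q_corr = np.array([np.cos(phi / 2), 0.0, np.sin(phi / 2), 0.0])   # rotation of phi about +Y
    pivot = root_pos[last_ctx] * np.array([1.0, 0.0, 1.0])            # ground point under the root (assumption)
    new_pos = quat_rotate(q_corr, root_pos - pivot) + pivot           # (T, 3)
    new_quat = quat_mul(q_corr, root_quat)                            # world-frame pre-multiplication, (T, 4)
    return new_pos, new_quat, q_corr                                  # keep q_corr to rotate results back
```

The mirroring augmentation mentioned above can then be applied with probability 0.5 on top of the re-oriented window.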
3.2 Transition Generator
Figure 2 presents a visual depiction of our recurrent generator for a single timestep. It uses the same input separation used by the RTN network [Harvey and Pal 2018], but operates on angular data and uses FK in order to retrieve global positions [Pavllo et al. 2019]. It is also augmented with our latent space modifiers z_tta and z_target. Finally, it also uses different losses, such as an adversarial loss for improved realism of the generated motions.

As seen in Figure 2, the generator has three different encoders that take the different data vectors described above as inputs: the character state encoder, the offset encoder, and the target encoder. The encoders are all fully-connected Feed-Forward Networks (FFN) with a hidden layer of 512 units and an output layer of 256 units. All layers use the Piecewise Linear Activation function (PLU) [Nicolae 2018], which performed slightly better than Rectified Linear Units (ReLU) in our experiments. The time-to-arrival embedding z_tta has 256 dimensions and is added to the latent input representations. Offset and target embeddings h_t^offset and h_t^target are then concatenated and added to the 512-dimensional target-noise vector z_target. Next, the three augmented embeddings are concatenated and fed as input to a recurrent Long-Short-Term-Memory (LSTM) layer. The embedding from the recurrent layer, h_t^LSTM, is then fed to the decoder, another FFN with two PLU hidden layers of 512 and 256 units respectively and a linear output layer. The resulting output is separated into predicted local-quaternion velocities and root velocities to retrieve the next character state. When working with the LaFAN1 dataset, the decoder has four extra output dimensions that go through a sigmoid non-linearity φ to retrieve contact predictions ĉ_{t+1}. The estimated quaternions q̂_{t+1} are normalized as valid unit quaternions and used along with the new root position r̂_{t+1} and the constant bone offsets b to perform FK and retrieve the new global positions p̂_{t+1}.

Fig. 2. Overview of the TG_complete architecture for in-betweening. Computations for a single timestep are shown. Visual concatenation of input boxes or arrows represents vector concatenation. Green boxes are the jointly trained neural networks. Dashed boxes represent our two proposed embedding modifiers. The "quat norm" and "FK" red boxes represent the quaternion normalization and Forward Kinematics operations respectively. The ⊕ sign represents element-wise addition and φ is the sigmoid non-linearity. Outputs are linked to associated losses with dashed lines.
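The following PyTorch sketch summarizes one generation step of the architecture described above and shown in Figure 2. It is illustrative only: the LSTM size, the exact composition of the state and offset vectors, and the element-wise integration of the predicted deltas are our assumptions, and ReLU is substituted for the PLU activation used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def ffn_encoder(in_dim):
    # The paper uses PLU activations [Nicolae 2018]; ReLU is used here for brevity.
    return nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(), nn.Linear(512, 256))

class TransitionGeneratorStep(nn.Module):
    """Single-timestep sketch of the recurrent generator of Figure 2 (Section 3.2)."""
    def __init__(self, j=22, n_contacts=4, lstm_size=1024):  # lstm_size is our assumption
        super().__init__()
        self.j = j
        self.state_enc  = ffn_encoder(j * 4 + 3 + n_contacts)  # q_t, root velocity, contacts
        self.offset_enc = ffn_encoder(j * 4 + 3)               # quaternion and root-position offsets
        self.target_enc = ffn_encoder(j * 4)                   # target keyframe quaternions
        self.lstm = nn.LSTM(256 + 512, lstm_size, batch_first=True)
        self.decoder = nn.Sequential(nn.Linear(lstm_size, 512), nn.ReLU(),
                                     nn.Linear(512, 256), nn.ReLU(),
                                     nn.Linear(256, j * 4 + 3 + n_contacts))

    def forward(self, state, offset, target, z_tta, z_target, hidden=None):
        h_state  = self.state_enc(state)   + z_tta   # additive time-to-arrival embedding (256-d)
        h_offset = self.offset_enc(offset) + z_tta
        h_target = self.target_enc(target) + z_tta
        h_ot = torch.cat([h_offset, h_target], dim=-1) + z_target  # scheduled target noise (512-d)
        x = torch.cat([h_state, h_ot], dim=-1).unsqueeze(1)        # (batch, 1, 768)
        h, hidden = self.lstm(x, hidden)
        out = self.decoder(h.squeeze(1))
        dq = out[:, : self.j * 4]                     # local-quaternion deltas
        dr = out[:, self.j * 4 : self.j * 4 + 3]      # root velocity
        contacts = torch.sigmoid(out[:, self.j * 4 + 3:])
        # Integrate and re-normalize quaternions; global positions then come from FK (not shown).
        q_next = F.normalize((state[:, : self.j * 4] + dq).view(-1, self.j, 4), dim=-1)
        return q_next, dr, contacts, hidden
```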
3.3 Time-to-arrival embeddings
We present here our method to allow robustness to variable lengths of in-betweening. In order to achieve this, simply adding conditioning information about the target keyframe is insufficient since the recurrent layer must be aware of the number of frames left until the target must be reached. This is essential to produce a smooth transition without teleportation or stalling. Transformer networks [Vaswani et al. 2017] are attention-based models that are increasingly used in natural language processing due to their state-of-the-art modeling capacity. They are sequence-to-sequence models that do not use recurrent layers but require positional encodings that modify a word embedding to represent its location in a sentence. Our problem is also a sequence-to-sequence task where we translate a sequence of seed frames to a transition sequence, with additional conditioning on the target keyframe. Although our generator does not rely on attention, we adopt the same sinusoidal formulation to encode the time remaining before the target keyframe:

$$z_{tta,2i} = \sin\left(\frac{tta}{basis^{2i/d}}\right), \qquad z_{tta,2i+1} = \cos\left(\frac{tta}{basis^{2i/d}}\right),$$

where tta is the number of timesteps until arrival and the second subscript of the vector z_tta represents the dimension index. The value d is the dimensionality of the input embeddings, and i ∈ [0, ..., d/2]. The basis component influences the rate of change in frequencies along the embedding dimensions. It is set to 10,000 as in most transformer implementations.

Time-to-arrival embeddings thus provide continuous codes that shift input representations in the latent space smoothly and uniquely for each transition step due to the phase and frequency shifts of the sinusoidal waves on each dimension. Such an embedding is thus bounded, smooth and dense, three characteristics beneficial for learning. Its additive nature makes it harder for a neural network to ignore, as can be the case with concatenation methods. This follows the successful trend in computer vision [Dumoulin et al. 2017; Perez et al. 2018] of conditioning through transformations of the latent space instead of conditioning with input concatenation. In these cases, the conditioning signals are significantly more complex and the affine transformations need to be learned, whereas Vaswani et al. [2017] report similar performance when using this sine-based formulation as when using learned embeddings.

It is said that positional encodings can generalize to longer sequences in the natural language domain. However, since z_tta evolves backwards in time to retrieve a time-to-arrival representation, generalizing to longer sequences becomes a more difficult challenge. Indeed, in the case of Transformers (without temporal reversal), the first embeddings of the sequence are always the same and smoothly evolve towards new ones when generalizing to longer sequences. In our case, longer sequences drastically change the initial embedding seen and may thus generate unstable hidden states inside the recurrent layer before the transition begins. This can hurt performance on the first frames of transitions when extending the time-horizon after training. To alleviate this problem, we define a maximum duration in which we allow z_tta to vary, and fix it past this maximum duration. Precisely, the maximum duration T_max(z_tta) is set to T_max(trans) + T_past − 5, where T_max(trans) is the maximum transition length seen during training and T_past is the number of seed frames given before the transition. This means that when dealing with transitions of length T_max(trans), the model sees a constant z_tta for 5 frames before it starts to vary. This allows the network to handle a constant z_tta and to keep the benefits of this augmentation even when generalizing to longer transitions. Visual representations of z_tta and the effects of T_max(z_tta) are shown in Appendix A.2.

We explored simpler approaches to induce temporal awareness, such as concatenating a time-to-arrival dimension either to the inputs of the state encoder, or to the LSTM layer's inputs. This tta dimension is a single scalar increasing from 0 to 1 during the transition. Its period of increase is set to T_max(z_tta). Results comparing these methods with a temporally unaware network, and our use of z_tta, can be visualized in Figure 3.
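As a concrete reference, here is a small NumPy sketch of the time-to-arrival embedding with the clamping described above. The even/odd sine–cosine layout follows the formulation given in this section; the function and argument names are ours.

```python
import numpy as np

def tta_embedding(tta, d=256, basis=10000.0, tta_max=None):
    """Sinusoidal time-to-arrival embedding z_tta (Section 3.3).
    `tta` is the number of frames remaining before the target keyframe; it is
    clamped to `tta_max` = T_max(z_tta) so that longer transitions reuse a
    constant embedding instead of unseen ones."""
    if tta_max is not None:
        tta = min(tta, tta_max)
    i = np.arange(d // 2)
    angles = tta / (basis ** (2 * i / d))
    z = np.empty(d)
    z[0::2] = np.sin(angles)   # even dimensions
    z[1::2] = np.cos(angles)   # odd dimensions
    return z

# Example: with 10 seed frames and training transitions of at most 30 frames
# (the LaFAN1 setting), T_max(z_tta) = 30 + 10 - 5 = 35.
z = tta_embedding(tta=42, tta_max=35)
```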
Fig. 3. Reducing the L2Q loss with z_tta. We compare simple interpolation with our temporally unaware model (TG-Q) on the walking subset of Human 3.6M. We further test two strategies based on adding a single tta dimension either to the character state (TG-Q + tta input scalar) or the LSTM inputs (TG-Q + tta LSTM scalar). Finally, our use of time-to-arrival embeddings (TG-Q + z_tta) yields the best results, mostly noticeable at the end of transitions, where the generated motion is smoother than interpolation.
where T is the sequence length. The two main losses that we use are the local-quaternion loss L_quat and the root-position loss L_root. The former is computed over all joints' local rotations, including the root node, which in this case also determines global orientation. The latter is responsible for the learning of the global root displacement. As an additional reconstruction loss, we use a positional loss L_pos that is computed on the global position of each joint retrieved through FK. In theory, the use of L_pos isn't necessary to achieve a perfect reconstruction of the character state when using L_quat and L_root, but as noted by Pavllo et al. [2019], using global positions helps to implicitly weight the orientations of the bone hierarchy for better results. As we will show in Section 4.2, adding this loss indeed improves results on both quaternion and translation reconstructions. Finally, in order to allow for runtime Inverse-Kinematics (IK) correction of the legs inside an animation software, we also use a contact prediction loss L_contacts between predicted contacts ĉ_t and true contacts c_t. We use the contact predictions at runtime to indicate when to perform IK on each leg. This loss is used only for models trained on the LaFAN1 dataset and that are deployed in our MotionBuilder plugin.

3.6.2 Adversarial Losses. We use the Least Square GAN (LSGAN) formulation [Mao et al. 2017]. As our discriminators operate on sliding windows of motion, we average their losses over time. Our LSGAN losses are defined as follows:

$$L_{gen} = \frac{1}{2}\,\mathbb{E}_{X_p, X_f \sim p_{Data}}\Big[\big(D(X_p, G(X_p, X_f), X_f) - 1\big)^2\Big], \qquad (8)$$

$$L_{disc} = \frac{1}{2}\,\mathbb{E}_{X_p, X_{trans}, X_f \sim p_{Data}}\Big[\big(D(X_p, X_{trans}, X_f) - 1\big)^2\Big] + \frac{1}{2}\,\mathbb{E}_{X_p, X_f \sim p_{Data}}\Big[\big(D(X_p, G(X_p, X_f), X_f)\big)^2\Big], \qquad (9)$$

where X_p, X_f, and X_trans represent the past context, target state, and transition respectively, in the discriminator input format described in Section 3.1. G is the transition generator network. Both discriminators use the same loss, with different input sequence lengths.
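For reference, Eqs. (8)–(9) translate into a few lines of PyTorch. This is a generic LSGAN sketch for one critic; detaching the generated transition for the discriminator update is standard practice rather than something specified in the text.

```python
import torch

def lsgan_losses(D, x_past, x_trans_real, x_trans_fake, x_future):
    """Least-squares GAN objectives of Eqs. (8)-(9) for one discriminator.
    D scores a (past, transition, future) triplet per sliding window; its scores
    are assumed to be already averaged over time in this sketch."""
    d_real = D(x_past, x_trans_real, x_future)
    d_fake = D(x_past, x_trans_fake.detach(), x_future)   # stop gradients into the generator
    loss_disc = 0.5 * ((d_real - 1.0) ** 2).mean() + 0.5 * (d_fake ** 2).mean()
    loss_gen = 0.5 * ((D(x_past, x_trans_fake, x_future) - 1.0) ** 2).mean()
    return loss_gen, loss_disc
```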
3.7 Training
3.7.1 Progressive growing of transitions. In order to accelerate training, we adopt a curriculum learning strategy with respect to the transition lengths. Each training starts at the first epoch with P_min = P̃_max = 5, where P_min and P̃_max are the minimal and current maximal transition lengths. During training, we increase P̃_max until it reaches the true maximum transition length P_max. The increase rate is set by the number of epochs n_ep-max by which we wish to have reached P̃_max = P_max. For each minibatch, we sample the current transition length uniformly between P_min and P̃_max, making the network train with variable-length transitions, while beginning the training with simple tasks only. In our experiments, this leads to similar results as using any teacher forcing strategy, while accelerating the beginning of training due to the shorter batches. Empirically, it also outperformed gradient clipping. At evaluation time, the transition length is fixed to the desired length.

3.7.2 Sliding critics. In practice, our discriminators are implemented as 1D temporal convolutions, with strides of 1, without padding, and with receptive fields of 1 in the last 2 layers, yielding parallel feed-forward networks for each motion window in the sequence.

3.7.3 Hyperparameters. In all of our experiments, we use minibatches of 32 sequences of variable lengths as explained above. We use the AMSgrad optimizer [Reddi et al. 2018] with a learning rate of 0.001 and adjusted parameters (β1 = 0.5, β2 = 0.9) for increased stability. We scale all of our losses to be approximately equal on the LaFAN1 dataset for an untrained network before tuning them with custom weights. In all of our experiments, these relative weights (when applicable) are 1.0 for L_quat and L_root, 0.5 for L_pos, and 0.1 for L_gen and L_contacts. The target noise's standard deviation σ_target is 0.5. In experiments on Human3.6M, we set n_ep-max to 5 while it is set to 3 on the larger LaFAN1 dataset.
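A minimal sketch of the transition-length curriculum of Section 3.7.1 is given below, using the LaFAN1 settings (P_min = 5, P_max = 30 frames, n_ep-max = 3); the linear growth of P̃_max is our assumption, as the exact schedule shape is not specified.

```python
import random

def max_transition_length(epoch, p_min=5, p_max=30, n_ep_max=3):
    """Current maximum transition length P~_max: it grows from P_min to P_max
    over the first `n_ep_max` epochs (linear growth assumed)."""
    if epoch >= n_ep_max:
        return p_max
    return round(p_min + (p_max - p_min) * epoch / n_ep_max)

def sample_transition_length(epoch, p_min=5, p_max=30, n_ep_max=3):
    """Per-minibatch transition length, sampled uniformly in [P_min, P~_max]."""
    return random.randint(p_min, max_transition_length(epoch, p_min, p_max, n_ep_max))
```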
4 EXPERIMENTS AND RESULTS
4.1 Motion prediction
Based on recent advances in motion prediction, we first build a motion prediction network that yields state-of-the-art results. We evaluate our model on the popular motion prediction benchmark that uses the Human 3.6M dataset. We follow the evaluation protocol defined by Jain et al. [2016], which we base on the code from Martinez et al. [2017]. We train the networks for 40,500 iterations before evaluation. We use the core architecture of Harvey et al. [2018] since their separate encoders allow us to apply our embedding modifiers. In the case of unconstrained prediction however, this is more similar to the Encoder-Recurrent-Decoder (ERD) networks from Fragkiadaki et al. [2015]. We also apply the velocity-based input representation of Chiu et al. [2019], which seems to be a key component to improve performance. This is empirically shown in our experiments for motion prediction, but as we will see in Section 4.2, it doesn't hold for transition generation, where the character state as an input is more informative than velocities to produce correct transitions, evaluated on global angles and positions. Another difference lies in our data representation, which is based on quaternions instead of exponential maps. We call our architecture for motion prediction the ERD-Quaternion Velocity network (ERD-QV). This model is therefore similar to the one depicted in Figure 2, with quaternion velocities q̇_t as the only inputs of the state encoder instead of q_t and ṙ_t, and without the two other encoders and their inputs. No embedding modifier and no FK are used in this case, and the only loss used is the L1 norm on joint-local quaternions. In this evaluation, the root transform is ignored, to be consistent with previous works.

In Table 1, we compare this model with the TP-RNN, which obtains to our knowledge state-of-the-art results for Euler angle differences. We also compare with two variants of the VGRU architecture proposed by Gopalakrishnan et al. [2019], who propose a novel Normalized Power Spectrum Similarity (NPSS) metric for motion prediction that is more correlated to human assessment of quality for motion. Note that in most cases, we improve upon the TP-RNN for angular errors and perform better than the VGRU-d proposed by Gopalakrishnan et al. [2019] on their proposed metric. This allows us to confirm the performance of our chosen architecture as the basis of our transition generation model.
Table 1. Unconstrained motion prediction results on Human 3.6M. The VGRU-d/rl models are from [Gopalakrishnan et al. 2019]. The TP-RNN is from [Chiu et al. 2019] and has to our knowledge the best published results on motion prediction for this benchmark. Our model, ERD-QV, is competitive with the state-of-the-art on angular errors and improves performance with respect to the recently proposed NPSS metric on all actions.

4.2 Walking in-betweens on Human 3.6M
Given our highly performing prediction architecture, we now build upon it to produce a transition generator (TG). We start off by first adding conditioning information about the future target and current offset to the target, and then sequentially add our proposed contributions to show their quantitative benefits on a novel transition benchmark. Even though the Human 3.6M dataset is one of the most used in motion prediction research, most of the actions it contains are ill-suited for long-term prediction or transitions (e.g. smoking, discussion, phoning, ...) as they consist of sporadic, random short movements that are impossible to predict beyond some short time horizons. We thus choose to use a subset of the Human 3.6M dataset consisting only of the three walk-related actions (walking, walkingdog, walkingtogether) as they are more interesting to test for transitions over 0.5 seconds long. Like previous studies, we work with a 25Hz sampling rate and thus subsample the original 50Hz data. The walking data subset has 55,710 frames in total and we keep Subject 5 as the test subject. The test set is composed of

of the root's XZ positions on all joints' XZ positions. We report the L2P metric as it is arguably a better metric than any angular loss for assessing the visual quality of transitions with global displacements. However, it is not complete in that bone orientations might be wrong even with the right positions. We also report NPSS scores, which are based on angular frequency comparisons with the ground truth. Our results are shown in Table 2. Our first baseline consists of a

Table 2. Transition generation benchmark on Human 3.6M. Models were trained with transition lengths of maximum 50 frames, but are evaluated beyond this horizon, up to 100 frames (4 seconds).

L2Q
Length (frames)        5     10    25    50    75    100   AVG
Interpolation          0.22  0.43  0.84  1.09  1.48  2.03  1.02
TG-QV                  0.36  0.51  0.76  1.08  1.54  1.97  1.04
TG-Q                   0.33  0.48  0.76  1.05  1.40  1.79  0.97
+L_pos                 0.32  0.45  0.74  1.04  1.40  1.80  0.97
+z_tta                 0.26  0.40  0.70  0.96  1.30  1.67  0.88
+z_target              0.26  0.40  0.68  0.94  1.22  1.56  0.84
+L_gen (TG_complete)   0.24  0.38  0.68  0.93  1.20  1.49  0.82
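The interpolation baseline reported in Table 2 (and later in Table 3) can be reproduced in spirit with a slerp on joint-local quaternions and a lerp on the root position between the last seed frame and the target keyframe. This is a sketch of the usual construction; the exact baseline in the released benchmark code may differ in details such as root handling.

```python
import numpy as np

def slerp(q0, q1, w):
    """Spherical linear interpolation between unit quaternions q0, q1 of shape (..., 4)."""
    dot = np.sum(q0 * q1, axis=-1, keepdims=True)
    q1 = np.where(dot < 0.0, -q1, q1)                 # take the shortest path
    dot = np.abs(np.clip(dot, -1.0, 1.0))
    theta = np.arccos(dot)
    sin_theta = np.sin(theta)
    a = np.where(sin_theta > 1e-6, np.sin((1 - w) * theta) / sin_theta, 1 - w)
    b = np.where(sin_theta > 1e-6, np.sin(w * theta) / sin_theta, w)
    return a * q0 + b * q1

def interpolation_baseline(q_last, q_target, r_last, r_target, n_trans):
    """Naive in-betweening: slerp joint rotations and lerp the root position
    over the n_trans frames strictly between the last seed frame and the target."""
    ws = np.arange(1, n_trans + 1) / (n_trans + 1)
    quats = np.stack([slerp(q_last, q_target, w) for w in ws])        # (n_trans, j, 4)
    roots = np.stack([(1 - w) * r_last + w * r_target for w in ws])   # (n_trans, 3)
    return quats, roots
```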
the global positional loss (+L_pos) as an additional training signal, which improves performance on most metrics and lengths. We then add the unconstrained time-to-arrival embedding modifier (+z_tta) and observe our most significant improvement. These effects on 50-frame transitions are summarized in Figure 3. Next, we evaluate the effects of our scheduled target embedding modifier z_target. Note that it is turned off for quantitative evaluation. The effects are minor for transitions of 5 and 10 frames, but z_target is shown to generally improve performance for longer transitions. We argue that these improvements come from the fact that this target noise probably helps generalizing to new sequences as it improves the model's robustness to new or noisy conditioning information. Finally, we obtain our complete model (TG_complete) by adding our adversarial loss L_gen, which interestingly not only improves the visual results of the generated motions, but also most of the quantitative scores.

Qualitatively, enabling the target noise allows the model to produce variations of the same transitions, and it is trivial to control the level of variation by controlling σ_target. We compare our approach to a simpler variant that also aims at inducing stochasticity in the generated transition. In this variant, we aim at potentially disambiguating the missing target information, such as velocities, by concatenating a random noise vector z_concat to the target keyframe input q_T. This is similar to a strategy used in conditional GANs to avoid mode collapse given the condition. Figure 4 and the accompanying video show typical results obtained with our technique against this more classical technique.

4.3 Scaling up with the LaFAN1 dataset
Given our model selection based on the Human 3.6M walking benchmark discussed above, we further test our complete model on a novel, high quality motion dataset containing a wide range of actions, often with significant global displacements interesting for in-betweening compared to the Human3.6M dataset. This dataset contains 496,672 motion frames sampled at 30Hz and captured in a production-grade MOCAP studio. It contains actions performed by 5 subjects, with Subject 5 used as the test set. Similarly to the procedure used for the Human3.6M walking subset, our test set is made of regularly-sampled motion windows. Given the larger size of this dataset, we sample our test windows from Subject 5 every 40 frames, and thus retrieve 2232 windows for evaluation. The training statistics for normalization are computed on windows of 50 frames offset by 20 frames. Once again our starting baseline is a normal interpolation. We make public this new dataset along with accompanying code that allows one to extract the same training set and statistics as in this work, to extract the same test set, and to evaluate naive baselines (zero-velocity and interpolation) on this test set for our in-betweening benchmark. We hope this will facilitate future research and comparisons on the task of transition generation. We train our models on this dataset for 350,000 iterations on Subjects 1 to 4. We then go on to compare a reconstruction-based, future-conditioned Transition Generator (TG_rec) using L_quat, L_root, L_pos and L_contacts with our augmented adversarial Transition Generator (TG_complete) that adds our proposed embedding modifiers z_tta and z_target and our adversarial loss L_gen. Results are presented in Table 3. Our contributions improve performance on all quantitative measurements. On this larger dataset with more complex movements, our proposed in-betweeners surpass interpolation even on the very short transitions, as opposed to what was observed on the Human3.6M walking subset. This motivates the use of our system even on short time-horizons.

Table 3. Improving in-betweening on the LaFAN1 dataset. Models were trained with transition lengths of maximum 30 frames (1 second), and are evaluated on 5, 15, 30, and 45 frames.

L2Q
Length (frames)  5       15      30      45
Interpolation    0.22    0.62    0.98    1.25
TG_rec           0.21    0.48    0.83    1.20
TG_complete      0.17    0.42    0.69    0.94

L2P
Interpolation    0.37    1.25    2.32    3.45
TG_rec           0.32    0.85    1.82    3.89
TG_complete      0.23    0.65    1.28    2.24

NPSS
Interpolation    0.0023  0.0391  0.2013  0.4493
TG_rec           0.0025  0.0304  0.1608  0.4547
TG_complete      0.0020  0.0258  0.1328  0.3311

4.4 Practical use inside an animation software
In order to also qualitatively test our models, we deploy networks trained on LaFAN1 in a custom plugin inside Autodesk's MotionBuilder, a widely used animation authoring and editing software. This enables the use of our model on user-defined keyframes or the generation of transitions between existing clips of animation. Figure 5 shows an example scene with an incomplete sequence alongside our user interface for the plugin. The Source Character is the one from which keyframes are extracted while the generated frames are applied onto the Target Character's skeleton. In this setup it is trivial to re-sample different transitions while controlling the level of target noise through the Variation parameter. A variation of 0 makes the model deterministic. Changing the temporal or spatial location of the target keyframes and producing new animations is also trivial. Such examples of variations can be seen in Figure 6. The user can decide to apply IK guided by the network's contact predictions through the Enable IK checkbox. An example of the workflow and rendered results can be seen in the accompanying video.

Fig. 5. Generating animations inside MotionBuilder. On the left is a scene where the last seed frame and target keyframe are visible. On the right is our user interface for the plugin that allows, among other things, to specify the level of scheduled target noise for the next generation through the variation parameter, and to use the network's contact predictions to apply IK. At the bottom is the timeline where the gap of missing motion is visible.
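The Enable IK option relies on the contact predictions described in Sections 3.1 and 3.2. The sketch below shows one simple way to turn those per-frame predictions into foot-lock IK targets; the 0.5 threshold, the joint ordering and the locking scheme are our assumptions, and the actual plugin delegates the IK solve to MotionBuilder.

```python
import numpy as np

def contact_ik_targets(contact_probs, foot_positions, threshold=0.5):
    """Turn per-frame contact predictions into foot-lock IK targets.
    contact_probs: (T, 4) sigmoid outputs for the feet and toe joints (ordering assumed);
    foot_positions: (T, 4, 3) FK positions of the same joints.
    While a joint stays in contact, its IK target is frozen at the position it had
    when the contact started."""
    T, J, _ = foot_positions.shape
    targets = foot_positions.copy()
    locked = np.zeros(J, dtype=bool)
    anchor = np.zeros((J, 3))
    for t in range(T):
        in_contact = contact_probs[t] > threshold
        for j in range(J):
            if in_contact[j] and not locked[j]:
                locked[j], anchor[j] = True, foot_positions[t, j]   # contact starts: pin the joint
            elif not in_contact[j]:
                locked[j] = False                                    # contact ends: release
            if locked[j]:
                targets[t, j] = anchor[j]
    return targets
```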
Table 4. Speed performance summary of our MotionBuilder plugin. The model inference also includes the IK postprocess. The last column indicates the time taken to produce a string of 10 transitions of 30 frames. Everything is run on an Intel Xeon CPU E5-1650 @ 3.20GHz.

Transition time (s)      0.50  1.00  2.00  10 x 1.00
Keyframe extraction (s)  0.01  0.01  0.01  0.01
Model inference (s)      0.30  0.31  0.31  0.40
Applying keyframes (s)   0.72  1.05  1.65  6.79
Total (s)                1.03  1.37  1.97  7.20

5 DISCUSSION
5.1 Additive modifiers
We found our time-to-arrival and scheduled target noise additive modifiers to be very effective for robustness to time variations and for enabling sampling capabilities. We explored relatively simpler concatenation-based methods that showed worse performance. We hypothesize that concatenating time-to-arrival or noise dimensions is often less efficient because the neural network can learn to ignore those extra dimensions, which are not crucial at the beginning of training. Additive embedding modifiers however impose a shift in latent space and are thus harder to bypass.
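As a toy illustration of this point, compare the two conditioning styles on a single 256-dimensional embedding (shapes follow Section 3.2; the example itself is ours):

```python
import torch

h = torch.randn(32, 256)   # batch of latent input embeddings from one encoder
z_tta = torch.randn(256)   # conditioning code, e.g. a time-to-arrival embedding

# Concatenation: the next linear layer can drive the weights of the extra
# 256 dimensions towards zero and effectively ignore the conditioning.
h_concat = torch.cat([h, z_tta.expand(32, -1)], dim=-1)   # (32, 512)

# Additive modifier: every downstream unit receives a shifted latent code,
# so the conditioning cannot be bypassed without also distorting h itself.
h_additive = h + z_tta                                     # (32, 256)
```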
5.5 Recurrent cell types
Some recent works on motion prediction prefer Gated Recurrent Units (GRU) over LSTMs for their lower parameter count, but our empirical performance comparisons favored LSTMs over GRUs.

6 LIMITATIONS AND FUTURE WORK
A more informative way of representing the current offset to the target o_t would be to include positional offsets in the representation. For this to be informative however, it would need to rely on character-local or global positions, which require FK. Although it is possible to perform FK inside the network at every step of generation, the backward pass during training becomes prohibitively slow, justifying our use of root and rotational offsets only.

As with many data-driven approaches, our method struggles to generate transitions for which the conditions are unrealistic, or outside the range covered by the training set.

Our scheduled target noise allows us to modify to some extent the manner in which a character reaches its target, reminiscent of changing the style of the motion, but it does not yet allow control over those variations. Style control given a fixed context would be very interesting but is out of the scope of this work.

7 CONCLUSION
In this work we first showed that state-of-the-art motion predictors cannot be converted into robust transition generators by simply adding conditioning information about the target keyframe. We proposed a time-to-arrival embedding modifier to allow robustness to transition lengths, and a scheduled target noise modifier to allow robustness to target keyframe variations and to enable sampling capabilities in the system. We showed how such a system allows animators to quickly generate quality motion between sparse keyframes inside an animation software. We also presented LaFAN1, a new high quality dataset well suited for transition generation benchmarking.

ACKNOWLEDGMENTS
We thank Ubisoft Montreal, the Natural Sciences and Engineering Research Council of Canada and Mitacs for their support. We also thank Daniel Holden, Julien Roy, Paul Barde, Marc-André Carbonneau and Olivier Pomarez for their support and valuable feedback.

REFERENCES
Okan Arikan and David A Forsyth. 2002. Interactive motion generation from examples. In ACM Transactions on Graphics (TOG), Vol. 21. ACM, 483–490.
Nir Baram, Oron Anschel, and Shie Mannor. 2016. Model-based Adversarial Imitation Learning. arXiv preprint arXiv:1612.02179 (2016).
Emad Barsoum, John Kender, and Zicheng Liu. 2018. HP-GAN: Probabilistic 3D human motion prediction via GAN. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 1418–1427.
Philippe Beaudoin, Stelian Coros, Michiel van de Panne, and Pierre Poulin. 2008. Motion-motif graphs. In Proceedings of the 2008 ACM SIGGRAPH/Eurographics Symposium on Computer Animation. Eurographics Association, 117–126.
Kevin Bergamin, Simon Clavet, Daniel Holden, and James Richard Forbes. 2019. DReCon: data-driven responsive control of physics-based characters. ACM Transactions on Graphics (TOG) 38, 6 (2019), 1–11.
Michael Büttner and Simon Clavet. 2015. Motion Matching - The Road to Next Gen Animation. In Proc. of Nucl.ai 2015. https://www.youtube.com/watch?v=z_wpgHFSWss&t=658s
Jinxiang Chai and Jessica K Hodgins. 2005. Performance animation from low-dimensional control signals. In ACM Transactions on Graphics (TOG), Vol. 24. ACM, 686–696.
Jinxiang Chai and Jessica K Hodgins. 2007. Constraint-based motion optimization using a statistical dynamic model. ACM Transactions on Graphics (TOG) 26, 3 (2007), 8.
Hsu-kuang Chiu, Ehsan Adeli, Borui Wang, De-An Huang, and Juan Carlos Niebles. 2019. Action-agnostic human pose forecasting. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 1423–1432.
Loïc Ciccone, Cengiz Öztireli, and Robert W. Sumner. 2019. Tangent-space Optimization for Interactive Animation Control. ACM Trans. Graph. 38, 4, Article 101 (July 2019), 10 pages. https://doi.org/10.1145/3306346.3322938
Michael Cohen, Brian Guenter, Bobby Bodenheimer, and Charles Rose. 1996. Efficient Generation of Motion Transitions Using Spacetime Constraints. In SIGGRAPH 96. Association for Computing Machinery, Inc. https://www.microsoft.com/en-us/research/publication/efficient-generation-of-motion-transitions-using-spacetime-constraints/
Stelian Coros, Philippe Beaudoin, and Michiel van de Panne. 2009. Robust task-based control policies for physics-based characters. In ACM Transactions on Graphics (TOG), Vol. 28. ACM, 170.
Vincent Dumoulin, Jonathon Shlens, and Manjunath Kudlur. 2017. A Learned Representation For Artistic Style. ICLR (2017). https://arxiv.org/abs/1610.07629
Katerina Fragkiadaki, Sergey Levine, Panna Felsen, and Jitendra Malik. 2015. Recurrent network models for human dynamics. In Proceedings of the IEEE International Conference on Computer Vision. 4346–4354.
Partha Ghosh, Jie Song, Emre Aksan, and Otmar Hilliges. 2017. Learning Human Motion Models for Long-term Predictions. arXiv preprint arXiv:1704.02827 (2017).
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in neural information processing systems. 2672–2680.
Anand Gopalakrishnan, Ankur Mali, Dan Kifer, Lee Giles, and Alexander G Ororbia. 2019. A neural temporal model for human motion prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 12116–12125.
Keith Grochow, Steven L Martin, Aaron Hertzmann, and Zoran Popović. 2004. Style-based inverse kinematics. In ACM Transactions on Graphics (TOG), Vol. 23. ACM, 522–531.
Liang-Yan Gui, Yu-Xiong Wang, Xiaodan Liang, and José MF Moura. 2018. Adversarial geometry-aware human motion prediction. In Proceedings of the European Conference on Computer Vision (ECCV). 786–803.
Félix G Harvey and Christopher Pal. 2018. Recurrent transition networks for character locomotion. In SIGGRAPH Asia 2018 Technical Briefs. ACM, 4.
Alejandro Hernandez, Jurgen Gall, and Francesc Moreno-Noguer. 2019. Human Motion Prediction via Spatio-Temporal Inpainting. In Proceedings of the IEEE International Conference on Computer Vision. 7134–7143.
Jonathan Ho and Stefano Ermon. 2016. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems. 4565–4573.
Daniel Holden. 2018. Robust solving of optical motion capture data by denoising. ACM Transactions on Graphics (TOG) 37, 4 (2018), 165.
Daniel Holden, Taku Komura, and Jun Saito. 2017. Phase-functioned neural networks for character control. ACM Transactions on Graphics (TOG) 36, 4 (2017), 42.
Daniel Holden, Jun Saito, and Taku Komura. 2016. A deep learning framework for character motion synthesis and editing. ACM Transactions on Graphics (TOG) 35, 4 (2016), 138.
Daniel Holden, Jun Saito, Taku Komura, and Thomas Joyce. 2015. Learning motion manifolds with convolutional autoencoders. In SIGGRAPH Asia 2015 Technical Briefs. ACM, 18.
Ashesh Jain, Amir R Zamir, Silvio Savarese, and Ashutosh Saxena. 2016. Structural-RNN: Deep learning on spatio-temporal graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5308–5317.
Lucas Kovar, Michael Gleicher, and Frédéric Pighin. 2008. Motion graphs. In ACM SIGGRAPH 2008 classes. ACM, 51.
Jehee Lee, Jinxiang Chai, Paul SA Reitsma, Jessica K Hodgins, and Nancy S Pollard. 2002. Interactive control of avatars animated with human motion data. In ACM Transactions on Graphics (TOG), Vol. 21. ACM, 491–500.
Jehee Lee and Kang Hoon Lee. 2006. Precomputing avatar behavior from human motion data. Graphical Models 68, 2 (2006), 158–174.
Kyungho Lee, Seyoung Lee, and Jehee Lee. 2018. Interactive character animation by learning multi-objective control. In SIGGRAPH Asia 2018 Technical Papers. ACM, 180.
Yongjoon Lee, Kevin Wampler, Gilbert Bernstein, Jovan Popović, and Zoran Popović. 2010. Motion fields for interactive character locomotion. In ACM Transactions on Graphics (TOG), Vol. 29. ACM, 138.
Andreas M Lehrmann, Peter V Gehler, and Sebastian Nowozin. 2014. Efficient nonlinear markov models for human motion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1314–1321.
Sergey Levine, Jack M Wang, Alexis Haraux, Zoran Popović, and Vladlen Koltun. 2012. Continuous character control with low-dimensional embeddings. ACM Transactions on Graphics (TOG) 31, 4 (2012), 28.
Zimo Li, Yi Zhou, Shuangjiu Xiao, Chong He, and Hao Li. 2017. Auto-Conditioned LSTM Network for Extended Complex Human Motion Synthesis. arXiv preprint arXiv:1707.05363 (2017).
Libin Liu and Jessica Hodgins. 2017. Learning to schedule control fragments for physics-based characters using deep q-learning. ACM Transactions on Graphics (TOG) 36, 3 (2017), 29.
Zhenguang Liu, Shuang Wu, Shuyuan Jin, Qi Liu, Shijian Lu, Roger Zimmermann, and Li Cheng. 2019. Towards Natural and Accurate Future Motion Prediction of Humans and Animals. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen Paul Smolley. 2017. Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision. 2794–2802.
Julieta Martinez, Michael J Black, and Javier Romero. 2017. On human motion prediction using recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2891–2900.
Jianyuan Min and Jinxiang Chai. 2012. Motion graphs++: a compact generative model for semantic motion analysis and synthesis. ACM Transactions on Graphics (TOG) 31, 6 (2012), 153.
Jianyuan Min, Yen-Lin Chen, and Jinxiang Chai. 2009. Interactive generation of human animation with deformable motion models. ACM Transactions on Graphics (TOG) 29, 1 (2009), 9.
Andrei Nicolae. 2018. PLU: The Piecewise Linear Unit Activation Function. arXiv preprint arXiv:1809.09534 (2018).
Dario Pavllo, Christoph Feichtenhofer, Michael Auli, and David Grangier. 2019. Modeling Human Motion with Quaternion-Based Neural Networks. International Journal of Computer Vision (2019), 1–18.
Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel van de Panne. 2018. DeepMimic: Example-Guided Deep Reinforcement Learning of Physics-Based Character Skills. ACM Transactions on Graphics (Proc. SIGGRAPH 2018) 37, 4 (2018).
Xue Bin Peng, Glen Berseth, KangKang Yin, and Michiel Van De Panne. 2017. DeepLoco: Dynamic locomotion skills using hierarchical deep reinforcement learning. ACM Transactions on Graphics (TOG) 36, 4 (2017), 41.
Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. 2018. FiLM: Visual reasoning with a general conditioning layer. In Thirty-Second AAAI Conference on Artificial Intelligence.
Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. 2018. On the Convergence of Adam and Beyond. In International Conference on Learning Representations. https://openreview.net/forum?id=ryQu7f-RZ
Alla Safonova and Jessica K Hodgins. 2007. Construction and optimal search of interpolated motion graphs. In ACM Transactions on Graphics (TOG), Vol. 26. ACM, 106.
Sebastian Starke, He Zhang, Taku Komura, and Jun Saito. 2019. Neural state machine for character-scene interactions. ACM Transactions on Graphics (TOG) 38, 6 (2019), 1–14.
Yongyi Tang, Lin Ma, Wei Liu, and Weishi Zheng. 2018. Long-term human motion prediction by modeling motion context and enhancing motion dynamic. arXiv preprint arXiv:1805.02513 (2018).
Jochen Tautges, Arno Zinke, Björn Krüger, Jan Baumann, Andreas Weber, Thomas Helten, Meinard Müller, Hans-Peter Seidel, and Bernd Eberhardt. 2011. Motion reconstruction using sparse accelerometer data. ACM Transactions on Graphics (TOG) 30, 3 (2011), 18.
Graham W Taylor, Geoffrey E Hinton, and Sam T Roweis. 2007. Modeling human motion using binary latent variables. In Advances in neural information processing systems. 1345–1352.
Adrien Treuille, Yongjoon Lee, and Zoran Popović. 2007. Near-optimal character animation with continuous control. ACM Transactions on Graphics (TOG) 26, 3 (2007), 7.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems. 5998–6008.
Jack M Wang, David J Fleet, and Aaron Hertzmann. 2008. Gaussian process dynamical models for human motion. IEEE Transactions on Pattern Analysis and Machine Intelligence 30, 2 (2008), 283–298.
Andrew Witkin and Michael Kass. 1988. Spacetime Constraints. In Proceedings of the 15th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH '88). ACM, New York, NY, USA, 159–168. https://doi.org/10.1145/54852.378507
Yuting Ye and C Karen Liu. 2010. Synthesis of responsive motion using a dynamic model. In Computer Graphics Forum, Vol. 29. Wiley Online Library, 555–562.
KangKang Yin, Kevin Loken, and Michiel van de Panne. 2007. Simbicon: Simple biped locomotion control. In ACM Transactions on Graphics (TOG), Vol. 26. ACM, 105.
He Zhang, Sebastian Starke, Taku Komura, and Jun Saito. 2018. Mode-Adaptive Neural Networks for Quadruped Motion Control. ACM Transactions on Graphics (TOG) 37, 4 (2018).
Xinyi Zhang and Michiel van de Panne. 2018. Data-driven autocompletion for keyframe animation. In Proceedings of the 11th Annual International Conference on Motion, Interaction, and Games. ACM, 10.

A APPENDIX
A.1 Sliding critics
Fig. 7. Visual summary of the two-timescale critics. Blue frames are the given contexts and green frames correspond to the transition. First and last critic positions are shown without transparency. At the beginning and end of transitions, the critics are conditional in that they include ground-truth context in their input sequences. Scalar scores at each timestep are averaged to get the final score.

A.2 Time-to-arrival embedding visualization
Fig. 8. Visual depiction of time-to-arrival embeddings. Sub-figure (b) shows the effect of using T_max(z_tta), which in practice improves performance when generalizing to longer transitions as it prevents initializing the LSTM hidden state with novel embeddings.