
SignAvatars: A Large-scale 3D Sign Language Holistic Motion Dataset and Benchmark

Zhengdi Yu1,2, Shaoli Huang2⋆, Yongkang Cheng2, and Tolga Birdal1
1 Imperial College London, London, United Kingdom
2 Tencent AI Lab, Shenzhen, China

arXiv:2310.20436v3 [cs.CV] 2 Jul 2024

Abstract. We present SignAvatars3, the first large-scale, multi-prompt
3D sign language (SL) motion dataset designed to bridge the communication
gap for Deaf and hard-of-hearing individuals. While there has been an
exponentially growing body of research on digital communication, the
majority of existing communication technologies primarily cater to spoken
or written languages, instead of SL, the essential communication method
for Deaf and hard-of-hearing communities. Existing SL
datasets, dictionaries, and sign language production (SLP) methods are
typically limited to 2D as annotating 3D models and avatars for SL is
usually an entirely manual and labor-intensive process conducted by SL
experts, often resulting in unnatural avatars. In response to these chal-
lenges, we compile and curate the SignAvatars dataset, which comprises
70,000 videos from 153 signers, totaling 8.34 million frames, covering
both isolated signs and continuous, co-articulated signs, with multiple
prompts including HamNoSys, spoken language, and words. To yield 3D
holistic annotations, including meshes and biomechanically-valid poses
of body, hands, and face, as well as 2D and 3D keypoints, we intro-
duce an automated annotation pipeline operating on our large corpus
of SL videos. SignAvatars facilitates various tasks such as 3D sign lan-
guage recognition (SLR) and the novel 3D SL production (SLP) from
diverse inputs like text scripts, individual words, and HamNoSys nota-
tion. Hence, to evaluate the potential of SignAvatars, we further propose
a unified benchmark of 3D SL holistic motion production. We believe
that this work is a significant step forward towards bringing the digital
world to the Deaf and hard-of-hearing communities as well as people
interacting with them.

1 Introduction

According to the World Health Organization, there are 466 million Deaf and
hard-of-hearing people [8]. Among them, there are over 70 million who commu-
nicate via sign languages (SLs) resulting in more than 300 different SLs across
different communities [54]. The fields of (spoken) natural language processing
(NLP) and language-assisted computer vision (CV) are well explored.

⋆ Corresponding author.
3 https://signavatars.github.io/

[Figure 1: a) Data Collection; b) Sign Language Applications: (i) 3D SLP from language, (ii) 3D SLP from HamNoSys, (iii) 3D SLP from word.]

Fig. 1: Overview of SignAvatars, the first publicly available, large-scale multi-prompt 3D sign language holistic motion dataset. (upper row) We introduce a generic method to automatically annotate a large corpus of video data. (lower row) We propose a 3D SLP benchmark to produce plausible 3D holistic mesh motion and provide a neural architecture as well as baselines tailored for this novel task.

This is not the case for the alternate and important communicative tool of SL, and ac-
curate generative models of holistic 3D avatars as well as dictionaries are highly
desired for efficient learning [37].
We argue that the lack of large scale, targeted SL datasets is an important
reason for this gap putting a barrier in front of downstream tasks such as digital
simultaneous SL translators. On one hand, existing SL datasets and dictionar-
ies [1,2,7,9,17,20] are typically limited to 2D videos or 2D keypoints annotations,
which are insufficient for learners [29] as different signs could appear to be the
same in 2D domain due to depth ambiguity. On the other hand, while paramet-
ric holistic models exist for human bodies [39] or bodies & faces [59], there is
no unified, large-scale, multi-prompt 3D holistic motion dataset with accurate
hand mesh annotations, which are crucial for SL. The reason for this is that
the creation of 3D avatar annotation for SL is a labor-intensive, entirely manual
process conducted by SL experts and the results are often unnatural [3].
To address this challenge, we begin by gathering various data sources from
public datasets of continuous online videos with mixed-prompt annotations in-
cluding HamNoSys, spoken language, and word, and introduce the SignAvatars
dataset. Overall, we compile 70K videos from 153 signers amounting to 8.34M
frames. Unlike [14], our dataset is not limited to isolated signs (i.e., a single sign
per video annotated with HamNoSys), but also includes continuous and
co-articulated signs. To augment our dataset with 3D full-body annota-
tions, including 3D body, hand and face meshes as well as 2D & 3D keypoints,

we design an automated and generic annotation pipeline, in which we perform a
multi-objective optimization over 3D poses and shapes of face, hands and body.
Our optimizer considers the temporal information of the motion and respects the
biomechanical constraints in order to produce accurate hand poses, even in pres-
ence of complex, interacting hand gestures. Apart from meshes and SMPL-X [39]
models, we also provide a hand-only subset with MANO [43] annotations.
SignAvatars enables a multitude of tasks such as 3D sign language recognition
(SLR) or the novel 3D sign language production (SLP) from text scripts, individ-
ual words, and HamNoSys notation. To address the latter challenge and accom-
modate diverse forms of semantic input, we further propose a novel SLP baseline,
Sign-VQVAE, utilizing a semantic Vector Quantized Variational Autoencoder (VQ-VAE) [56],
capable of parallel linguistic feature generation (PLFG), effectively mapping the
various types of input data to discrete code indices. The output of PLFG mod-
ule is fused with a discrete motion encoder within an auto-regressive model to
generate sequences of code indices derived from these semantic representations,
strengthening the text-motion correlation. Consequently, our method can effi-
ciently generate sign motion from an extensive array of textual inputs, enhanc-
ing its versatility and adaptability to various forms of semantic information. We
will demonstrate in Sec. 5 that building such reliance and correlation between
the low-level discrete representations leads to accurate, natural and sign-motion
consistent SL production compared to direct regression from a high-level CLIP
feature.
Besides leveraging the existing benchmarks, to quantitatively & qualitatively
evaluate the potential of SignAvatars, we introduce a new SLP benchmark and
present the first results for 3D SL holistic mesh motion production from multiple
prompts including HamNoSys, spoken language, and word. On this benchmark,
we assess the performance of our Sign-VQVAE against the other baselines we
introduce, where we show a relative improvement of 200%. However, none of the
assessed models can truly match the desired accuracy, confirming the timeliness
and importance of SignAvatars.
As depicted in Fig. 1, our contributions are as follows:
– We introduce SignAvatars, the first large-scale multi-prompt 3D holistic mo-
tion SL dataset, containing diverse forms of semantic input.
– We provide accurate annotations for SignAvatars, in the form of expressive 3D
avatar meshes. We do so by utilizing a multi-objective optimization capable
of dealing with the complex interacting hands scenarios, while respecting the
biomechanical hand constraints. We initialize this fitting procedure by a novel
multi-stage, hierarchical process.
– We provide a new 3D sign language production (SLP) benchmark for SignA-
vatars, considering multiple prompts and full-body meshes.
– We further develop a VQVAE-based strong 3D SLP network significantly out-
performing the baselines, which are also introduced as part of our work.
We believe SignAvatars is a significant stepping stone towards bringing the 3D
digital world and 3D SL applications to the Deaf and hard-of-hearing communi-
ties, by fostering future research in 3D SL understanding.

2 Related Work

3D holistic mesh reconstruction. Recovering holistic 3D human body avatars
from RGB videos and parsing them into parametric forms like SMPL-X [39] or
Adam [23] is a well-explored area [32, 39, 59]. ARCTIC [12] introduces a full-body
dataset annotated with SMPL-X for 3D object manipulation. [18] provide a hand-
object constellation dataset with MANO annotations. However, such expres-
sive parametric models (like TalkShow [59] or OsX [32]) have rarely been applied
to the SL domain. [27] use off-the-shelf methods to estimate a holistic 3D mesh on
an existing dataset [55] but cannot deal with the challenging occlusions and interac-
tions, making them unsuitable for complex, real scenarios. SignBERT+ [19] pro-
posed the first self-supervised pre-trainable framework with model-aware hand
prior for sign language understanding (SLU). The latest concurrent work [14]
can reconstruct 3D holistic mesh for SL videos using linguistic priors with group
labels obtained from a sign-classifier trained on Corpus-based Dictionary of Pol-
ish Sign Language (CDPSL) [33], which is annotated with HamNoSys. As such, it
utilizes an existing sentence segmentation method [42] to generalize to multiple-
sign videos. Overall, the literature lacks a robust yet generic method handling
continuous and co-articulated SL videos with complex hand interactions.
SL datasets. While there have been many well-organized continuous 2D SL
motion datasets [1, 2, 7, 9, 17, 20], the only existing 3D SL motion dataset with
3D holistic mesh annotation is in [14]. As mentioned, this rather small dataset
only includes a single sign per video, annotated only with HamNoSys prompts. In contrast,
SignAvatars provides a multi-prompt 3D SL holistic motion dataset with con-
tinuous and co-articulated signs and fine-grained hand mesh annotations.
SL applications. [4] can generate 2D motion sequences from HamNoSys. [47]
and [49] are able to generate 3D keypoint sequences relying on glosses. The
avatar approaches are often hand-crafted and produce robotic and unnatural
movements. Apart from them, there are also early avatar approaches [5, 10, 11,
15, 64] with a pre-defined protocol and character. To the best of our knowledge,
we present the first large-scale 3D holistic SL motion dataset, SignAvatars. Built
upon the dataset, we also introduce the novel task and benchmark of 3D sign
language production, through different prompts (language, word, HamNoSys).

3 SignAvatars Dataset
Overview. SignAvatars is a holistic motion dataset composed of 70K video
clips having 8.34M frames in total, containing body, hand and face motions
as summarized in Tab. 2. We compile SignAvatars by gathering various data
sources from public datasets to online videos and form seven subsets, whose
distribution is reported in Fig. 2. Since the individual subsets do not naturally
contain expressive 3D whole-body motion labels and 2D keypoints, we introduce
a unified automatic annotation framework providing rich 3D holistic parametric
SMPL-X annotations along with MANO subsets for hands. Overall, we provide
117 hours of 70K video clips with 8.34M frames of motion data, annotated with
accurate, expressive holistic 3D meshes.

Data | Video | Frame | Duration (hours) | Co-articulated | Pose Annotation (to date) | Signer
RWTH-Phoenix-2014T [7] | 8.25K | 0.94M | 11 | C | - | 9
DGS Corpus [17] | - | - | 50 | C | 2D keypoints | 327
BSL Corpus [50] | - | - | 125 | C | - | 249
MS-ASL [24] | 25K | - | 25 | I | - | 222
WLASL [30] | 21K | 1.39M | 14 | I | 2D keypoints | 119
How2Sign [9] | 34K | 5.7M | 79 | C | 2D keypoints, depth* | 11
CSL-Daily [20] | 21K | - | 23 | C | 2D keypoints, depth | 10
SIGNUM [57] | 33K | - | 55 | C | - | 25
AUTSL [51] | 38K | - | 21 | I | depth | 43
SGNify [14] | 0.05K | 4K | - | I | body mesh vertices | -
SignAvatars (Ours) | 70K | 8.34M | 117 | Both | SMPL-X, MANO, 2D & 3D keypoints | 153

Table 1: Modalities of publicly available sign language datasets. C and I denote co-articulated (continuous) and isolated signs, respectively. * means the annotation has not been released yet. To the best of our knowledge, our dataset is the first publicly available 3D SL holistic continuous motion dataset with whole-body and hand mesh annotations and the most parallel modalities.

3.1 Dataset Characteristics

Expressive motion representation. To fill in the gaps of previous 2D-only
SL data, our expressive 3D holistic body annotation consists of face, hands, and
body, which is achieved by adopting SMPL-X [39]. It uses standard vertex-based
linear blend skinning with learned corrective blend shapes and has N = 10475
vertices and K = 67 joints. For a time interval $[1:T]$, $V_{1:T} = (v_1, \ldots, v_T)$, $J_{1:T} = (j_1, \ldots, j_T)$,
and $\theta_{1:T} = (\theta_1, \ldots, \theta_T)$ represent the mesh vertices, 3D joints, and poses in the
6D rotation representation [63], respectively. Here the pose $\theta_t$ includes the body pose
$\theta^{b}_{t} \in \mathbb{R}^{23 \times 6}$ with global orientation and the hand pose $\theta^{h}_{t} \in \mathbb{R}^{30 \times 6}$. Moreover,
$\theta^{f}_{t} \in \mathbb{R}^{6}$ and $\phi$ represent the jaw pose and facial expressions, respectively. For
each sequence, we use an optimized consistent shape parameter $\tilde{\beta}$, as there
is no signer change within each clip. Overall, a motion state $M_t$ is represented as
$M_t = (\theta^{b}_{t}, \theta^{h}_{t}, \theta^{f}_{t}, \phi, \tilde{\beta})$. Moreover, as shown in Tab. 1, our dataset also provides
a hand motion subset by replacing the parametric representation from SMPL-X
with MANO [43]: $M^{h}_{t} = (\theta^{h}_{t}, \tilde{\beta})$, where $h$ denotes the handedness and $\tilde{\beta}$ is also an
optimized consistent shape parameter.
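For concreteness, the sketch below shows one way such a per-frame motion state could be held in code, assuming PyTorch tensors and the dimensions stated above; the field names and default sizes are illustrative, not the dataset's actual schema.

```python
from dataclasses import dataclass
import torch

@dataclass
class MotionState:
    """One frame of the holistic motion representation (illustrative field names).

    All rotations are stored in the 6D rotation representation, so each joint
    contributes 6 numbers.
    """
    body_pose: torch.Tensor   # theta^b_t, shape (23, 6): body joints incl. global orientation
    hand_pose: torch.Tensor   # theta^h_t, shape (30, 6): both hands
    jaw_pose: torch.Tensor    # theta^f_t, shape (6,)
    expression: torch.Tensor  # phi, facial expression coefficients (size is illustrative)
    betas: torch.Tensor       # optimized consistent shape, shared across the clip

def make_empty_state(num_expr: int = 10, num_betas: int = 10) -> MotionState:
    # Zero-initialized placeholder with the shapes described in the paper.
    return MotionState(
        body_pose=torch.zeros(23, 6),
        hand_pose=torch.zeros(30, 6),
        jaw_pose=torch.zeros(6),
        expression=torch.zeros(num_expr),
        betas=torch.zeros(num_betas),
    )
```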
Sign language notation. Similar to spoken languages, sign languages have
special structures with a set of linguistic rules [6] (e.g. grammar, lexicons). Unlike
spoken languages, they have no standard written forms. Moreover, there are over
300 different sign languages across the world, as well as Deaf and hard-of-hearing
people who do not know any SL. Hence, having only a single type of annotation
is insufficient in practice. To enable more generic applications targeting different
users, our SL annotations include various modalities that can be categorized into
four common types: HamNoSys, spoken language, word, and gloss, which can be
used for a variety of downstream applications such as SLP and SLR.
Data sources. As shown in Tab. 2, SignAvatars leverages our unified auto-
matic annotation framework to collect SL motion sequences in diverse modal-
ities from various different sources. Specifically, for co-articulated SL datasets
like How2Sign [9] and How2 [45] with American Sign Language (ASL) transcrip-
tions, we collect sentence-level clips from the Green Screen studio subset with

multi-view frames, resulting in 34K clips for the ASL subset. For the GSL subset,
we mostly gathered data from the publicly available PHOENIX14T dataset [7],
following the official split, to obtain 8.25K video clips. For the HamNoSys subset,
we collect 5.8K isolated-sign SL video clips from the Polish SL corpus [34] for PJM,
and German Sign Language (DGS), Greek Sign Language (GRSL) and French
Sign Language (LSF) from DGS Corpus [40] and Dicta-Sign [35]. We finally
gathered 21K clips from word-level sources such as WLASL [30] to curate the
isolated-sign word subset. Overall, we divide our dataset into four subsets: (i)
word, (ii) ASL, (iii) HamNoSys, and (iv) GSL, based on the prompt categories as
shown in Fig. 2.
[Figure 2: histogram of the number of frames per clip (bins 0-60, 60-120, 120-180, 180-240, 240-300, >300) for the Word, ASL, HamNoSys, and GSL groups.]

Fig. 2: Distribution of subsets. The number of frames for each clip in different subsets. PJM, LSF, DGS and GRSL are gathered in one group.

Data | Video | Frame | Type | Signer
Word | 21K | 1.39M | W | 119
PJM | 2.6K | 0.21M | H | 2
DGS | 1.9K | 0.12M | H | 8
GRSL | 0.8K | 0.06M | H | 2
LSF | 0.4K | 0.03M | H | 2
ASL | 34K | 5.7M | S | 11
GSL | 8.3K | 0.83M | S, SG | 9
Ours | 70K | 8.34M | S, H, W, SG | 153

Table 2: Statistics of data sources. W, H, S, SG represent word, HamNoSys, sentence-level spoken language and sentence-level gloss.

3.2 Automatic Holistic Annotation

To efficiently auto-label the SL videos with motion data given only RGB online
videos, we design an automatic 3D SL annotation pipeline that is not limited to
isolated signs. To ensure motion stability and 3D shape accuracy, while main-
taining efficiency during holistic 3D mesh recovery from SL videos, we propose
an iterative fitting algorithm minimizing an objective heavily regularized both
holistically and by biomechanical hand constraints [52]:

\begin {split} E(\theta , \beta , \phi ) = \lambda _{J} L_{J} + \lambda _{\theta } L_{\theta } + \lambda _{\alpha } L_{\alpha } + \lambda _{\beta } L_{\beta } + \lambda _{s}L_{\mathrm {smooth}} + \lambda _{a}L_{\mathrm {angle}} + L_{\mathrm {bio}} \end {split} (1)

where θ is the full set of optimizable pose parameters, and ϕ is the facial ex-
pression. LJ represents the 2D re-projection joint loss, which penalizes the
difference between the joints extracted from the SMPL-X model and projected into
the image, and the joints predicted by ViTPose [58] and MediaPipe [25]. The 3D joints
can be jointly optimized in LJ when GT is available. Lθ is the pose prior term
following SMPLify-X [39]. Moreover, Lα is a prior penalizing extreme bending
only for elbows and knees and Lβ is the shape prior term. In addition, Lsmooth ,
Langle and Lbio are the smoothness regularization loss, the angle loss and the biomechanical
constraints, respectively. Finally, each λ denotes the influence weight of each loss
term. Please refer to the appendix for more details. In what follows, we describe
in detail our regularizers.
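As a rough illustration of how the terms of Eq. (1) combine, the sketch below assembles a weighted sum from already-computed scalar losses; the term names, weights, and numeric values are placeholders and not the settings used in our pipeline.

```python
import torch

def total_objective(losses: dict, weights: dict) -> torch.Tensor:
    """Weighted sum of individual energy terms, in the spirit of Eq. (1).

    `losses` maps term names (e.g. 'J', 'theta', 'alpha', 'beta', 'smooth',
    'angle', 'bio') to scalar tensors; `weights` maps the same names to their
    lambda values (terms without an entry default to weight 1.0, as L_bio does).
    """
    return sum(weights.get(name, 1.0) * value for name, value in losses.items())

# Example usage with placeholder scalar losses (values are illustrative only).
losses = {
    "J": torch.tensor(12.3),      # 2D re-projection joint loss
    "theta": torch.tensor(0.8),   # pose prior
    "alpha": torch.tensor(0.1),   # elbow/knee bending prior
    "beta": torch.tensor(0.05),   # shape prior
    "smooth": torch.tensor(0.4),  # temporal smoothness
    "angle": torch.tensor(0.2),   # joint-angle interval prior
    "bio": torch.tensor(0.3),     # biomechanical hand constraints
}
weights = {"J": 1.0, "theta": 0.1, "alpha": 0.1, "beta": 0.01, "smooth": 1.0, "angle": 1.0}
E = total_objective(losses, weights)
```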

[Figure 3: pipeline diagram. 2D pose estimators (ViTPose, MediaPipe) and regressors (PARE, OSX, ACR) provide a hierarchical initialization of the body, hand, and face keypoints and the SMPL-X parameters (θb, θh, β, cam, φ), which are then refined by the optimization with projection, smoothness, and biomechanical constraints to output Mt = (θb, θh, β, cam, φ).]

Fig. 3: Overview of our automatic annotation pipeline. Given an RGB image sequence
as input, we perform a hierarchical initialization, followed by an optimization involving
temporal smoothness and biomechanical constraints. Finally, our pipeline outputs the
final 3D motion results as a sequence of SMPL-X parameters.

Holistic regularization. To reduce the jitter on body and hand motion, caused
by the noisy 2D detected keypoints, we employ a smoothness term defined as:

L_{\mathrm {smooth}} = \sum _{t}\left (\|\hat {\theta }^{b}_{1:T}\|_{2} + \|\Tilde {\theta }_{1:T}^{h}\|_{2} + \| \theta ^{h}_{2:T} - \theta ^{h}_{1:T-1} \|_{2} + \| \theta ^{b}_{2:T} - \theta ^{b}_{1:T-1} \|_{2}\right ) (2)

where $\hat{\theta}^{b}_{1:T} \in \mathbb{R}^{N \times j_b \times 3}$ is the selected subset of pose parameters from $\theta^{b}_{1:T} \in \mathbb{R}^{N \times J \times 3}$,
and $N$ is the number of frames in the video. $\tilde{\theta}^{h}_{1:T} \in \mathbb{R}^{N \times j_h}$ is the selected
subset of hand parameters from $\theta^{h}_{1:T}$. $j_b$ and $j_h$ are the numbers of selected
body joints and hand parameters. Moreover, this also helps prevent implausible
poses along the bone direction, such as twists. Additionally, we penalize the
hand poses lying outside the plausible range by adding an angle limit prior term:

L_{\mathrm {angle}} = \sum _{t}(\mathcal {I}(\|\theta ^{h}_{1:T}\|_{2}; \theta _{\mathrm {min}}^{h}, \theta _{\mathrm {max}}^{h}) + \mathcal {I}(\| \theta ^{b}_{1:T}\|_{2}; \theta _{\mathrm {min}}^{b}, \theta _{\mathrm {max}}^{b})) (3)

where $\mathcal{I}$ is the interval loss penalizing outliers, $[\theta^{h,b}_{\mathrm{min}}, \theta^{h,b}_{\mathrm{max}}]$ are the pre-defined
intervals, and $\theta^{h}, \theta^{b}$ are the selected subsets of holistic poses. Finally, since the signer in each
video clip does not change, we can use optimized consistent shape
parameters β to represent the holistic body shape. Specifically, our fitting
procedure is split into five stages: we optimize the shape during the first
three stages to derive the mean shape and freeze it in the following stages.
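A minimal sketch of the temporal smoothness term of Eq. (2) and the interval penalty used by the angle-limit prior of Eq. (3) is given below, assuming the selected pose parameters are stored as PyTorch tensors; averaging instead of summing and the exact parameter selection are simplifications of our actual implementation.

```python
import torch

def smoothness_loss(body_pose: torch.Tensor, hand_pose: torch.Tensor) -> torch.Tensor:
    """Temporal smoothness, in the spirit of Eq. (2).

    body_pose: (N, j_b, D) selected body pose parameters over N frames.
    hand_pose: (N, j_h, D) selected hand pose parameters over N frames.
    Penalizes both the magnitude of the selected parameters and the
    frame-to-frame differences, damping jitter from noisy 2D keypoints.
    """
    magnitude = body_pose.norm(dim=-1).mean() + hand_pose.norm(dim=-1).mean()
    body_vel = (body_pose[1:] - body_pose[:-1]).norm(dim=-1).mean()
    hand_vel = (hand_pose[1:] - hand_pose[:-1]).norm(dim=-1).mean()
    return magnitude + body_vel + hand_vel

def interval_loss(x: torch.Tensor, lo: torch.Tensor, hi: torch.Tensor) -> torch.Tensor:
    """Interval penalty I(x; lo, hi) used by the angle-limit prior of Eq. (3):
    zero inside [lo, hi], growing linearly outside."""
    return (torch.relu(lo - x) + torch.relu(x - hi)).mean()
```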
Biomechanical hand constraints. Hand pose estimation from monocular
RGB images is challenging due to fast movements, interaction, frequent occlu-
sion and confusion. To further improve the hand motion quality and eliminate
implausible hand poses, we biomechanically constrain the hand poses using
three losses: (i) Lbl for bone length, (ii) Lpalm for palmar region optimization,
and (iii) Lja for joint angle priors. Specifically, the final biomechanical loss Lbio
is defined as the weighted sum Lbio = λbl Lbl + λpalm Lpalm + λja Lja , with:

\centering \begin {split} &L_{\mathrm {bl}} = \sum _{i}\mathcal {I}(\|b^{i}_{1:T}\|_{2}; b_{\mathrm {min}}^{i}, b_{\mathrm {max}}^{i}), \quad L_{\mathrm {ja}} = \sum _{i}D(\alpha _{1:T}^{i}, H^{i})\\ &L_{\mathrm {palm}} = \sum _{i}(\mathcal {I}(\|c^{i}_{1:T}\|_{2}; c_{\mathrm {min}}^{i}, c_{\mathrm {max}}^{i}) + \mathcal {I}(\|d^{i}_{1:T}\|_{2}; d_{\mathrm {min}}^{i}, d_{\mathrm {max}}^{i})), \end {split}
(4)

where $\mathcal{I}$ is the interval loss penalizing outliers, $b^{i}$ is the bone length of the $i$-th
finger bone, and the optimization constrains the whole sequence $[1:T]$. We
further constrain the curvature and angular distance of the four root bones
supporting the palmar structure by penalizing outliers of the curvature range
$[c^{i}_{\mathrm{min}}, c^{i}_{\mathrm{max}}]$ and the angular distance range $[d^{i}_{\mathrm{min}}, d^{i}_{\mathrm{max}}]$. Inspired by [52], we also apply
constraints to the sequence of joint angles $\alpha^{i}_{1:T} = (\alpha^{f}_{1:T}, \alpha^{a}_{1:T})$ by approximating
the convex hull on the $(\alpha^{f}, \alpha^{a})$ plane with the point set $H^{i}$ and minimizing their
distance $D$, where $(\alpha^{f}, \alpha^{a})$ are the flexion and abduction angles.
We refer the reader to our appendix for more details.
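As an illustration of the bone-length part of Eq. (4), the sketch below applies the same interval penalty to per-bone lengths computed from 3D hand joints; the parent-index helper is hypothetical, and the palm and joint-angle (convex-hull) terms are omitted.

```python
import torch

def bone_lengths(joints: torch.Tensor, parents: list) -> torch.Tensor:
    """Per-bone lengths from 3D hand joints.

    joints: (T, J, 3) hand joints over T frames; parents[j] is the parent index
    of joint j (parents[0] is the wrist root and is skipped).
    """
    child = torch.arange(1, joints.shape[1])
    parent = torch.tensor(parents[1:])
    return (joints[:, child] - joints[:, parent]).norm(dim=-1)  # (T, J-1)

def bone_length_loss(joints, parents, b_min, b_max):
    """Interval penalty on finger bone lengths, in the spirit of L_bl in Eq. (4).
    b_min, b_max: per-bone lower/upper bounds (broadcastable to (J-1,))."""
    b = bone_lengths(joints, parents)
    return (torch.relu(b_min - b) + torch.relu(b - b_max)).mean()
```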
Hierarchical initialization. Given an RGB image sequence, we initialize the
holistic SMPL-X parameters from OSX [32]. However, due to frequent occlusions
and hand interactions, OSX alone is not always sufficient for a good initialization.
Therefore, we further fuse OSX with ACR [60] and PARE [26] to improve stability
under occlusion and truncation. For 2D holistic keypoints initialization, we first
train a whole-body 2D pose estimation model on COCO-WholeBody [22] based
on ViTPose [58] and subsequently incorporate it with MediaPipe [25] by fusing
and feeding through a confidence-guided filter.
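The confidence-guided fusion of the two 2D keypoint estimates could look roughly like the sketch below; this is illustrative only, and the actual filter and thresholds in our pipeline may differ.

```python
import numpy as np

def fuse_keypoints(kpts_a: np.ndarray, kpts_b: np.ndarray, thresh: float = 0.3) -> np.ndarray:
    """Confidence-guided fusion of two 2D keypoint estimates (a minimal sketch).

    kpts_a, kpts_b: (J, 3) arrays of (x, y, confidence), e.g. from a ViTPose-style
    whole-body detector and MediaPipe. Per joint, keep the higher-confidence
    prediction and zero out joints where the kept confidence is below `thresh`.
    """
    take_a = kpts_a[:, 2] >= kpts_b[:, 2]
    fused = np.where(take_a[:, None], kpts_a, kpts_b)
    fused[fused[:, 2] < thresh] = 0.0  # mark unreliable joints as missing
    return fused
```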

4 SignVAE: A Strong 3D SLP Baseline


Our SignAvatars dataset enables the first applications to generate high-quality
and natural 3D sign language holistic motion along with 3D meshes from both
isolated and continuous SL prompts. To achieve this goal, motivated by the
fact that the text prompts are highly correlated and aligned with the motion
sequence, our method consists of a two-stage process designed to enhance the
understanding of varied inputs by focusing on both semantic and motion as-
pects. In the first stage, we develop two codebooks - a shared semantic codebook
and a motion codebook - by employing two Vector Quantized Variational Auto-
Encoders (VQ-VAE). This allows us to map various kinds of input data to their
corresponding semantic code indices and link motion elements to motion code
indices. In the second stage, we utilize an auto-regressive model to generate mo-
tion code indices based on the previously determined semantic code indices. This
integrated approach ensures a coherent and logical understanding of the input
data, effectively capturing both the semantic and motion-related information.
SL motion generation. To produce stable and natural holistic poses in space
and time, instead of directly mapping prompts to motion, we leverage the gen-
erative model VQ-VAE as our SL motion generator. As illustrated in Fig. 4,
our SL motion VQVAE consists of an autoencoder structure and a learnable
codebook $Z_m$, which contains $I$ codes $Z_m = \{z_i\}_{i=1}^{I}$ with $z_i \in \mathbb{R}^{d_z}$. We first
encode the given 3D SL motion sequence $M_{1:T} = (\theta^{b}_{1:T}, \theta^{h}_{1:T}, \theta^{f}_{1:T}, \phi)$, where $T$ is the
motion length, into latent features $F^{m}_{1:(T/w)} = (f^{m}_{1}, \ldots, f^{m}_{T/w}) \in \mathbb{R}^{d_z}$, where
$w = 4$ is used as the temporal downsampling rate (window size). Subsequently, we
quantize the latent feature embedding by searching for the nearest-neighbour
code in the codebook $Z_m$.

[Figure 4: SignVAE architecture. Semantic inputs (e.g. "How are you?", "Walk") are embedded by a linguistic encoder and quantized with the shared semantic codebook Z_l; holistic motion M_{1:t} is encoded and quantized with the motion codebook Z_m; a semantic adapter and an autoregressive transformer fuse the linguistic feature with code index vectors and generate motion code indices (with an end token), which are decoded into the reconstructed motion M̂_{1:t}.]
Fig. 4: Our 3D SLP network, SignVAE, consists of two stages. We first create semantic
and motion codebooks using two VQ-VAEs, mapping inputs to their respective code
indices. Then, by an auto-regressive model, we generate motion code indices based on
semantic code indices, ensuring a coherent understanding of the data.

For the $j$-th feature, the quantization code is found by
$\hat{f}^{m}_{j} = \arg\min_{z_i \in Z_m} \|f^{m}_{j} - z_i\|_{2}$. Finally, the quantized latent features are fed into
decoders for reconstruction. In terms of the training of the SL motion generator,
we apply the standard optimization scheme with $L_{m\text{-}vq}$:

L_{m-vq} = L_{recon}(M_{1:T}, \hat {M}_{1:T}) + \| sg[F_{1:T}^{m}] - \hat {F_{1:T}^{m}}\|_{2} + \beta \| F_{1:T}^{m} - \mathrm {sg}[\hat {F_{1:T}^{m}}]\|_{2} (5)

where $L_{recon}$ is the MSE loss, $\beta$ is a hyper-parameter, and $\mathrm{sg}[\cdot]$ is the stop-gradient
(detach) operation. We provide more details regarding the network
architecture and training in our appendix.
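A minimal sketch of the nearest-neighbour codebook lookup and the codebook/commitment terms used in both VQ-VAEs is shown below; the reconstruction term of Eq. (5) is omitted, the losses use a squared-error form, and the straight-through trick is one standard way to pass gradients through the argmin rather than a statement of our exact implementation.

```python
import torch

def quantize(latents: torch.Tensor, codebook: torch.Tensor):
    """Nearest-neighbour codebook lookup (a minimal sketch).

    latents:  (B, L, d_z) encoder outputs F.
    codebook: (I, d_z) learnable codes Z.
    Returns the raw quantized codes, a straight-through copy for the decoder,
    and the selected code indices.
    """
    dists = ((latents.unsqueeze(-2) - codebook) ** 2).sum(-1)   # (B, L, I)
    indices = dists.argmin(dim=-1)                              # (B, L)
    quantized = codebook[indices]                               # (B, L, d_z)
    # Straight-through estimator: gradients bypass the non-differentiable argmin.
    quantized_st = latents + (quantized - latents).detach()
    return quantized, quantized_st, indices

def vq_losses(latents: torch.Tensor, quantized: torch.Tensor, beta: float = 0.25):
    """Codebook and commitment terms in the spirit of Eq. (5) (reconstruction omitted)."""
    codebook_loss = (latents.detach() - quantized).pow(2).mean()
    commit_loss = beta * (latents - quantized.detach()).pow(2).mean()
    return codebook_loss + commit_loss
```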
Prompt feature extraction for parallel linguistic feature generation.
For efficient learning, typical motion generation tasks usually leverage an LLM
to produce linguistic prior (condition) c given an input prompt. In our task of
spoken language and word-level annotation, we leverage CLIP [41] as our prompt
encoder to obtain the text embedding $E^{l}$. However, this does not extend to all
the other SL annotations we desire. As a remedy, to enable applications with different
prompts such as HamNoSys, instead of relying on the existing pre-trained
CLIP, we define a new prompt encoder for embedding. After quantizing the
prompt (e.g. a HamNoSys glyph) into tokens of length $s$, we use an embedding
layer to produce the linguistic feature $\hat{E}^{l}_{1:s} = (\hat{e}^{l}_{1}, \ldots, \hat{e}^{l}_{s})$ with the same dimension $d_l$
as the text embeddings of CLIP [41]. For simplicity, we use "text" to represent
all different input prompts. Subsequently, motivated by the fact that the text
prompts are highly correlated and aligned with the motion sequence, we propose
a linguistic VQVAE as our parallel linguistic feature generator (PLFG) module
coupled with the SL motion generator. In particular, we leverage a similar quan-
tization process using the codebook Zl and training scheme as in the SL motion

Table 3: Quantitative comparisons on the EHF dataset. *, †, and ‡ denote optimization-based, regression-based, and hybrid methods, respectively.

Method | MPVPE (Holistic / Hands / Face) | PA-MPVPE (Holistic / Hands / Face) | PA-MPJPE (Body / Hands)
SMPLify-X [39]* | - / - / - | 65.3 / 75.4 / 12.3 | 62.6 / 12.9
FrankMocap [44]† | 107.6 / 42.8 / - | 57.5 / 12.6 / - | 62.3 / 12.9
PIXIE [13]† | 89.2 / 42.8 / 32.7 | 55.0 / 11.1 / 4.6 | 61.5 / 11.6
Hand4Whole [36]† | 76.8 / 39.8 / 26.1 | 50.3 / 10.8 / 5.8 | 60.4 / 10.8
PyMAF-X [61]† | 64.9 / 29.7 / 19.7 | 50.2 / 10.2 / 5.5 | 52.8 / 10.3
OSX [32]† | 70.8 / - / - | 48.7 / - / - | 55.6 / -
Motion-X [31]‡ | 44.7 / - / - | 31.8 / - / - | 33.5 / -
Motion-X w/ GT 3D kpt [31]‡ | 30.7 / - / - | 19.7 / - / - | 23.9 / -
Ours (w/o bio)* | 21.6 / 12.5 / 7.8 | 14.2 / 5.4 / 4.3 | 16.5 / 6.2
Ours* | 20.1 / 9.7 / 7.8 | 12.9 / 4.7 / 4.3 | 15.6 / 5.8

generator to yield linguistic features:

L_{l-vq} = L_{recon}(E^{l}_{1:s}, \hat {E}^{l}_{1:s}) + \| sg[F^{l}_{1:s}] - \hat {F^{l}_{1:s}}\|_{2} + \beta \| F^{l}_{1:s} - sg[\hat {F^{l}_{1:s}}]\|_{2} (6)

where $F^{l}_{1:s}$ is the latent feature after encoding the initial linguistic feature, and $\hat{F}^{l}_{1:s}$
is the quantized linguistic feature obtained by applying $\hat{f}^{l}_{j} = \arg\min_{z_i \in Z_l} \|f^{l}_{j} - z_i\|_{2}$ to $F^{l}_{1:s}$.
Sign-motion cross modelling and production. After training the VQVAE-based
SL motion generator, we can map any motion sequence $M_{1:T}$ to a sequence
of indices $X = [x_1, \ldots, x_{T/w}, x_{\mathrm{EOS}}]$ through the motion encoder and quantization,
where $x_{\mathrm{EOS}}$ is a learnable end token representing the stop signal. After training
both the SL motion generator and the linguistic feature generator, our network
is jointly optimized in a parallel manner. Specifically, we fuse the linguistic
feature embedding $E^{l}$ and the codebook index vectors of $Z_l$ to formulate the final
condition for our autoregressive code index generator. The objective for training
the code index generator can be seen as an autoregressive next-index prediction
task, learned with a cross-entropy loss over the likelihood of the full predicted
code index sequence given the real ones: $L_{SLP} = \mathbb{E}_{X \sim p(X)}[-\log p(X|c)]$.
Lastly, with the quantized motion representation, we generate the codebook
vectors in a temporally autoregressive manner and predict the distribution of the
next codebook index given an input linguistic prompt as the linguistic condition $c$.
After mapping the codebook indices $X$ to the quantized motion representation
$\hat{F}^{m}_{1:(T/w)}$, we are able to decode and produce the final 3D holistic motion with
mesh representation $M_{1:T}$.
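The training objective and greedy decoding of the autoregressive code-index generator can be sketched as follows; `step_fn` stands for any conditional model returning next-index logits and is a hypothetical interface, not our exact architecture.

```python
import torch
import torch.nn.functional as F

def index_prediction_loss(logits: torch.Tensor, target_indices: torch.Tensor) -> torch.Tensor:
    """Cross-entropy objective L_SLP for next-index prediction (teacher forcing).

    logits: (B, L, I+1) scores over the I motion codes plus the end token,
            predicted from the fused linguistic condition c and previous indices.
    target_indices: (B, L) ground-truth code indices with the end token appended.
    """
    return F.cross_entropy(logits.flatten(0, 1), target_indices.flatten())

@torch.no_grad()
def generate_indices(step_fn, condition, end_token: int, max_len: int = 64):
    """Greedy autoregressive decoding: `step_fn(condition, prefix)` returns a
    1-D tensor of next-index logits (hypothetical interface)."""
    prefix = []
    for _ in range(max_len):
        next_idx = step_fn(condition, prefix).argmax(-1).item()
        if next_idx == end_token:
            break
        prefix.append(next_idx)
    return prefix
```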

5 Experimental Evaluation

We now showcase the effectiveness of our contributions individually, namely,


the 3D reconstruction and annotation pipeline as well as the 3D sign language
production. Note that, with SignAvatars we present the first benchmark results
for 3D holistic SL motion production yielding mesh representations.

[Figure 6: qualitative comparison; columns show Input, Ours, OSX, and PIXIE.]
Fig. 6: Comparing 3D holistic human mesh reconstruction methods on EHF
dataset [39]. Our annotation method produces significantly better holistic reconstruc-
tions with plausible poses and the best pixel alignment. (Zoom in for a better view)

5.1 Evaluating the Annotation Pipeline


We start by assessing our optimization-based automatic annotation approach.
Due to the availability of ground truth, we evaluate our method against the state-
of-the-art hand and holistic human mesh recovery methods, including SMPLify-X [39],
OSX [32], PyMAF-X [61], and PIXIE [13], on the standard EHF dataset [39].
Evaluation metrics. For quantitative evaluation, we follow prior works and
compute the mean per-vertex error (MPVPE), the mean per-vertex error after Procrustes
alignment (PA-MPVPE), and the mean per-joint error after Procrustes alignment
(PA-MPJPE).
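For reference, a minimal NumPy sketch of the Procrustes-aligned error used for PA-MPVPE/PA-MPJPE is given below; it is a standard similarity alignment, not necessarily our exact evaluation script.

```python
import numpy as np

def procrustes_align(pred: np.ndarray, gt: np.ndarray) -> np.ndarray:
    """Similarity (Procrustes) alignment of predicted joints/vertices to ground truth.
    pred, gt: (N, 3). Returns the aligned prediction."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    p, g = pred - mu_p, gt - mu_g
    U, S, Vt = np.linalg.svd(p.T @ g)
    R = (U @ Vt).T
    if np.linalg.det(R) < 0:          # avoid reflections
        Vt[-1] *= -1
        S[-1] *= -1
        R = (U @ Vt).T
    scale = S.sum() / (p ** 2).sum()
    return scale * p @ R.T + mu_g

def pa_mpjpe(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean per-joint error after Procrustes alignment (same units as the inputs)."""
    aligned = procrustes_align(pred, gt)
    return float(np.linalg.norm(aligned - gt, axis=-1).mean())
```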
Results. It can be seen from Tab. 3 that our method significantly surpasses the
leading monocular holistic reconstruction methods by a large margin. Notably,
on PA-MPJPE we achieve a 40% improvement over SoTA [31]. Specifically, our
hand reconstruction error drops down to 4.7 PA-MPVPE when the biomechanical
constraints are integrated. The qualitative results presented in Fig. 12 validate
our superior reconstruction quality. These results consistently translate to our
dataset, SignAvatars, as shown by additional qualitative results in Fig. 5. It is
seen that our method yields significantly more natural body movement, as well
as accurate hand poses and better pixel-mesh aligned body shapes (β).

[Figure 5: qualitative comparison on SignAvatars; rows show Input, PyMAF-X, PIXIE, and Ours.]

Fig. 5: Comparison of 3D holistic body reconstruction. The results from PIXIE [13], PyMAF-X [61], and Ours on our dataset, SignAvatars.

5.2 3D SL Motion Generation on the SignAvatars Benchmark


We now evaluate the generative capabilities of our SignVAE model on the SignAvatars
dataset and provide ablation studies on the PLFG module.
Evaluation metrics. To fully assess the quality of our motion generation, we
evaluate the holistic motion as well as the arm motion4.
4 The lower body is not factored in our evaluations as it is unrelated to the SL motion.

[Figure 7: example outputs for sentence-level spoken language, HamNoSys, and word prompts.]

Fig. 7: Output of our reconstruction-based annotation pipeline for different types of input. Specifically, we present examples from the subsets of SignAvatars: (left) sentence-level spoken language from the ASL subset, (mid) HamNoSys-level examples from the HamNoSys subset, and (right) word-level examples from the word subset.

Based on an evaluation model trained following prior arts in motion generation [53, 62], we use the scores
and metrics of FID, Diversity, Multimodality (MM), MM-Dist, MR-precision,
whose details are provided in our supplementary material. Unfortunately, there
is no de-facto standard for evaluating 3D SLP in the literature. While [29] is
capable of back-translating 3D SL motion by treating it as a classification task, it
is tailored only for word-level back-translation. While BLEU and ROUGE are
commonly used in the back-translation evaluation [47, 49], they are not generic
for other types of annotations such as HamNoSys or glosses. Since the generated
motion might differ in length from the real motion, absolute metrics like MPJPE
would also be unsuited. Inspired by [4, 21], we propose a new MR-Precision
for motion retrieval, as well as DTW-MJE (Dynamic Time Warping - Mean
Joint Error) [28] computed on the standard SMPL-X keypoint set without the lower body, for
evaluating the performance of our method as well as the baselines.
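DTW-MJE can be sketched as the mean joint error accumulated along the optimal dynamic-time-warping path between the generated and reference keypoint sequences; the normalization by path length below is one common choice and may differ from our exact implementation.

```python
import numpy as np

def dtw_mje(pred: np.ndarray, gt: np.ndarray) -> float:
    """DTW-MJE: mean joint error along the optimal DTW path, so sequences of
    different lengths can be compared (a minimal sketch).

    pred: (Tp, J, 3) generated keypoint sequence (lower body excluded).
    gt:   (Tg, J, 3) reference keypoint sequence.
    """
    Tp, Tg = len(pred), len(gt)
    # Frame-to-frame cost: mean joint distance between pred frame i and gt frame j.
    cost = np.linalg.norm(pred[:, None] - gt[None], axis=-1).mean(-1)  # (Tp, Tg)
    acc = np.full((Tp + 1, Tg + 1), np.inf)
    acc[0, 0] = 0.0
    steps = np.zeros((Tp + 1, Tg + 1), dtype=int)  # path-length bookkeeping
    for i in range(1, Tp + 1):
        for j in range(1, Tg + 1):
            prev = min((acc[i - 1, j], steps[i - 1, j]),
                       (acc[i, j - 1], steps[i, j - 1]),
                       (acc[i - 1, j - 1], steps[i - 1, j - 1]))
            acc[i, j] = cost[i - 1, j - 1] + prev[0]
            steps[i, j] = prev[1] + 1
    return float(acc[Tp, Tg] / steps[Tp, Tg])
```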
Subsets & training settings. Specifically, we report results on three represen-
tative subsets: (i) the complete set of ASL for spoken language (corresponding
to language in Tab. 6), (ii) the word subset with a vocabulary of 300 words, and (iii) the combined
subset of DGS, LSF, PJM, and GRSL for HamNoSys. For training, we follow
the official splits for (i) and (ii). For (iii), we leverage a four-fold strategy in which
we train on three subsets and test on the remaining one, repeated four times to obtain the
final results.
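This four-fold protocol amounts to a leave-one-subset-out split, sketched below with illustrative subset names.

```python
def four_fold_splits(subsets=("DGS", "LSF", "PJM", "GRSL")):
    """Leave-one-subset-out splits for the HamNoSys group: train on three
    subsets and test on the remaining one, repeated four times."""
    for i, test in enumerate(subsets):
        train = [s for j, s in enumerate(subsets) if j != i]
        yield train, test

# Example: iterate over the four train/test configurations.
for train, test in four_fold_splits():
    print(train, "->", test)
```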
Benchmarking & results. To the best of our knowledge, there is no publicly
available benchmark for 3D mesh & motion-based SLP5 . To evaluate SignAvatars
as the first 3D motion-based SLP benchmark, we present detailed quantitative
results in Tab. 6. It can be seen that the 3D SLP with word-level prompts
can achieve the best performance reaching the quality of real motions. Learning
from spoken languages is a naturally harder task and we invite the community to
develop stronger methods to produce 3D SLP from spoken languages. To further
evaluate the sign accuracy and effect of body movement, we report separate
results for individual arms (e.g. ”Gesture”), with slight improvements in FID

5
[46] does not provide a public evaluation model as discussed in our Appendix.

Group | Type | R-Precision (↑) top 1 / 3 / 5 | FID (↓) | Div. (→) | MM (→) | MM-dist (↓) | MR-Precision (↑) top 1 / 3 / 5
Real motion | Language | 0.375±.005 / 0.545±.007 / 0.679±.008 | 0.061±.153 | 12.11±.075 | - | 3.786±.057 | - / - / -
Real motion | HamNoSys | 0.455±.002 / 0.689±.006 / 0.795±.004 | 0.007±.062 | 8.754±.028 | - | 2.113±.023 | - / - / -
Real motion | Word-300 | 0.499±.003 / 0.811±.002 / 0.865±.003 | 0.006±.054 | 8.656±.035 | - | 1.855±.019 | - / - / -
Holistic | Language | 0.265±.007 / 0.413±.008 / 0.531±.0059 | 4.359±.389 | 12.35±.101 | 3.451±.107 | 4.851±.067 | 0.356±.007 / 0.525±.007 / 0.645±.009
Holistic | HamNoSys | 0.429±.004 / 0.657±.005 / 0.756±.002 | 0.884±.035 | 9.451±.087 | 0.941±.056 | 2.651±.027 | 0.552±.002 / 0.745±.010 / 0.813±.034
Holistic | Word-300 | 0.475±.002 / 0.731±.003 / 0.815±.005 | 0.756±.021 | 8.956±.091 | 0.815±.059 | 2.101±.024 | 0.615±.005 / 0.797±.006 / 0.875±.002
Gesture | Language | 0.245±.008 / 0.405±.009 / 0.519±.010 | 3.951±.315 | 10.12±.121 | 3.112±.135 | 5.015±.089 | 0.375±.011 / 0.535±.003 / 0.668±.004
Gesture | HamNoSys | 0.435±.005 / 0.649±.004 / 0.745±.006 | 0.851±.033 | 8.944±.097 | 0.913±.036 | 2.876±.015 | 0.581±.004 / 0.736±.006 / 0.825±.008
Gesture | Word-300 | 0.465±.001 / 0.711±.003 / 0.818±.003 | 0.715±.016 | 8.235±.055 | 0.801±.021 | 2.339±.027 | 0.593±.006 / 0.814±.005 / 0.901±.006

Table 6: Quantitative evaluation results for 3D holistic SL motion generation. Real motion denotes motions sampled from the original holistic motion annotations in the dataset. Holistic represents the generation results regarding holistic motion. Gesture stands for the evaluation conducted on the two arms. Div. refers to Diversity.

Evaluating the arms alone yields slight improvements in FID and MR-Precision. However,
it also degrades the text-motion consistency (R-Precision and MM-dist) due to the absence of body-relative hand position.

(R-Precision and MM-dist) due to the absence of body-relative hand position.
Due to the lack of works that can generate 3D holistic SL motion with mesh
representation from any of the linguistic sources (e.g. spoken language, Ham-
NoSys, gloss, ...), we modify the latest HamNoSys-based SLP work, Ham2Pose [4]
(Ham2Pose-3d in Tab. 4), as well as MDM [53] (corresponding to SignDif-
fuse in Tab. 5), to take our linguistic feature as input and to output SMPL-X
representations and evaluate on our dataset. We then train our SignVAE and
Ham2Pose-3d along with the original Ham2Pose on their official split and use
DTW-MJE for evaluation. Specifically, we also regress the keypoints from our
holistic representation M1:T to align with the Ham2Pose 2D skeleton. As dis-
covered in this benchmark, leveraging our SignAvatars dataset can easily enable
more 3D approaches and significantly improve the existing SLP applications by
simple adaptation compared to the original Ham2Pose. The results in Tab. 4
are reported on the HamNoSys holistic set for comparison. While our method
drastically improves over the baseline, the result is far from ideal, motivating
the need for better models for this new task.

Table 4: Comparison with state-of-the-art SLP methods on the HamNoSys holistic subset. * represents using only 2D cues.

Method | DTW-MJE Rank (↑) top 1 / 3 / 5
Ham2Pose* | 0.092±.031 / 0.197±.029 / 0.354±.032
Ham2Pose-3d | 0.253±.036 / 0.369±.039 / 0.511±.035
SignVAE (Ours) | 0.516±.039 / 0.694±.041 / 0.786±.035

Table 5: Quantitative ablation study of SignVAE on the HamNoSys holistic subset, compared with prior arts.

Method | R-Precision (↑) top 1 / 3 / 5 | MM-dist (↓)
Ham2Pose-3d | 0.291±.006 / 0.386±.005 / 0.535±.005 | 3.875±.086
SignDiffuse | 0.285±.003 / 0.415±.005 / 0.654±.003 | 3.866±.054
SignVAE (Base) | 0.385±.008 / 0.613±.009 / 0.745±.007 | 3.056±.108
SignVAE (Ours) | 0.429±.009 / 0.657±.008 / 0.756±.008 | 2.651±.119

Fig. 8 shows qualitative results from continuous 3D holistic body motion gen-
eration. As observed, our method can generate plausible and accurate holistic 3D
motion from a variety of prompts while exhibiting some diversity that enriches the
production results. We provide further examples in our supplementary material.
Ablation on PLFG. In order to study our unique text-sign cross-modeling
module, we introduce a baseline, SignVAE (Base), replacing the PLFG with a
canonical pre-trained CLIP feature as input to the encoder.

[Figure 8: qualitative samples. Example inputs include spoken-language sentences, HamNoSys strings, and the words "league" and "notice", each paired with the generated motion and the corresponding ground-truth video.]
Fig. 8: Qualitative results of 3D holistic SLP from different prompts (left row: spoken
language, top right: HamNoSys, bottom right: word). Within each sample, the first two
rows are the input prompts and the generated results. The last row is the corresponding
video clip from our dataset.

As shown in Tab. 5, our joint scheme utilizing the PLFG significantly improves the prompt-motion
consistency, as reflected by R-Precision and MM-dist. Moreover, our
VQVAE backbone, which quantizes the motion representation into a motion codebook,
enables interaction with the linguistic feature codebook. This leads to significant
improvements in prompt-motion correspondence, outperforming the other baselines
built with our linguistic feature generator (SignDiffuse, Ham2Pose-3d) and
generating more text-motion consistent results.

6 Conclusion
We introduced SignAvatars, the first large-scale 3D holistic SL motion dataset
with expressive 3D human and hand mesh annotations, provided by our auto-
matic annotation pipeline. SignAvatars enables a variety of application potentials
for Deaf and hard-of-hearing communities. Built upon our dataset, we proposed
the first 3D sign language production approach to generate natural holistic mesh
motion sequences from SL prompts. We also introduced the first benchmark re-
sults for this new task, continuous and co-articulated 3D holistic SL motion
production from diverse SL prompts. Our evaluations on this benchmark clearly
showed the advantage of our new VQVAE-based SignVAE model over the baselines we develop.
Limitations and future work. Having the first benchmark at hand opens
up a sea of potential in-depth investigations of other 3D techniques for 3D SL
motion generation. Especially, the lack of a sophisticated and generic 3D back-
translation method may prevent our evaluations from fully showcasing the su-
periority of the proposed method. We leave this for a future study. Moreover,
combining 3D SLT and SLP to formulate a multi-modal generic SL framework
will be one of the future works. Developing a large 3D sign language motion
model with more properties and applications in AR/VR will significantly ben-
efit the Deaf and hard-of-hearing people around the world, as well as countless
hearing individuals interacting with them. As such, we invite the research com-
munity to develop even stronger baselines.

References
1. Albanie, S., Varol, G., Momeni, L., Afouras, T., Chung, J.S., Fox, N., Zisserman,
A.: BSL-1K: Scaling up co-articulated sign language recognition using mouthing
cues. In: European Conference on Computer Vision (2020) 2, 4, 26
2. Albanie, S., Varol, G., Momeni, L., Bull, H., Afouras, T., Chowdhury, H., Fox, N.,
Woll, B., Cooper, R., McParland, A., Zisserman, A.: BOBSL: BBC-Oxford British
Sign Language Dataset. https://www.robots.ox.ac.uk/~vgg/data/bobsl (2021)
2, 4, 26
3. Aliwy, A.H., Ahmed, A.A.: Development of arabic sign language dictionary using
3d avatar technologies. Indonesian Journal of Electrical Engineering and Computer
Science 21(1), 609–616 (2021) 2, 25
4. Arkushin, R.S., Moryossef, A., Fried, O.: Ham2pose: Animating sign language no-
tation into pose sequences. In: Proceedings of the IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition. pp. 21046–21056 (2023) 4, 12, 13, 26
5. Bangham, J.A., Cox, S., Elliott, R., Glauert, J.R., Marshall, I., Rankov, S., Wells,
M.: Virtual signing: Capture, animation, storage and transmission-an overview
of the visicast project. In: IEE Seminar on speech and language processing for
disabled and elderly people (Ref. No. 2000/025). pp. 6–1. IET (2000) 4, 26
6. Blaisel, X.: David f. armstrong, william c. stokoe, sherman e. wilcox, gesture and
the nature of language, cambridge, cambridge university press, 1995, x+ 260 p.,
bibliogr., index. Anthropologie et Sociétés 21(1), 135–137 (1997) 5
7. Camgoz, N.C., Hadfield, S., Koller, O., Ney, H., Bowden, R.: Neural sign language
translation. In: Proceedings of the IEEE conference on computer vision and pattern
recognition. pp. 7784–7793 (2018) 2, 4, 5, 6, 22, 26
8. Davis, A.C., Hoffman, H.J.: Hearing loss: rising prevalence and impact. Bulletin of
the World Health Organization 97(10), 646 (2019) 1
9. Duarte, A., Palaskar, S., Ventura, L., Ghadiyaram, D., DeHaan, K., Metze, F.,
Torres, J., Giro-i Nieto, X.: How2sign: a large-scale multimodal dataset for con-
tinuous american sign language. In: Proceedings of the IEEE/CVF conference on
computer vision and pattern recognition. pp. 2735–2744 (2021) 2, 4, 5, 26
10. Ebling, S., Glauert, J.: Building a swiss german sign language avatar with jasigning
and evaluating it among the deaf community. Universal Access in the Information
Society 15, 577–587 (2016) 4, 26
11. Efthimiou, E., Fotinea, S.E., Hanke, T., Glauert, J., Bowden, R., Braffort, A., Col-
let, C., Maragos, P., Goudenove, F.: Dicta-sign: sign language recognition, genera-
tion and modelling with application in deaf communication. In: sign-lang@ LREC
2010. pp. 80–83. European Language Resources Association (ELRA) (2010) 4, 26
12. Fan, Z., Taheri, O., Tzionas, D., Kocabas, M., Kaufmann, M., Black, M.J., Hilliges,
O.: ARCTIC: A dataset for dexterous bimanual hand-object manipulation. In:
IEEE Conference on Computer Vision and Pattern Recognition (2023) 4
13. Feng, Y., Choutas, V., Bolkart, T., Tzionas, D., Black, M.J.: Collaborative regres-
sion of expressive bodies using moderation. In: International Conference on 3D
Vision (3DV) (2021) 10, 11
14. Forte, M.P., Kulits, P., Huang, C.H.P., Choutas, V., Tzionas, D., Kuchenbecker,
K.J., Black, M.J.: Reconstructing signing avatars from video using linguistic priors.
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition. pp. 12791–12801 (2023) 2, 4, 5, 26
15. Gibet, S., Lefebvre-Albaret, F., Hamon, L., Brun, R., Turki, A.: Interactive editing
in french sign language dedicated to virtual signers: Requirements and challenges.
Universal Access in the Information Society 15, 525–539 (2016) 4, 26

16. Guo, C., Zou, S., Zuo, X., Wang, S., Ji, W., Li, X., Cheng, L.: Generating di-
verse and natural 3d human motions from text. In: Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5152–5161
(June 2022) 22
17. Hanke, T., Schulder, M., Konrad, R., Jahn, E.: Extending the public dgs corpus
in size and depth. In: sign-lang@ LREC 2020. pp. 75–82. European Language
Resources Association (ELRA) (2020) 2, 4, 5, 26
18. Hasson, Y., Varol, G., Tzionas, D., Kalevatykh, I., Black, M.J., Laptev, I., Schmid,
C.: Learning joint reconstruction of hands and manipulated objects. In: Proceed-
ings of the IEEE/CVF conference on computer vision and pattern recognition. pp.
11807–11816 (2019) 4
19. Hu, H., Zhao, W., Zhou, W., Li, H.: Signbert+: Hand-model-aware self-supervised
pre-training for sign language understanding. IEEE Transactions on Pattern Anal-
ysis and Machine Intelligence (2023) 4
20. Huang, J., Zhou, W., Zhang, Q., Li, H., Li, W.: Video-based sign language recog-
nition without temporal segmentation. In: Proceedings of the AAAI Conference on
Artificial Intelligence. vol. 32 (2018) 2, 4, 5, 26
21. Huang, W., Pan, W., Zhao, Z., Tian, Q.: Towards fast and high-quality sign lan-
guage production. In: Proceedings of the 29th ACM International Conference on
Multimedia. pp. 3172–3181 (2021) 12
22. Jin, S., Xu, L., Xu, J., Wang, C., Liu, W., Qian, C., Ouyang, W., Luo, P.: Whole-
body human pose estimation in the wild. In: Proceedings of the European Confer-
ence on Computer Vision (ECCV) (2020) 8
23. Joo, H., Simon, T., Sheikh, Y.: Total capture: A 3d deformation model for tracking
faces, hands, and bodies. In: Proceedings of the IEEE conference on computer
vision and pattern recognition. pp. 8320–8329 (2018) 4, 25
24. Joze, H.R.V., Koller, O.: Ms-asl: A large-scale data set and benchmark for under-
standing american sign language. arXiv preprint arXiv:1812.01053 (2018) 5
25. Kartynnik, Y., Ablavatski, A., Grishchenko, I., Grundmann, M.: Real-time fa-
cial surface geometry from monocular video on mobile gpus. arXiv preprint
arXiv:1907.06724 (2019) 6, 8
26. Kocabas, M., Huang, C.H.P., Hilliges, O., Black, M.J.: Pare: Part attention regres-
sor for 3d human body estimation. In: Proceedings of the IEEE/CVF International
Conference on Computer Vision. pp. 11127–11137 (2021) 8
27. Kratimenos, A., Pavlakos, G., Maragos, P.: Independent sign language recognition
with 3d body, hands, and face reconstruction. In: ICASSP 2021-2021 IEEE Inter-
national Conference on Acoustics, Speech and Signal Processing (ICASSP). pp.
4270–4274. IEEE (2021) 4, 26
28. Kruskal, J.B.: An overview of sequence comparison: Time warps, string edits, and
macromolecules. SIAM review 25(2), 201–237 (1983) 12
29. Lee, T., Oh, Y., Lee, K.M.: Human part-wise 3d motion context learning for sign
language recognition. In: Proceedings of the IEEE/CVF International Conference
on Computer Vision. pp. 20740–20750 (2023) 2, 12, 23, 24, 25
30. Li, D., Rodriguez, C., Yu, X., Li, H.: Word-level deep sign language recognition
from video: A new large-scale dataset and methods comparison. In: Proceedings
of the IEEE/CVF winter conference on applications of computer vision. pp. 1459–
1469 (2020) 5, 6, 26
31. Lin, J., Zeng, A., Lu, S., Cai, Y., Zhang, R., Wang, H., Zhang, L.: Motionx: A large-
scale 3d expressive whole-body human motion dataset. In: Advances in Neural
Information Processing Systems (2023) 10, 11

32. Lin, J., Zeng, A., Wang, H., Zhang, L., Li, Y.: One-stage 3d whole-body mesh
recovery with component aware transformer. In: Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition (2023) 4, 8, 10, 11, 26
33. Linde-Usiekniewicz, J., Czajkowska-Kisil, M., Łacheta, J., Rutkowski, P.: A corpus-
based dictionary of polish sign language (pjm) 4, 26
34. Linde-Usiekniewicz, J., Czajkowska-Kisil, M., Łacheta, J., Rutkowski, P.: A corpus-
based dictionary of polish sign language (pjm). In: Proceedings of the XVI EU-
RALEX International Congress: The user in focus. pp. 365–376 (2014) 6
35. Matthes, S., Hanke, T., Regen, A., Storz, J., Worseck, S., Efthimiou, E., Dimou,
A.L., Braffort, A., Glauert, J., Safar, E.: Dicta-sign-building a multilingual sign
language corpus. In: 5th Workshop on the Representation and Processing of Sign
Languages: Interactions between Corpus and Lexicon. Satellite Workshop to the
eighth International Conference on Language Resources and Evaluation (LREC-
2012) (2012) 6
36. Moon, G., Choi, H., Lee, K.M.: Accurate 3d hand pose estimation for whole-body
3d human mesh estimation. In: Computer Vision and Pattern Recognition Work-
shop (CVPRW) (2022) 10
37. Naert, L., Larboulette, C., Gibet, S.: A survey on the animation of signing avatars:
From sign representation to utterance synthesis. Computers & Graphics 92, 76–98
(2020) 2, 25
38. Nocedal, J., Wright, S.J.: Numerical optimization. Springer (1999) 23
39. Pavlakos, G., Choutas, V., Ghorbani, N., Bolkart, T., Osman, A.A.A., Tzionas,
D., Black, M.J.: Expressive body capture: 3d hands, face, and body from a single
image. In: Proceedings IEEE Conf. on Computer Vision and Pattern Recognition
(CVPR) (2019) 2, 3, 4, 5, 6, 10, 11, 20, 25, 26
40. Prillwitz, S., Hanke, T., König, S., Konrad, R., Langer, G., Schwarz, A.: Dgs cor-
pus project–development of a corpus based electronic dictionary german sign lan-
guage/german. In: sign-lang@ LREC 2008. pp. 159–164. European Language Re-
sources Association (ELRA) (2008) 6, 26
41. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G.,
Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from
natural language supervision. In: International conference on machine learning. pp.
8748–8763. PMLR (2021) 9
42. Renz, K., Stache, N.C., Fox, N., Varol, G., Albanie, S.: Sign segmentation with
changepoint-modulated pseudo-labelling. In: Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition. pp. 3403–3412 (2021) 4,
26
43. Romero, J., Tzionas, D., Black, M.J.: Embodied hands: Modeling and capturing
hands and bodies together. arXiv preprint arXiv:2201.02610 (2022) 3, 5
44. Rong, Y., Shiratori, T., Joo, H.: Frankmocap: A monocular 3d whole-body pose es-
timation system via regression and integration. In: IEEE International Conference
on Computer Vision Workshops (2021) 10
45. Sanabria, R., Caglayan, O., Palaskar, S., Elliott, D., Barrault, L., Specia, L.,
Metze, F.: How2: a large-scale dataset for multimodal language understanding.
arXiv preprint arXiv:1811.00347 (2018) 5
46. Saunders, B., Camgoz, N.C., Bowden, R.: Adversarial training for multi-channel
sign language production. arXiv preprint arXiv:2008.12405 (2020) 12, 22
47. Saunders, B., Camgoz, N.C., Bowden, R.: Progressive transformers for end-to-
end sign language production. In: Computer Vision–ECCV 2020: 16th European
Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI 16. pp. 687–
705. Springer (2020) 4, 12, 21, 22, 26, 29

48. Saunders, B., Camgoz, N.C., Bowden, R.: Continuous 3d multi-channel sign lan-
guage production via progressive transformers and mixture density networks. In-
ternational journal of computer vision 129(7), 2113–2135 (2021) 22
49. Saunders, B., Camgoz, N.C., Bowden, R.: Mixed signals: Sign language production
via a mixture of motion primitives. In: Proceedings of the IEEE/CVF International
Conference on Computer Vision. pp. 1919–1929 (2021) 4, 12, 22, 26
50. Schembri, A., Fenlon, J., Rentelis, R., Reynolds, S., Cormier, K.: Building the
british sign language corpus. LaNguagE DocumENtatIoN & coNSErvatIoN (2013)
5
51. Sincan, O.M., Keles, H.Y.: Autsl: A large scale multi-modal turkish sign language
dataset and baseline methods. IEEE Access 8, 181340–181355 (2020) 5
52. Spurr, A., Iqbal, U., Molchanov, P., Hilliges, O., Kautz, J.: Weakly supervised 3d
hand pose estimation via biomechanical constraints. In: European conference on
computer vision. pp. 211–228. Springer (2020) 6, 8
53. Tevet, G., Raab, S., Gordon, B., Shafir, Y., Cohen-Or, D., Bermano, A.H.: Human
motion diffusion model. arXiv preprint arXiv:2209.14916 (2022) 12, 13, 24
54. The World Federation, o.t.D.: Who we are, http://wfdeaf.org/who-we-are/ 1
55. Theodorakis, S., Pitsikalis, V., Maragos, P.: Dynamic–static unsupervised sequen-
tiality, statistical subunits and lexicon for sign language recognition. Image and
Vision Computing 32(8), 533–549 (2014) 4, 26
56. Van Den Oord, A., Vinyals, O., et al.: Neural discrete representation learning.
Advances in neural information processing systems 30 (2017) 3
57. Von Agris, U., Knorr, M., Kraiss, K.F.: The significance of facial features for au-
tomatic sign language recognition. In: 2008 8th IEEE international conference on
automatic face & gesture recognition. pp. 1–6. IEEE (2008) 5
58. Xu, Y., Zhang, J., Zhang, Q., Tao, D.: Vitpose: Simple vision transformer baselines
for human pose estimation. Advances in Neural Information Processing Systems
35, 38571–38584 (2022) 6, 8
59. Yi, H., Liang, H., Liu, Y., Cao, Q., Wen, Y., Bolkart, T., Tao, D., Black, M.J.: Gen-
erating holistic 3d human motion from speech. In: Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition (2023) 2, 4, 25
60. Yu, Z., Huang, S., Chen, F., Breckon, T.P., Wang, J.: Acr: Attention collaboration-
based regressor for arbitrary two-hand reconstruction. In: Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
(June 2023) 8
61. Zhang, H., Tian, Y., Zhang, Y., Li, M., An, L., Sun, Z., Liu, Y.: Pymaf-x: Towards
well-aligned full-body model regression from monocular images. IEEE Transactions
on Pattern Analysis and Machine Intelligence (2023) 11
62. Zhang, J., Zhang, Y., Cun, X., Huang, S., Zhang, Y., Zhao, H., Lu, H., Shen,
X.: T2m-gpt: Generating human motion from textual descriptions with discrete
representations. In: Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition (CVPR) (2023) 12, 24
63. Zhu, H., Zuo, X., Wang, S., Cao, X., Yang, R.: Detailed human shape estima-
tion from a single image by hierarchical mesh deformation. In: Proceedings of the
IEEE/CVF conference on computer vision and pattern recognition. pp. 4491–4500
(2019) 5
64. Zwitserlood, I., Verlinden, M., Ros, J., Van Der Schoot, S., Netherlands, T.: Syn-
thetic signing for the deaf: Esign. In: Proceedings of the conference and workshop
on assistive technologies for vision and hearing impairment (CVHI) (2004) 4, 26
A Additional Visualizations of SignAvatars Dataset

Fig. 9: More sentence-level spoken-language examples of SignAvatars (ASL subset). The annotations cover signers with different body shapes, demonstrating accurate body and hand estimation.

In this section, we present more samples and visualizations of our SignAvatars dataset for each of the subsets, categorized by annotation type: spoken language (sentence-level), HamNoSys, and word-level prompt annotation.

A.1 Qualitative Analysis of SignAvatars Dataset

Fig. 10: More HamNoSys-level examples of SignAvatars (HamNoSys subset). The annotations cover signers with different body shapes, demonstrating accurate body and hand estimation.

We provide further details of our SignAvatars dataset and present more visualizations of our data in Figs. 9 to 11. Being the first large-scale, multi-prompt 3D sign language (SL) motion dataset with accurate holistic mesh representations, our dataset enables various tasks such as 3D sign language recognition (SLR) and the novel 3D SL production (SLP) from diverse inputs like text scripts, individual words, and HamNoSys notation. We also provide a demo video in the supplementary materials and on our anonymous project page: https://anonymoususer4ai.github.io/.

A.2 More generation samples from SignVAE


We now share snapshot examples produced by our SignVAE, demonstrating its application potential for 3D sign language production, in the demo video on our project page.

B Analysis of annotation pipeline


Fig. 11: More word-level examples of SignAvatars (word subset). The annotations cover signers with different body shapes, demonstrating accurate body and hand estimation.

In this section, we provide further analysis of our annotation pipeline. Since no benchmark for SL reconstruction exists yet, and our method is not limited to SL videos, we provide additional in-the-wild examples annotated with our pipeline in Fig. 13 to demonstrate its reconstruction ability. Moreover, Fig. 12 presents further qualitative comparisons with state-of-the-art methods on the EHF dataset [39], where our method yields significantly better pixel alignment, as well as more natural and plausible hand poses. In addition, the biomechanical constraints serve as a prior for eliminating implausible poses, which occur frequently in complex interacting-hands scenarios for other monocular capture methods, as shown in Fig. 14.
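To make this prior concrete, below is a minimal sketch of how such a biomechanical regularizer can be expressed as a soft penalty on per-joint rotation limits. The function name, tensor layout, and limit values are illustrative assumptions and not necessarily the exact term used in our pipeline.

```python
import torch

def joint_limit_penalty(joint_angles, lower, upper):
    """Soft biomechanical prior: zero inside the valid range of each hand
    joint rotation and quadratic outside it.

    joint_angles: (B, J, 3) per-joint rotations (e.g., axis-angle) per frame.
    lower, upper: (J, 3) per-DoF anatomical limits (illustrative values).
    """
    below = torch.clamp(lower - joint_angles, min=0.0)   # violation below the limit
    above = torch.clamp(joint_angles - upper, min=0.0)   # violation above the limit
    return (below ** 2 + above ** 2).sum(dim=(1, 2)).mean()

# Usage with hypothetical limits (in radians) for 15 MANO hand joints:
J = 15
lower = torch.full((J, 3), -0.5)
upper = torch.full((J, 3), 1.6)
hand_pose = torch.zeros(8, J, 3, requires_grad=True)     # a batch of 8 frames
loss_bio = joint_limit_penalty(hand_pose, lower, upper)
loss_bio.backward()
```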

C Evaluation of SignVAE generation on other benchmarks

In this section, we conduct further experiments with our SignVAE on 3D SLP from spoken language on other benchmarks to further showcase its ability. To the best of our knowledge, no publicly available benchmark for 3D mesh- and motion-based SLP exists. Progressive Transformer [47] and its continuation
series [46, 48, 49] on the RWTH-PHOENIX-Weather 2014T dataset [7] provide a keypoint-based 3D Text2Pose (Language2Motion) benchmark; unfortunately, at the time of submission, this benchmark was not publicly available. Note that back-translation evaluations as in [47] must strictly use the same back-translation model checkpoint for a fair comparison. The same holds in the human motion generation area, where all evaluations should be conducted with the same evaluation checkpoints, as in the popular HumanML3D benchmark [16]. Unfortunately, the pretrained evaluation model checkpoint (and any reproduction of it) is available neither on the project website https://github.com/BenSaunders27/ProgressiveTransformersSLP (with an open issue) nor elsewhere, and we have not managed to get in touch with the corresponding authors. For this reason, we have re-evaluated the benchmark method in [47] as follows:

Experimental Details. To conduct evaluations on Phoenix-2014T using the Progressive Transformer (PT) [47], we trained both our network and PT on this dataset and recorded new results under our metrics. We conduct the re-evaluation experiments as follows:

– First, we generate mesh annotations for the Phoenix-2014T dataset and add them as our GSL subset. We follow the original data distribution and official split to train our network.
– Second, because the generation model checkpoints are also missing (in addition to the evaluation model), we re-train PT using the official implementation on both the 3D-lifted OpenPose keypoints J_PT and the 3D keypoints J_ours regressed from our mesh representation, corresponding to PT (J_PT) and PT (J_ours).
– Third, we train two 3D keypoint-based SL motion evaluation models on this subset with J_PT and J_ours, resulting in two model checkpoints C_PT and C_ours.

Comparisons. We conduct both quantitative and qualitative comparisons between PT and our method, following the official split, with both C_PT and C_ours in Tab. 7 under the evaluation metrics introduced in Sec. 5 and Appendix D.1. As shown in Tab. 7, our method significantly outperforms PT, especially regarding R-precision and MR-precision, which indicates better prompt-motion consistency. Moreover, the Real Motion rows show that the evaluation model C_ours, which uses the 3D keypoints J_ours regressed from our mesh representation, provides substantially better matching accuracy with less noise (lower MM-dist) than the noisy, canonical 3D-lifted OpenPose keypoints J_PT, and therefore yields better performance than C_PT. A carefully designed evaluation model trained on proper data thus reflects the authentic performance of the experiments more faithfully and is less likely to distort the analysis, unlike the results obtained with C_PT.
Table 7: Quantitative comparison on the Phoenix-2014T dataset, where Real Motion and Ours are evaluated by extracting the 3D keypoints from our mesh representation. J_PT and J_ours in brackets indicate which keypoints the model was trained on.

Eval. Model | Method      | R-Precision (↑) top 1 / top 3 / top 5 | FID (↓)    | MM-dist (↓) | MR-Precision (↑) top 1 / top 3 / top 5
C_PT        | Real Motion | 0.193±.006 / 0.299±.002 / 0.413±.005  | 0.075±.066 | 5.151±.033  | - / - / -
C_PT        | PT (J_PT)   | 0.035±.009 / 0.082±.005 / 0.195±.004  | 4.855±.062 | 7.977±.023  | 0.088±.012 / 0.145±.012 / 0.212±.019
C_PT        | PT (J_ours) | 0.078±.004 / 0.149±.002 / 0.267±.003  | 5.135±.024 | 8.135±.019  | 0.138±.009 / 0.195±.023 / 0.311±.011
C_PT        | Ours        | 0.165±.006 / 0.275±.009 / 0.356±.003  | 4.194±.037 | 4.899±.029  | 0.219±.017 / 0.325±.015 / 0.443±.056
C_ours      | Real Motion | 0.425±.004 / 0.635±.006 / 0.733±.009  | 0.015±.059 | 2.413±.051  | - / - / -
C_ours      | PT (J_PT)   | 0.095±.004 / 0.155±.005 / 0.286±.002  | 3.561±.035 | 4.565±.027  | 0.175±.002 / 0.301±.010 / 0.419±.034
C_ours      | PT (J_ours) | 0.134±.002 / 0.285±.003 / 0.395±.005  | 3.157±.021 | 3.977±.024  | 0.216±.005 / 0.363±.006 / 0.489±.002
C_ours      | Ours        | 0.389±.006 / 0.575±.009 / 0.692±.005  | 1.335±.003 | 2.856±.009  | 0.497±.006 / 0.691±.004 / 0.753±.015

Furthermore, we also present qualitative comparison results in Fig. 15. Please see more visualizations in our supplementary video and on our project page.
Discussion. With SignAvatars, our goal is to provide an up-to-date, publicly available 3D holistic mesh motion-based SLP benchmark, and we invite the community to participate. As an alternative to the re-evaluation above, one could also develop a brand-new 3D sign language translation (SLT) method to re-evaluate PT and compare it with our method on BLEU and ROUGE. As part of our future work on SL understanding, we also encourage the SL community to develop back-translation and mesh-based SLT methods trained with our benchmark. We believe that the 3D holistic mesh representation brings significant improvements for accurately understanding the SL-motion correlation compared to purely 2D methods, as shown in Tab. 4 and Tab. 5 of the main paper, and as also confirmed by a recent 3D SLT work [29].

D Implementation details for experiments and evaluation

Optimization strategy of the automatic annotation pipeline. During optimization, we utilize an iterative five-stage fitting procedure to minimize the objective function, using the Adam optimizer with a learning rate of 1e-2. A good initialization significantly boosts the fitting speed of our annotation pipeline, and a well pixel-aligned body pose also helps the reconstruction of hand meshes. Motivated by this, we apply 2000 fitting steps per clip and split them into five stages of 400 steps each to formulate our iterative fitting pipeline. In addition, Limited-memory BFGS [38] with a strong Wolfe line search is applied in our optimization. In the first three stages, all losses and parameters are optimized together, with weights w_body = w_hand = 1 applied to L_J to obtain a good body pose estimation. In the last two stages, we first extract a mean body shape from the record of the previous optimization stages to obtain a stable body shape and freeze it, as the signer does not change within a video by default. Subsequently, to obtain accurate and detailed hand meshes, we enlarge w_hand to 2 to reach the final holistic mesh reconstruction with natural and accurate hand poses.
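For clarity, the following is a minimal sketch of this five-stage schedule. The parameter container and the loss callback (compute_losses) are hypothetical placeholders; only the Adam learning rate of 1e-2, the 5 x 400-step split, the shape freezing to the running mean in the last two stages, and the increase of w_hand to 2 follow the description above (the L-BFGS refinement with strong Wolfe line search is omitted here).

```python
import torch

def fit_clip(smplx_params, compute_losses, n_stages=5, steps_per_stage=400):
    """Sketch of the iterative five-stage fitting loop (hypothetical helpers).

    smplx_params:   dict of tensors with requires_grad=True, e.g.
                    {"body_pose": ..., "hand_pose": ..., "betas": ...}.
    compute_losses: callback returning a dict of loss tensors, weighted by
                    w_body / w_hand for the keypoint term L_J.
    """
    optim = torch.optim.Adam(smplx_params.values(), lr=1e-2)
    shape_history = []

    for stage in range(n_stages):
        if stage == 3:
            # Last two stages: freeze the body shape to its running mean,
            # since the signer does not change within a clip.
            mean_shape = torch.stack(shape_history).mean(dim=0)
            smplx_params["betas"].data.copy_(mean_shape)
            smplx_params["betas"].requires_grad_(False)
        w_hand = 2.0 if stage >= 3 else 1.0   # emphasise hands at the end

        for _ in range(steps_per_stage):
            optim.zero_grad(set_to_none=True)
            losses = compute_losses(smplx_params, w_body=1.0, w_hand=w_hand)
            total = sum(losses.values())
            total.backward()
            optim.step()
            if stage < 3:
                shape_history.append(smplx_params["betas"].detach().clone())

    return smplx_params
```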
D.1 Evaluation Protocols

In this subsection, we elaborate on the computational details of the evaluation protocol we use. Our evaluation relies on a text-motion embedding model, following prior art [29, 53, 62]. For simplicity, we use the same symbols and notations as in Sec. 3 and Sec. 4 of the main paper. Through the GRU embedding layer, we embed our motion representation M_{1:T} and linguistic feature E^l_{1:s} into f_m ∈ R^d and f_l ∈ R^d with the same dimension, to which a contrastive loss is applied to minimize the feature distances; d = 512 is used in our experiments (a minimal sketch of this embedding model is given after the metric list below). After motion and prompt feature extraction, we compute each of the evaluation metrics, which are summarized below:
– Frechet Inception Distance (FID) (↓): the distributional distance between the generated motion and the corresponding real motion, based on the extracted motion features.
– Diversity: the average Euclidean distance between the motion features of N_D = 300 randomly sampled motion pairs.
– R-precision (↑): the average accuracy at top-k positions of the sorted Euclidean distances between the motion embedding and each GT prompt embedding.
– Multimodality: the average Euclidean distance between the motion features of N_m = 10 pairs of motions generated from the same single input prompt.
– Multimodal Distance (MM-Dist) (↓): the average Euclidean distance between each generated motion feature and its input prompt feature.
– MR-precision (↑): the average accuracy at top-k positions of the sorted Euclidean distances between a generated motion feature and 16 motion samples from the dataset (1 positive + 15 negative).
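Below is the minimal sketch of the embedding model referred to above. The GRU layer sizes and the hinge-style contrastive loss are illustrative assumptions; only the shared embedding dimension d = 512 follows the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextMotionEmbedder(nn.Module):
    """Sketch of the GRU-based motion/prompt embedding used for evaluation."""

    def __init__(self, motion_dim, text_dim, d=512):
        super().__init__()
        self.motion_gru = nn.GRU(motion_dim, d, batch_first=True)
        self.text_gru = nn.GRU(text_dim, d, batch_first=True)

    def forward(self, motion, text):
        # motion: (B, T, motion_dim), text: (B, S, text_dim)
        _, h_m = self.motion_gru(motion)          # final hidden states
        _, h_t = self.text_gru(text)
        f_m = F.normalize(h_m[-1], dim=-1)        # (B, d) motion feature
        f_l = F.normalize(h_t[-1], dim=-1)        # (B, d) prompt feature
        return f_m, f_l

def contrastive_loss(f_m, f_l, margin=0.2):
    """Hinge loss pulling matched motion/prompt features together (assumed form, batch > 1)."""
    sim = f_m @ f_l.t()                           # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)                 # similarity of matched pairs
    hinge = torch.clamp(margin + sim - pos, min=0.0)
    off_diag = ~torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    return hinge[off_diag].mean()
```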
We now provide further details on each of these metrics. For simplicity, we denote the dataset length as N below.
Frechet Inception Distance (FID) is used to evaluate the distributional distance between the generated motions and the corresponding real motions:

FID = \|\mu _{gt} - \mu _{pred}\|_{2}^{2} + Tr(C_{gt} + C_{pred} - 2(C_{gt}C_{pred})^{1/2}) (7)

where µ_gt and µ_pred are the means of the real-motion and generated-motion features, respectively, C denotes a covariance matrix, and Tr denotes the trace of a matrix.
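As a reference for Eq. (7), a NumPy/SciPy sketch over two feature matrices (one row per motion) could look as follows; the function and variable names are ours.

```python
import numpy as np
from scipy import linalg

def fid(feats_gt, feats_pred):
    """FID between real and generated motion features, each of shape (N, d)."""
    mu_gt, mu_pred = feats_gt.mean(axis=0), feats_pred.mean(axis=0)
    c_gt = np.cov(feats_gt, rowvar=False)
    c_pred = np.cov(feats_pred, rowvar=False)
    covmean = linalg.sqrtm(c_gt @ c_pred)        # matrix square root
    if np.iscomplexobj(covmean):                 # drop tiny imaginary residue
        covmean = covmean.real
    diff = mu_gt - mu_pred
    return float(diff @ diff + np.trace(c_gt + c_pred - 2.0 * covmean))
```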
Diversity is used for evaluating the variance of the generated SL motions. Specifically, we randomly sample N_D = 300 motion feature pairs {f_m^i, f_m^{i'}} and compute the average Euclidean distance between them:

Diversity = \frac {1}{N_{D}} \sum _{i}^{N_{D}}\| f_{m}^{i} - f_{m}^{i'} \| (8)

Multimodality is leveraged to measure the diversity of the SL motions generated from the same prompt. Specifically, we compute the average Euclidean distance between the extracted motion features of N_m = 10 pairs {f_m^{ij}, f_m^{ij'}} of motions generated from the same single input prompt. Over the full dataset, it can
be written as:

Multimodality = \frac {1}{N \cdot N_{m}}\sum _{i}^{N}\sum _{j}^{N_{m}}\| f_{m}^{ij} - f_{m}^{ij'} \| (9)
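A sketch of Eqs. (8) and (9) on extracted features is given below; it assumes the features are NumPy arrays, that at least N_D motions are available for sampling, and that 2*N_m generations per prompt are stored.

```python
import numpy as np

def diversity(motion_feats, n_pairs=300, seed=0):
    """Eq. (8): mean distance between randomly sampled motion feature pairs."""
    rng = np.random.default_rng(seed)
    idx_a = rng.choice(len(motion_feats), n_pairs, replace=False)
    idx_b = rng.choice(len(motion_feats), n_pairs, replace=False)
    return float(np.linalg.norm(motion_feats[idx_a] - motion_feats[idx_b], axis=1).mean())

def multimodality(per_prompt_feats, n_pairs=10):
    """Eq. (9): per_prompt_feats is a list with one (>= 2*n_pairs, d) array per
    prompt, holding features of motions generated from that single prompt."""
    dists = []
    for feats in per_prompt_feats:
        a, b = feats[:n_pairs], feats[n_pairs:2 * n_pairs]
        dists.append(np.linalg.norm(a - b, axis=1).mean())
    return float(np.mean(dists))
```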

Multimodal Distance (MM-Dist) is applied to evaluate the text-motion correspondence. Specifically, it computes the average Euclidean distance between each generated motion feature and its input prompt feature:

MM\textsc {-}Dist = \frac {1}{N}\sum _{i}^{N}\| f_{m}^{i} - f_{l}^{i} \| (10)
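Finally, a sketch of Eq. (10) together with the retrieval-style R-precision described above. The candidate-pool construction (each motion ranked against its own prompt plus randomly drawn mismatched prompts) is our assumption of the standard protocol; the pool size is a parameter, and the same routine with motion candidates and a pool of 16 applies to MR-precision.

```python
import numpy as np

def mm_dist(motion_feats, prompt_feats):
    """Eq. (10): mean distance between each motion and its own prompt feature."""
    return float(np.linalg.norm(motion_feats - prompt_feats, axis=1).mean())

def r_precision(motion_feats, prompt_feats, top_k=(1, 3, 5), pool=32, seed=0):
    """Fraction of samples whose matched prompt ranks within top-k among a
    pool of `pool` candidates (1 matched + pool-1 mismatched)."""
    n = len(motion_feats)
    rng = np.random.default_rng(seed)
    hits = {k: 0 for k in top_k}
    for i in range(n):
        negatives = rng.choice(np.delete(np.arange(n), i), pool - 1, replace=False)
        cand = np.concatenate(([i], negatives))
        dists = np.linalg.norm(prompt_feats[cand] - motion_feats[i], axis=1)
        rank = int(np.argsort(dists).tolist().index(0))  # rank of the matched prompt
        for k in top_k:
            hits[k] += int(rank < k)
    return {k: hits[k] / n for k in top_k}
```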

E Discussion
E.1 Related Work
In this section, we present more details about the related work as well as the open problems.
Background. Existing SL datasets and dictionaries are typically limited to 2D, which is ambiguous and insufficient for learners: as introduced in [29], different signs can appear identical in the 2D domain due to depth ambiguity. Therefore, 3D avatars and dictionaries are highly desired for efficient learning [37], teaching, and many downstream tasks. However, the creation of 3D avatar annotations for SL is a labor-intensive, entirely manual process conducted by SL experts, and the results are often unnatural [3]. As a result, there is no unified, large-scale, multi-prompt 3D sign language holistic motion dataset with precise hand mesh annotations. The lack of such 3D avatar data is a huge barrier to bringing meaningful applications to the Deaf community, such as 3D sign language production (SLP), 3D sign language recognition (SLR), and many downstream tasks such as digital simultaneous translators between spoken language and sign language in VR/AR.
Open problems. Overall, the chain of open problems is: 1) Current 3D avatar annotation methods for sign language are mostly manual, carried out by SL experts, and labor-intensive. 2) There is a lack of generic, automatic 3D expressive avatar annotation methods with detailed hand poses. 3) Due to the lack of such a generic annotation method, there is also no unified, large-scale, multi-prompt 3D co-articulated continuous sign language holistic motion dataset with precise hand mesh annotations. 4) Due to the above constraints, it is difficult to extend sign language applications to the highly desired 3D setting, e.g., 3D SLR and 3D SLP, which can serve many downstream applications such as virtual simultaneous SL translators and 3D dictionaries.
Following this problem chain, we review the state of the art from three aspects: 3D holistic mesh annotation pipelines, 3D sign language motion datasets, and 3D SL applications.
3D holistic mesh annotation: There are many prior works on reconstructing the holistic human body from RGB images with parametric models such as SMPL-X [39] and Adam [23]. Among them, TalkSHOW [59] proposes a fitting pipeline
based on SMPLify-X [39] with a photometric loss for facial details, and OSX [32] proposes a time-consuming, finetuning-based weak-supervision pipeline to generate pseudo 3D holistic annotations. However, such expressive parametric models have rarely been applied to the SL domain. [27] use off-the-shelf methods to estimate holistic 3D meshes on the GSLL sign language dataset [55]. Beyond that, only a concurrent work [14] reconstructs 3D holistic mesh annotations using linguistic priors, with group labels obtained from a sign classifier trained on the Corpus-based Dictionary of Polish Sign Language (CDPSL) [33], which is annotated with HamNoSys. As such, it relies on an existing sentence segmentation method [42] to generalize to multiple-sign videos. These methods cannot deal with challenging self-occlusions and hand–hand and hand–body interactions, which makes them insufficient for complex interacting-hand scenarios such as sign language. There is not yet a generic annotation pipeline that can handle complex interacting-hand cases in continuous and co-articulated SL videos.
Sign language datasets. While there are many well-organized continuous SL motion datasets [1, 2, 7, 9, 17, 20] with 2D videos or 2D keypoint annotations, the only existing 3D SL motion dataset with 3D holistic mesh annotations is [14], which is purely isolated-sign based and thus insufficient for tackling real-world applications in natural language scenarios. There is not yet a unified, large-scale, multi-prompt 3D SL holistic motion dataset with continuous, co-articulated signs and precise hand mesh annotations.
SL applications. Regarding SL applications, especially sign language production (SLP), [4] generates 2D motion sequences from HamNoSys, while [47] and [49] generate 3D keypoint sequences from glosses. In addition, there are early avatar approaches [5, 10, 11, 15, 64] with pre-defined protocols and characters; such avatar approaches are often hand-crafted and produce robotic, unnatural movements.

E.2 Licensing

Our dataset will first be released under the CC BY-NC-SA (Attribution-NonCommercial-ShareAlike) license for research purposes. Specifically, we will release the SMPL-X/MANO annotations and provide instructions to extract the data instead of distributing the raw videos. We also list the licenses of the data sources used in our dataset collection:
– How2Sign [9]: Creative Commons Attribution-NonCommercial 4.0 International License.
– DGS Corpus [40]: CC BY-NC license.
– Dicta-Sign: CC BY-NC-ND 4.0 license.
– WLASL [30]: Computational Use of Data Agreement (C-UDA-1.0).
Fig. 12: Comparisons of existing 3D holistic human mesh reconstruction methods (Ours, OSX, PIXIE) on the EHF dataset. Our annotation method produces significantly better holistic reconstructions with plausible poses, as well as the best pixel alignment. (Zoom in for a better view.)
Fig. 13: Our 3D holistic human mesh reconstruction method on in-the-wild cases. (Zoom in for a better view.)

Fig. 14: Visualization examples and analysis of our regularization term (front and side views; columns: input, w/o L_bio, full). The biomechanical constraints can alleviate the implausible poses caused by monocular depth ambiguity, which occur occasionally in complex interacting-hands scenarios for other monocular capture methods.
Fig. 15: Qualitative comparison with PT [47] on the Phoenix-2014T dataset. Each example shows the input spoken-language sentence, the motion generated by PT, the motion generated by our method, and the ground-truth video. Inputs: "in der nacht zehn bis sechzehn grad in einigen mittelgebirgstälern nur einstellige werte." (at night ten to sixteen degrees, only single-digit values in some low mountain valleys); "und auch am samstag im osten noch freundlich im westen dann zum teil kräftige schauer." (and on Saturday still friendly in the east, with partly heavy showers in the west); "am montag mal sonne mal wolken mit nur wenigen schauern." (on Monday a mix of sun and clouds with only a few showers).
