SignAvatars
1 Introduction
According to the World Health Organization, there are 466 million Deaf and hard-of-hearing people [8]. Among them, over 70 million communicate via sign languages (SLs), resulting in more than 300 different SLs across different communities [54]. While the fields of (spoken) natural language processing (NLP) and language-assisted computer vision (CV) are well explored, this is
not the case for the alternate and important communicative tool of SL, and ac-
curate generative models of holistic 3D avatars as well as dictionaries are highly
desired for efficient learning [37].
We argue that the lack of large-scale, targeted SL datasets is an important reason for this gap, putting a barrier in front of downstream tasks such as digital simultaneous SL translators. On one hand, existing SL datasets and dictionaries [1, 2, 7, 9, 17, 20] are typically limited to 2D videos or 2D keypoint annotations, which are insufficient for learners [29], as different signs can appear identical in the 2D domain due to depth ambiguity. On the other hand, while parametric holistic models exist for human bodies [39] or bodies & faces [59], there is no unified, large-scale, multi-prompt 3D holistic motion dataset with accurate hand mesh annotations, which are crucial for SL. The reason is that creating 3D avatar annotations for SL is a labor-intensive, entirely manual process conducted by SL experts, and the results are often unnatural [3].
    To address this challenge, we begin by gathering various data sources, from public datasets to continuous online videos, with mixed-prompt annotations including HamNoSys, spoken language, and words, and introduce the SignAvatars dataset. Overall, we compile 70K videos from 153 signers amounting to 8.34M frames. Unlike [14], our dataset is not limited to isolated signs (a single sign per video) with HamNoSys annotations, but also includes continuous and co-articulated signs. To augment our dataset with 3D full-body annotations, including 3D body, hand and face meshes as well as 2D & 3D keypoints, we introduce a unified automatic annotation pipeline.
2 Related Work
3   SignAvatars Dataset
Overview. SignAvatars is a holistic motion dataset composed of 70K video clips with 8.34M frames in total, containing body, hand and face motions as summarized in Tab. 2. We compile SignAvatars by gathering various data sources, from public datasets to online videos, and form seven subsets, whose distribution is reported in Fig. 2. Since the individual subsets do not naturally contain expressive 3D whole-body motion labels and 2D keypoints, we introduce a unified automatic annotation framework providing rich 3D holistic parametric SMPL-X annotations along with MANO annotations for the hands. Overall, we provide 117 hours of 70K video clips with 8.34M frames of motion data, annotated with accurate, expressive holistic 3D meshes.
Data | Video | Frame | Duration (hours) | Co-articulated | Pose Annotation (to date) | Signers
RWTH-Phoenix-2014T [7] | 8.25K | 0.94M | 11 | C | - | 9
DGS Corpus [17] | - | - | 50 | C | 2D keypoints | 327
BSL Corpus [50] | - | - | 125 | C | - | 249
MS-ASL [24] | 25K | - | 25 | I | - | 222
WL-ASL [30] | 21K | 1.39M | 14 | I | 2D keypoints | 119
How2Sign [9] | 34K | 5.7M | 79 | C | 2D keypoints, depth* | 11
CSL-Daily [20] | 21K | - | 23 | C | 2D keypoints, depth | 10
SIGNUM [57] | 33K | - | 55 | C | - | 25
AUTSL [51] | 38K | - | 21 | I | depth | 43
SGNify [14] | 0.05K | 4K | - | I | body mesh vertices | -
SignAvatars (Ours) | 70K | 8.34M | 117 | Both | SMPL-X, MANO, 2D&3D keypoints | 153
multi-view frames, resulting in 34K clips for the ASL subset. For the GSL subset, we mostly gather data from the publicly available PHOENIX14T dataset [7], following the official split, to obtain 8.25K video clips. For the HamNoSys subset, we collect 5.8K isolated-sign SL video clips from the Polish SL corpus [34] for PJM, and German Sign Language (DGS), Greek Sign Language (GRSL) and French Sign Language (LSF) from the DGS Corpus [40] and Dicta-Sign [35]. We finally gather 21K clips from word-level sources such as WLASL [30] to curate the isolated-sign word subset. Overall, we divide our dataset into four subsets based on the prompt categories, as shown in Fig. 2: (i) word, (ii) ASL, (iii) HamNoSys, and (iv) GSL.
[Fig. 2: Number of clips per subset (Word, ASL, HamNoSys, GSL), with per-subset statistics such as Word: 21K videos, 1.39M frames, 119 signers; PJM: 2.6K videos, 0.21M frames, 2 signers.]
Automatic annotation pipeline. To efficiently auto-label the SL videos with motion data given only RGB online videos, we design an automatic 3D SL annotation pipeline that is not limited to isolated signs. To ensure motion stability and 3D shape accuracy while maintaining efficiency during holistic 3D mesh recovery from SL videos, we propose an iterative fitting algorithm minimizing an objective heavily regularized both holistically and by biomechanical hand constraints [52]:
E(\theta, \beta, \phi) = \lambda_{J} L_{J} + \lambda_{\theta} L_{\theta} + \lambda_{\alpha} L_{\alpha} + \lambda_{\beta} L_{\beta} + \lambda_{s} L_{\mathrm{smooth}} + \lambda_{a} L_{\mathrm{angle}} + L_{\mathrm{bio}}    (1)
where θ is the full set of optimizable pose parameters, β the shape, and φ the facial expression. L_J is the 2D re-projection joint loss, which minimizes the difference between the SMPL-X joints projected into the image and the joints predicted by ViTPose [58] and MediaPipe [25]; 3D joints can be jointly optimized in L_J when ground truth is available. L_θ is the pose prior term following SMPLify-X [39]. Moreover, L_α is a prior penalizing extreme bending, applied only to elbows and knees, and L_β is the shape prior term. In addition, L_smooth, L_angle and L_bio are the smoothness regularization, angle-limit and biomechanical constraint terms, respectively. Finally, each λ is the weight of the corresponding loss term. Please refer to the appendix for more details. In what follows, we describe our regularizers in detail.
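To make the structure of Eq. (1) concrete, here is a minimal sketch of how such a weighted objective could be assembled in PyTorch; the `terms`/`weights` dictionaries and the numeric values are purely illustrative placeholders, not the released implementation.

```python
import torch

def total_energy(terms: dict, weights: dict) -> torch.Tensor:
    """Weighted sum in the spirit of Eq. (1).

    `terms` maps a term name to its already-computed scalar loss
    (e.g. {"J": L_J, "theta": L_theta, ...}); `weights` holds the
    corresponding lambda values. L_bio enters with weight 1.
    """
    E = terms["bio"]
    for name in ("J", "theta", "alpha", "beta", "smooth", "angle"):
        E = E + weights[name] * terms[name]
    return E

# Hypothetical usage with dummy scalar terms standing in for the real losses:
terms = {k: torch.tensor(0.1, requires_grad=True)
         for k in ("J", "theta", "alpha", "beta", "smooth", "angle", "bio")}
weights = {"J": 1.0, "theta": 0.1, "alpha": 0.1, "beta": 0.05, "smooth": 1.0, "angle": 1.0}
E = total_energy(terms, weights)
E.backward()  # in a real fit, gradients flow back to the pose/shape parameters
```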
Fig. 3: Overview of our automatic annotation pipeline. Given an RGB image sequence
as input, we perform a hierarchical initialization, followed by an optimization involving
temporal smoothness and biomechanical constraints. Finally, our pipeline outputs the
final 3D motion results as a sequence of SMPL-X parameters.
Holistic regularization. To reduce jitter in body and hand motion caused by noisy 2D keypoint detections, we employ a smoothness term defined as:
L_{\mathrm{smooth}} = \sum_{t} \left( \|\hat{\theta}^{b}_{1:T}\|_{2} + \|\tilde{\theta}^{h}_{1:T}\|_{2} + \|\theta^{h}_{2:T} - \theta^{h}_{1:T-1}\|_{2} + \|\theta^{b}_{2:T} - \theta^{b}_{1:T-1}\|_{2} \right)    (2)
where θ̂^b_{1:T} ∈ R^{N×j_b×3} is the selected subset of pose parameters from θ^b_{1:T} ∈ R^{N×J×3}, N is the number of frames in the video, and θ̃^h_{1:T} ∈ R^{N×j_h} is the selected subset of hand parameters from θ^h_{1:T}. j_b and j_h are the numbers of selected body joints and hand parameters, respectively. This term also prevents implausible poses along the bone direction, such as twists. Additionally, we penalize hand poses lying outside the plausible range by adding an angle-limit prior term:
L_{\mathrm{angle}} = \sum_{t} \left( \mathcal{I}(\|\theta^{h}_{1:T}\|_{2}; \theta_{\mathrm{min}}^{h}, \theta_{\mathrm{max}}^{h}) + \mathcal{I}(\|\theta^{b}_{1:T}\|_{2}; \theta_{\mathrm{min}}^{b}, \theta_{\mathrm{max}}^{b}) \right)    (3)
where I is the interval loss penalizing outliers, θ^{h,b}_{min} and θ^{h,b}_{max} are the pre-defined intervals, and θ^h, θ^b are the selected subsets of holistic poses. Finally, since the signer does not change within a video clip, we use consistent, optimized shape parameters β to represent the holistic body shape. Specifically, our fitting procedure is split into five stages: we optimize the shape in the first three stages to derive the mean shape and freeze it in the remaining stages.
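A minimal sketch of how the smoothness term (Eq. 2) and the interval penalty used by Eq. (3) could be implemented is given below; the tensor shapes, the index lists `idx_b`/`idx_h`, and the linear (ReLU) penalty outside the interval are illustrative assumptions.

```python
import torch

def smoothness_loss(theta_b: torch.Tensor, theta_h: torch.Tensor,
                    idx_b, idx_h) -> torch.Tensor:
    """Eq. (2) sketch: L2 on selected body/hand parameters plus frame-to-frame differences.

    theta_b: (N, J, 3) body poses over N frames; theta_h: (N, J_h) hand poses.
    idx_b / idx_h select the regularized subsets (assumed index lists).
    """
    sel_b = theta_b[:, idx_b]                               # (N, j_b, 3)
    sel_h = theta_h[:, idx_h]                               # (N, j_h)
    static = sel_b.norm(dim=-1).sum() + sel_h.norm(dim=-1).sum()
    temporal = (theta_h[1:] - theta_h[:-1]).norm(dim=-1).sum() \
             + (theta_b[1:] - theta_b[:-1]).norm(dim=-1).sum()
    return static + temporal

def interval_loss(x: torch.Tensor, lo: torch.Tensor, hi: torch.Tensor) -> torch.Tensor:
    """I(x; lo, hi): zero inside [lo, hi], linear penalty outside (used by Eqs. 3 and 4)."""
    return (torch.relu(lo - x) + torch.relu(x - hi)).sum()
```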
Biomechanical hand constraints. Hand pose estimation from monocular RGB images is challenging due to fast movements, interactions, frequent occlusions and confusion. To further improve hand motion quality and eliminate implausible hand poses, we biomechanically constrain the hand poses using three losses: (i) L_bl for bone length, (ii) L_palm for the palmar region, and (iii) L_ja for joint-angle priors. The final biomechanical loss L_bio is the weighted sum L_bio = λ_bl L_bl + λ_palm L_palm + λ_ja L_ja, with:
L_{\mathrm{bl}} = \sum_{i} \mathcal{I}(\|b^{i}_{1:T}\|_{2}; b_{\mathrm{min}}^{i}, b_{\mathrm{max}}^{i}), \quad L_{\mathrm{ja}} = \sum_{i} D(\alpha_{1:T}^{i}, H^{i}), \quad L_{\mathrm{palm}} = \sum_{i} \left( \mathcal{I}(\|c^{i}_{1:T}\|_{2}; c_{\mathrm{min}}^{i}, c_{\mathrm{max}}^{i}) + \mathcal{I}(\|d^{i}_{1:T}\|_{2}; d_{\mathrm{min}}^{i}, d_{\mathrm{max}}^{i}) \right)    (4)
where I is the interval loss penalizing outliers, b^i is the length of the i-th finger bone, and the constraints are applied over the whole sequence [1:T]. We further constrain the curvature and angular distance of the four root bones supporting the palmar structure by penalizing outliers of the curvature range (c^i_min, c^i_max) and the angular-distance range (d^i_min, d^i_max). Inspired by [52], we also constrain the sequence of joint angles α^i_{1:T} = (α^f_{1:T}, α^a_{1:T}), where α^f and α^a are the flexion and abduction angles, by approximating the convex hull of the (α^f, α^a) plane with a point set H^i and minimizing the distance D to it. We refer the reader to our appendix for more details.
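As an illustration of the interval-based bone-length constraint L_bl in Eq. (4), a sketch could look as follows; the joint layout, the `bones` index pairs, and the bounds are hypothetical, and L_palm / L_ja would follow the same interval/distance pattern.

```python
import torch

def bone_length_loss(joints3d: torch.Tensor, bones, b_min: torch.Tensor,
                     b_max: torch.Tensor) -> torch.Tensor:
    """Sketch of L_bl in Eq. (4): keep each finger-bone length inside a plausible range.

    joints3d: (T, 21, 3) hand joints over the sequence; `bones` is a hypothetical
    list of (parent, child) joint-index pairs; b_min/b_max are per-bone bounds (in m).
    """
    parent = joints3d[:, [p for p, _ in bones]]
    child = joints3d[:, [c for _, c in bones]]
    lengths = (child - parent).norm(dim=-1)          # (T, num_bones)
    too_short = torch.relu(b_min - lengths)          # interval penalty below the range
    too_long = torch.relu(lengths - b_max)           # interval penalty above the range
    return (too_short + too_long).sum()

# The full biomechanical term then combines the three parts with their weights:
# L_bio = lambda_bl * L_bl + lambda_palm * L_palm + lambda_ja * L_ja
```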
Hierarchical initialization. Given an RGB image sequence, we initialize the holistic SMPL-X parameters from OSX [32]. However, due to frequent occlusions and hand interactions, OSX alone is not always sufficient for a good initialization. We therefore fuse OSX with ACR [60] and PARE [26] to improve stability under occlusion and truncation. For the 2D holistic keypoint initialization, we first train a whole-body 2D pose estimation model on COCO-WholeBody [22] based on ViTPose [58] and then combine it with MediaPipe [25] by fusing the predictions through a confidence-guided filter.
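The exact filter is described in the appendix; as a rough sketch, a confidence-guided fusion of the two 2D keypoint estimates could look like the following, where the per-joint selection rule and the `min_conf` threshold are assumptions made for illustration.

```python
import numpy as np

def fuse_keypoints(kp_vit: np.ndarray, conf_vit: np.ndarray,
                   kp_mp: np.ndarray, conf_mp: np.ndarray,
                   min_conf: float = 0.3):
    """Hypothetical confidence-guided fusion of two 2D keypoint estimates.

    kp_*: (K, 2) keypoints, conf_*: (K,) confidences. Per joint we keep the
    higher-confidence prediction and zero out joints below `min_conf`,
    which the downstream fitting can then ignore in L_J.
    """
    take_vit = conf_vit >= conf_mp
    kp = np.where(take_vit[:, None], kp_vit, kp_mp)
    conf = np.maximum(conf_vit, conf_mp)
    valid = conf >= min_conf
    return kp * valid[:, None], conf * valid
```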
Fig. 4: Our 3D SLP network, SignVAE, consists of two stages. We first create semantic and motion codebooks using two VQ-VAEs, mapping inputs to their respective code indices. Then, an autoregressive model generates motion code indices conditioned on semantic code indices, ensuring a coherent understanding of the data.
by \hat{f}^{m}_{j} = \arg\min_{z_i \in Z} \|f^{m}_{j} - z_i\|_{2}. Finally, the quantized latent features are fed into the decoders for reconstruction. To train the SL motion generator, we apply the standard optimization scheme with L_{m-vq}:
L_{m\text{-}vq} = L_{recon}(M_{1:T}, \hat{M}_{1:T}) + \|\mathrm{sg}[F_{1:T}^{m}] - \hat{F}_{1:T}^{m}\|_{2} + \beta \|F_{1:T}^{m} - \mathrm{sg}[\hat{F}_{1:T}^{m}]\|_{2}    (5)
L_{l\text{-}vq} = L_{recon}(E^{l}_{1:s}, \hat{E}^{l}_{1:s}) + \|\mathrm{sg}[F^{l}_{1:s}] - \hat{F}^{l}_{1:s}\|_{2} + \beta \|F^{l}_{1:s} - \mathrm{sg}[\hat{F}^{l}_{1:s}]\|_{2}    (6)
where F^{l}_{1:s} is the latent feature obtained by encoding the initial linguistic feature, and \hat{F}^{l}_{1:s} is the quantized linguistic feature obtained by applying \hat{f}^{l}_{j} = \arg\min_{z_i \in Z_l} \|f^{l}_{j} - z_i\|_{2} to F^{l}_{1:s}.
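For readers unfamiliar with VQ-VAE training, the following sketch illustrates nearest-codebook quantization and the loss structure of Eqs. (5)-(6); the choice of L1/MSE for the individual terms and the straight-through trick mentioned in the comments are standard choices assumed here, not necessarily the exact ones used in our implementation.

```python
import torch
import torch.nn.functional as F

def quantize(f: torch.Tensor, codebook: torch.Tensor):
    """Nearest-neighbour lookup: f (B, T, D) latents, codebook (K, D) entries Z."""
    dist = torch.cdist(f.reshape(-1, f.shape[-1]), codebook)   # (B*T, K)
    idx = dist.argmin(dim=-1)
    f_hat = codebook[idx].view_as(f)
    return f_hat, idx.view(f.shape[:-1])

def vq_loss(m, m_rec, f, f_hat, beta: float = 0.25) -> torch.Tensor:
    """Eq. (5)/(6) sketch: reconstruction + codebook + commitment terms.

    sg[.] is realised with .detach(); during training the decoder would receive
    the straight-through estimate f + (f_hat - f).detach().
    """
    recon = F.l1_loss(m_rec, m)                        # L_recon (L1 assumed)
    codebook_term = F.mse_loss(f_hat, f.detach())      # ||sg[F] - F_hat||
    commit_term = F.mse_loss(f, f_hat.detach())        # beta * ||F - sg[F_hat]||
    return recon + codebook_term + beta * commit_term
```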
Sign-motion cross modelling and production. After training the VQ-VAE-based SL motion generator, we can map any motion sequence M_{1:T} to a sequence of indices X = [x_1, ..., x_{T/w}, x_{EOS}] through the motion encoder and quantization, where x_{EOS} is a learnable end token representing the stop signal. After training both the SL motion generator and the linguistic feature generator, our network is jointly optimized in a parallel manner. Specifically, we fuse the linguistic feature embedding E^l and the codebook index vectors of Z_l to form the final condition for our autoregressive code-index generator. The objective for training the code-index generator is an autoregressive next-index prediction task, learned with a cross-entropy loss between the likelihood of the full predicted code-index sequence and the real one: L_{SLP} = E_{X∼p(X)}[-log p(X|c)].
    Lastly, with the quantized motion representation, we generate the codebook vectors in a temporally autoregressive manner, predicting the distribution of the next codebook index given an input linguistic prompt as condition c. After mapping the codebook indices X to the quantized motion representation \hat{F}^{m}_{1:(T/w)}, we decode and produce the final 3D holistic motion with mesh representation M_{1:T}.
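Conceptually, inference then reduces to sampling code indices one step at a time until the end token is produced; a minimal sketch is shown below, where `model(cond, prefix)` is a hypothetical interface returning next-index logits, not the actual SignVAE API.

```python
import torch

@torch.no_grad()
def generate_indices(model, cond: torch.Tensor, eos_id: int, max_len: int = 128):
    """Autoregressive next-index prediction conditioned on the linguistic code c.

    `model(cond, prefix)` is a placeholder returning logits over the motion
    codebook (plus EOS) for the next position; real interfaces will differ.
    """
    prefix = torch.empty(1, 0, dtype=torch.long)
    for _ in range(max_len):
        logits = model(cond, prefix)                 # (1, K+1) next-token logits
        nxt = torch.distributions.Categorical(logits=logits).sample()  # (1,)
        if nxt.item() == eos_id:                     # learnable end token stops decoding
            break
        prefix = torch.cat([prefix, nxt[:, None]], dim=1)
    return prefix                                    # indices X = [x_1, ..., x_{T/w}]
```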
5 Experimental Evaluation
With an evaluation model trained following prior arts in motion generation [53, 62], we use the metrics FID, Diversity, Multimodality (MM), MM-Dist and MR-Precision, whose details are provided in our supplementary material. Unfortunately, there is no de-facto standard for evaluating 3D SLP in the literature. While [29] can back-translate 3D SL motion by treating it as a classification task, it is tailored only to word-level back-translation. While BLEU and ROUGE are commonly used in back-translation evaluation [47, 49], they do not generalize to other types of annotations such as HamNoSys or glosses. Since the generated motion might differ in length from the real motion, absolute metrics like MPJPE are also unsuitable. Inspired by [4, 21], we therefore propose a new MR-Precision metric for motion retrieval as well as DTW-MJE (Dynamic Time Warping - Mean Joint Error) [28] over the standard SMPL-X keypoint set excluding the lower body, to evaluate our method as well as the baselines.
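As a reference for the metric, a straightforward sketch of DTW-MJE is shown below: frames are aligned by dynamic time warping with the mean per-joint Euclidean error as cost; the path-length normalization used here is an assumption.

```python
import numpy as np

def dtw_mje(pred: np.ndarray, gt: np.ndarray) -> float:
    """DTW-MJE sketch: align two motions of different lengths with dynamic time
    warping, using the mean per-joint Euclidean error as the frame-to-frame cost.

    pred: (Tp, K, 3), gt: (Tg, K, 3) keypoints (upper-body SMPL-X subset assumed).
    Returns the accumulated path cost normalised by max(Tp, Tg) (an assumption).
    """
    Tp, Tg = len(pred), len(gt)
    cost = np.linalg.norm(pred[:, None] - gt[None], axis=-1).mean(-1)  # (Tp, Tg)
    D = np.full((Tp + 1, Tg + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, Tp + 1):
        for j in range(1, Tg + 1):
            D[i, j] = cost[i - 1, j - 1] + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[Tp, Tg] / max(Tp, Tg)
```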
Subsets & training settings. We report results on three representative subsets: (i) the complete ASL set for spoken language (corresponding to language in Tab. 6), (ii) the word subset with a 300-word vocabulary, and (iii) the combined DGS, LSF, PJM and GRSL subset for HamNoSys. For training, we follow the official splits for (i) and (ii). For (iii), we use a four-fold strategy: we train on three folds and test on the remaining one, repeating this four times to obtain the final results.
Benchmarking & results. To the best of our knowledge, there is no publicly available benchmark for 3D mesh & motion-based SLP⁵. To establish SignAvatars as the first 3D motion-based SLP benchmark, we present detailed quantitative results in Tab. 6. 3D SLP with word-level prompts achieves the best performance, approaching the quality of real motions. Learning from spoken languages is a naturally harder task, and we invite the community to develop stronger methods for 3D SLP from spoken languages. To further evaluate sign accuracy and the effect of body movement, we report separate results for the arms alone ("Gesture"), with slight improvements in FID.

⁵ [46] does not provide a public evaluation model, as discussed in our Appendix.
Table 6: Quantitative evaluation results for the 3D holistic SL motion generation. Real
motion denotes the motions sampled from the original holistic motion annotation in the
dataset. Holistic represents the generation results regarding holistic motion. Gesture
stands for the evaluation conducted on two arms. Div. refers to Diversity.
    Fig. 8 shows qualitative results of continuous 3D holistic body motion generation. Our method generates plausible and accurate holistic 3D motion from a variety of prompts while exhibiting some diversity, which enriches the production results. We provide further examples in our supplementary material.
Ablation on PLFG. To study our text-sign cross-modeling module, we introduce a baseline, SignVAE (Base), replacing the PLFG with a canonical pre-trained CLIP feature as input to the encoder. As shown in Tab. 5,
[Fig. 8 examples: input prompts include spoken-language sentences (e.g., "So this is a really important tool to have in doing your photography."), HamNoSys strings, and isolated words ("league", "notice"); each sample shows the generated motion above the corresponding ground-truth video clip.]
Fig. 8: Qualitative results of 3D holistic SLP from different prompts (left: spoken language, top right: HamNoSys, bottom right: word). Within each sample, the first two rows are the input prompt and the generated results, and the last row is the corresponding video clip from our dataset.
our joint scheme utilizing the PLFG significantly improves prompt-motion consistency, reflected in improved R-Precision and MM-Dist. Moreover, our VQ-VAE backbone, which quantizes the motion representation into a motion codebook, enables interaction with the linguistic feature codebook. This leads to significant improvements in prompt-motion correspondence, outperforming other baselines built with our linguistic feature generator (SignDiffuse, Ham2Pose-3D) and producing more text-motion-consistent results.
6         Conclusion
We introduced SignAvatars, the first large-scale 3D holistic SL motion dataset with expressive 3D human and hand mesh annotations, produced by our automatic annotation pipeline. SignAvatars opens up a variety of applications for Deaf and hard-of-hearing communities. Building on our dataset, we proposed the first 3D sign language production approach generating natural holistic mesh motion sequences from SL prompts. We also introduced the first benchmark results for this new task: continuous and co-articulated 3D holistic SL motion production from diverse SL prompts. Our evaluations on this benchmark clearly show the advantage of our new VQ-VAE-based SignVAE model over the baselines we developed.
Limitations and future work. Having the first benchmark at hand opens up a sea of potential in-depth investigations of other 3D techniques for 3D SL motion generation. In particular, the lack of a sophisticated and generic 3D back-translation method may prevent our evaluations from fully showcasing the superiority of the proposed method; we leave this for a future study. Moreover, combining 3D SLT and SLP into a multi-modal, generic SL framework is another direction for future work. Developing a large 3D sign language motion model with more properties and AR/VR applications would significantly benefit Deaf and hard-of-hearing people around the world, as well as the countless hearing individuals interacting with them. As such, we invite the research community to develop even stronger baselines.
References
 1. Albanie, S., Varol, G., Momeni, L., Afouras, T., Chung, J.S., Fox, N., Zisserman,
    A.: BSL-1K: Scaling up co-articulated sign language recognition using mouthing
    cues. In: European Conference on Computer Vision (2020) 2, 4, 26
 2. Albanie, S., Varol, G., Momeni, L., Bull, H., Afouras, T., Chowdhury, H., Fox, N.,
    Woll, B., Cooper, R., McParland, A., Zisserman, A.: BOBSL: BBC-Oxford British
    Sign Language Dataset. https://www.robots.ox.ac.uk/~vgg/data/bobsl (2021)
    2, 4, 26
 3. Aliwy, A.H., Ahmed, A.A.: Development of arabic sign language dictionary using
    3d avatar technologies. Indonesian Journal of Electrical Engineering and Computer
    Science 21(1), 609–616 (2021) 2, 25
 4. Arkushin, R.S., Moryossef, A., Fried, O.: Ham2pose: Animating sign language no-
    tation into pose sequences. In: Proceedings of the IEEE/CVF Conference on Com-
    puter Vision and Pattern Recognition. pp. 21046–21056 (2023) 4, 12, 13, 26
 5. Bangham, J.A., Cox, S., Elliott, R., Glauert, J.R., Marshall, I., Rankov, S., Wells,
    M.: Virtual signing: Capture, animation, storage and transmission-an overview
    of the visicast project. In: IEE Seminar on speech and language processing for
    disabled and elderly people (Ref. No. 2000/025). pp. 6–1. IET (2000) 4, 26
 6. Blaisel, X.: David F. Armstrong, William C. Stokoe, Sherman E. Wilcox, Gesture and
    the Nature of Language, Cambridge, Cambridge University Press, 1995, x+260 p.,
    bibliogr., index. Anthropologie et Sociétés 21(1), 135–137 (1997) 5
 7. Camgoz, N.C., Hadfield, S., Koller, O., Ney, H., Bowden, R.: Neural sign language
    translation. In: Proceedings of the IEEE conference on computer vision and pattern
    recognition. pp. 7784–7793 (2018) 2, 4, 5, 6, 22, 26
 8. Davis, A.C., Hoffman, H.J.: Hearing loss: rising prevalence and impact. Bulletin of
    the World Health Organization 97(10), 646 (2019) 1
 9. Duarte, A., Palaskar, S., Ventura, L., Ghadiyaram, D., DeHaan, K., Metze, F.,
    Torres, J., Giro-i Nieto, X.: How2sign: a large-scale multimodal dataset for con-
    tinuous american sign language. In: Proceedings of the IEEE/CVF conference on
    computer vision and pattern recognition. pp. 2735–2744 (2021) 2, 4, 5, 26
10. Ebling, S., Glauert, J.: Building a swiss german sign language avatar with jasigning
    and evaluating it among the deaf community. Universal Access in the Information
    Society 15, 577–587 (2016) 4, 26
11. Efthimiou, E., Fotinea, S.E., Hanke, T., Glauert, J., Bowden, R., Braffort, A., Col-
    let, C., Maragos, P., Goudenove, F.: Dicta-sign: sign language recognition, genera-
    tion and modelling with application in deaf communication. In: sign-lang@ LREC
    2010. pp. 80–83. European Language Resources Association (ELRA) (2010) 4, 26
12. Fan, Z., Taheri, O., Tzionas, D., Kocabas, M., Kaufmann, M., Black, M.J., Hilliges,
    O.: ARCTIC: A dataset for dexterous bimanual hand-object manipulation. In:
    IEEE Conference on Computer Vision and Pattern Recognition (2023) 4
13. Feng, Y., Choutas, V., Bolkart, T., Tzionas, D., Black, M.J.: Collaborative regres-
    sion of expressive bodies using moderation. In: International Conference on 3D
    Vision (3DV) (2021) 10, 11
14. Forte, M.P., Kulits, P., Huang, C.H.P., Choutas, V., Tzionas, D., Kuchenbecker,
    K.J., Black, M.J.: Reconstructing signing avatars from video using linguistic priors.
    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
    Recognition. pp. 12791–12801 (2023) 2, 4, 5, 26
15. Gibet, S., Lefebvre-Albaret, F., Hamon, L., Brun, R., Turki, A.: Interactive editing
    in french sign language dedicated to virtual signers: Requirements and challenges.
    Universal Access in the Information Society 15, 525–539 (2016) 4, 26
16. Guo, C., Zou, S., Zuo, X., Wang, S., Ji, W., Li, X., Cheng, L.: Generating di-
    verse and natural 3d human motions from text. In: Proceedings of the IEEE/CVF
    Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5152–5161
    (June 2022) 22
17. Hanke, T., Schulder, M., Konrad, R., Jahn, E.: Extending the public dgs corpus
    in size and depth. In: sign-lang@ LREC 2020. pp. 75–82. European Language
    Resources Association (ELRA) (2020) 2, 4, 5, 26
18. Hasson, Y., Varol, G., Tzionas, D., Kalevatykh, I., Black, M.J., Laptev, I., Schmid,
    C.: Learning joint reconstruction of hands and manipulated objects. In: Proceed-
    ings of the IEEE/CVF conference on computer vision and pattern recognition. pp.
    11807–11816 (2019) 4
19. Hu, H., Zhao, W., Zhou, W., Li, H.: Signbert+: Hand-model-aware self-supervised
    pre-training for sign language understanding. IEEE Transactions on Pattern Anal-
    ysis and Machine Intelligence (2023) 4
20. Huang, J., Zhou, W., Zhang, Q., Li, H., Li, W.: Video-based sign language recog-
    nition without temporal segmentation. In: Proceedings of the AAAI Conference on
    Artificial Intelligence. vol. 32 (2018) 2, 4, 5, 26
21. Huang, W., Pan, W., Zhao, Z., Tian, Q.: Towards fast and high-quality sign lan-
    guage production. In: Proceedings of the 29th ACM International Conference on
    Multimedia. pp. 3172–3181 (2021) 12
22. Jin, S., Xu, L., Xu, J., Wang, C., Liu, W., Qian, C., Ouyang, W., Luo, P.: Whole-
    body human pose estimation in the wild. In: Proceedings of the European Confer-
    ence on Computer Vision (ECCV) (2020) 8
23. Joo, H., Simon, T., Sheikh, Y.: Total capture: A 3d deformation model for tracking
    faces, hands, and bodies. In: Proceedings of the IEEE conference on computer
    vision and pattern recognition. pp. 8320–8329 (2018) 4, 25
24. Joze, H.R.V., Koller, O.: Ms-asl: A large-scale data set and benchmark for under-
    standing american sign language. arXiv preprint arXiv:1812.01053 (2018) 5
25. Kartynnik, Y., Ablavatski, A., Grishchenko, I., Grundmann, M.: Real-time fa-
    cial surface geometry from monocular video on mobile gpus. arXiv preprint
    arXiv:1907.06724 (2019) 6, 8
26. Kocabas, M., Huang, C.H.P., Hilliges, O., Black, M.J.: Pare: Part attention regres-
    sor for 3d human body estimation. In: Proceedings of the IEEE/CVF International
    Conference on Computer Vision. pp. 11127–11137 (2021) 8
27. Kratimenos, A., Pavlakos, G., Maragos, P.: Independent sign language recognition
    with 3d body, hands, and face reconstruction. In: ICASSP 2021-2021 IEEE Inter-
    national Conference on Acoustics, Speech and Signal Processing (ICASSP). pp.
    4270–4274. IEEE (2021) 4, 26
28. Kruskal, J.B.: An overview of sequence comparison: Time warps, string edits, and
    macromolecules. SIAM review 25(2), 201–237 (1983) 12
29. Lee, T., Oh, Y., Lee, K.M.: Human part-wise 3d motion context learning for sign
    language recognition. In: Proceedings of the IEEE/CVF International Conference
    on Computer Vision. pp. 20740–20750 (2023) 2, 12, 23, 24, 25
30. Li, D., Rodriguez, C., Yu, X., Li, H.: Word-level deep sign language recognition
    from video: A new large-scale dataset and methods comparison. In: Proceedings
    of the IEEE/CVF winter conference on applications of computer vision. pp. 1459–
    1469 (2020) 5, 6, 26
31. Lin, J., Zeng, A., Lu, S., Cai, Y., Zhang, R., Wang, H., Zhang, L.: Motionx: A large-
    scale 3d expressive whole-body human motion dataset. In: Advances in Neural
    Information Processing Systems (2023) 10, 11
32. Lin, J., Zeng, A., Wang, H., Zhang, L., Li, Y.: One-stage 3d whole-body mesh
    recovery with component aware transformer. In: Proceedings of the IEEE/CVF
    Conference on Computer Vision and Pattern Recognition (2023) 4, 8, 10, 11, 26
33. Linde-Usiekniewicz, J., Czajkowska-Kisil, M., Łacheta, J., Rutkowski, P.: A corpus-
    based dictionary of polish sign language (pjm) 4, 26
34. Linde-Usiekniewicz, J., Czajkowska-Kisil, M., Łacheta, J., Rutkowski, P.: A corpus-
    based dictionary of polish sign language (pjm). In: Proceedings of the XVI EU-
    RALEX International Congress: The user in focus. pp. 365–376 (2014) 6
35. Matthes, S., Hanke, T., Regen, A., Storz, J., Worseck, S., Efthimiou, E., Dimou,
    A.L., Braffort, A., Glauert, J., Safar, E.: Dicta-sign-building a multilingual sign
    language corpus. In: 5th Workshop on the Representation and Processing of Sign
    Languages: Interactions between Corpus and Lexicon. Satellite Workshop to the
    eighth International Conference on Language Resources and Evaluation (LREC-
    2012) (2012) 6
36. Moon, G., Choi, H., Lee, K.M.: Accurate 3d hand pose estimation for whole-body
    3d human mesh estimation. In: Computer Vision and Pattern Recognition Work-
    shop (CVPRW) (2022) 10
37. Naert, L., Larboulette, C., Gibet, S.: A survey on the animation of signing avatars:
    From sign representation to utterance synthesis. Computers & Graphics 92, 76–98
    (2020) 2, 25
38. Nocedal, J., Wright, S.J.: Numerical optimization. Springer (1999) 23
39. Pavlakos, G., Choutas, V., Ghorbani, N., Bolkart, T., Osman, A.A.A., Tzionas,
    D., Black, M.J.: Expressive body capture: 3d hands, face, and body from a single
    image. In: Proceedings IEEE Conf. on Computer Vision and Pattern Recognition
    (CVPR) (2019) 2, 3, 4, 5, 6, 10, 11, 20, 25, 26
40. Prillwitz, S., Hanke, T., König, S., Konrad, R., Langer, G., Schwarz, A.: Dgs cor-
    pus project–development of a corpus based electronic dictionary german sign lan-
    guage/german. In: sign-lang@ LREC 2008. pp. 159–164. European Language Re-
    sources Association (ELRA) (2008) 6, 26
41. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G.,
    Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from
    natural language supervision. In: International conference on machine learning. pp.
    8748–8763. PMLR (2021) 9
42. Renz, K., Stache, N.C., Fox, N., Varol, G., Albanie, S.: Sign segmentation with
    changepoint-modulated pseudo-labelling. In: Proceedings of the IEEE/CVF Con-
    ference on Computer Vision and Pattern Recognition. pp. 3403–3412 (2021) 4,
    26
43. Romero, J., Tzionas, D., Black, M.J.: Embodied hands: Modeling and capturing
    hands and bodies together. arXiv preprint arXiv:2201.02610 (2022) 3, 5
44. Rong, Y., Shiratori, T., Joo, H.: Frankmocap: A monocular 3d whole-body pose es-
    timation system via regression and integration. In: IEEE International Conference
    on Computer Vision Workshops (2021) 10
45. Sanabria, R., Caglayan, O., Palaskar, S., Elliott, D., Barrault, L., Specia, L.,
    Metze, F.: How2: a large-scale dataset for multimodal language understanding.
    arXiv preprint arXiv:1811.00347 (2018) 5
46. Saunders, B., Camgoz, N.C., Bowden, R.: Adversarial training for multi-channel
    sign language production. arXiv preprint arXiv:2008.12405 (2020) 12, 22
47. Saunders, B., Camgoz, N.C., Bowden, R.: Progressive transformers for end-to-
    end sign language production. In: Computer Vision–ECCV 2020: 16th European
    Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI 16. pp. 687–
    705. Springer (2020) 4, 12, 21, 22, 26, 29
48. Saunders, B., Camgoz, N.C., Bowden, R.: Continuous 3d multi-channel sign lan-
    guage production via progressive transformers and mixture density networks. In-
    ternational journal of computer vision 129(7), 2113–2135 (2021) 22
49. Saunders, B., Camgoz, N.C., Bowden, R.: Mixed signals: Sign language production
    via a mixture of motion primitives. In: Proceedings of the IEEE/CVF International
    Conference on Computer Vision. pp. 1919–1929 (2021) 4, 12, 22, 26
50. Schembri, A., Fenlon, J., Rentelis, R., Reynolds, S., Cormier, K.: Building the
    british sign language corpus. LaNguagE DocumENtatIoN & coNSErvatIoN (2013)
    5
51. Sincan, O.M., Keles, H.Y.: Autsl: A large scale multi-modal turkish sign language
    dataset and baseline methods. IEEE Access 8, 181340–181355 (2020) 5
52. Spurr, A., Iqbal, U., Molchanov, P., Hilliges, O., Kautz, J.: Weakly supervised 3d
    hand pose estimation via biomechanical constraints. In: European conference on
    computer vision. pp. 211–228. Springer (2020) 6, 8
53. Tevet, G., Raab, S., Gordon, B., Shafir, Y., Cohen-Or, D., Bermano, A.H.: Human
    motion diffusion model. arXiv preprint arXiv:2209.14916 (2022) 12, 13, 24
54. The World Federation of the Deaf: Who we are, http://wfdeaf.org/who-we-are/ 1
55. Theodorakis, S., Pitsikalis, V., Maragos, P.: Dynamic–static unsupervised sequen-
    tiality, statistical subunits and lexicon for sign language recognition. Image and
    Vision Computing 32(8), 533–549 (2014) 4, 26
56. Van Den Oord, A., Vinyals, O., et al.: Neural discrete representation learning.
    Advances in neural information processing systems 30 (2017) 3
57. Von Agris, U., Knorr, M., Kraiss, K.F.: The significance of facial features for au-
    tomatic sign language recognition. In: 2008 8th IEEE international conference on
    automatic face & gesture recognition. pp. 1–6. IEEE (2008) 5
58. Xu, Y., Zhang, J., Zhang, Q., Tao, D.: Vitpose: Simple vision transformer baselines
    for human pose estimation. Advances in Neural Information Processing Systems
    35, 38571–38584 (2022) 6, 8
59. Yi, H., Liang, H., Liu, Y., Cao, Q., Wen, Y., Bolkart, T., Tao, D., Black, M.J.: Gen-
    erating holistic 3d human motion from speech. In: Proceedings of the IEEE/CVF
    Conference on Computer Vision and Pattern Recognition (2023) 2, 4, 25
60. Yu, Z., Huang, S., Chen, F., Breckon, T.P., Wang, J.: Acr: Attention collaboration-
    based regressor for arbitrary two-hand reconstruction. In: Proceedings of the
    IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
    (June 2023) 8
61. Zhang, H., Tian, Y., Zhang, Y., Li, M., An, L., Sun, Z., Liu, Y.: Pymaf-x: Towards
    well-aligned full-body model regression from monocular images. IEEE Transactions
    on Pattern Analysis and Machine Intelligence (2023) 11
62. Zhang, J., Zhang, Y., Cun, X., Huang, S., Zhang, Y., Zhao, H., Lu, H., Shen,
    X.: T2m-gpt: Generating human motion from textual descriptions with discrete
    representations. In: Proceedings of the IEEE/CVF Conference on Computer Vision
    and Pattern Recognition (CVPR) (2023) 12, 24
63. Zhu, H., Zuo, X., Wang, S., Cao, X., Yang, R.: Detailed human shape estima-
    tion from a single image by hierarchical mesh deformation. In: Proceedings of the
    IEEE/CVF conference on computer vision and pattern recognition. pp. 4491–4500
    (2019) 5
64. Zwitserlood, I., Verlinden, M., Ros, J., Van Der Schoot, S., Netherlands, T.: Syn-
    thetic signing for the deaf: Esign. In: Proceedings of the conference and workshop
    on assistive technologies for vision and hearing impairment (CVHI) (2004) 4, 26
We provide further details of our SignAvatars dataset and present more visualizations of our data in Figs. 9, 10 and 11. Being the first large-scale multi-prompt 3D sign language (SL) motion dataset with accurate holistic mesh representations, our dataset enables various tasks such as 3D sign language recognition (SLR) and the novel 3D SL production (SLP) from diverse inputs like text scripts, individual words, and HamNoSys notation. We also provide a demo video in the supplementary materials and our anonymous project page: https://anonymoususer4ai.github.io/.

Fig. 10: More HamNoSys-level examples of SignAvatars (HamNoSys subset). The annotations cover signers with different body shapes, showing accurate body and hand estimation.

Fig. 11: More word-level examples of SignAvatars (word subset). The annotations cover signers with different body shapes, showing accurate body and hand estimation.
– First, we generate mesh annotations for the Phoenix-2014T dataset and add them as our GSL subset. We follow the original data distribution and official split to train our network.
– Second, because the generation model checkpoints are lacking in addition to the evaluation model, we re-train PT using the official implementation on both the 3D-lifted OpenPose keypoints J_PT and the 3D keypoints J_ours regressed from our mesh representation, corresponding to PT (J_PT) and PT (J_ours).
– Third, we train two 3D keypoint-based SL motion evaluation models on this subset with J_PT and J_ours, resulting in two model checkpoints C_PT and C_ours.
Furthermore, we also provide qualitative comparison results in Fig. 15. Please see more visualizations in our supplementary video and project page.
Discussion. With SignAvatars, our goal is to provide an up-to-date, publicly available 3D holistic mesh motion-based SLP benchmark, and we invite the community to participate. As an alternative for re-evaluation, we could also develop a new 3D sign language translation (SLT) method to re-evaluate PT and compare it with our method using BLEU and ROUGE. As part of our future work on SL understanding, we also encourage the SL community to develop back-translation and mesh-based SLT methods trained on our benchmark. We believe that the 3D holistic mesh representation offers significant improvements for accurate SL-motion correspondence understanding compared to purely 2D methods, as shown in Tab. 4 and Tab. 5 of the main paper and confirmed by a recent 3D SLT work [29].
where µ_gt and µ_pred are the mean feature values of real and generated motion, respectively, and C and Tr denote the covariance matrix and the trace of a matrix. Diversity evaluates the variance of the generated SL motion: we randomly sample N_D = 300 motion feature pairs {f_m, f'_m} and compute the average Euclidean distance between them.
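For completeness, assuming these metrics follow the standard definitions used in motion-generation benchmarks (e.g., [16]), the corresponding formulas with the symbols above read:

```latex
\mathrm{FID} = \|\mu_{gt} - \mu_{pred}\|_2^2
  + \mathrm{Tr}\!\left(C_{gt} + C_{pred} - 2\,(C_{gt}\, C_{pred})^{1/2}\right),
\qquad
\mathrm{Diversity} = \frac{1}{N_D}\sum_{i=1}^{N_D}\left\|f_{m,i} - f'_{m,i}\right\|_2 .
```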
E     Discussion
E.1   Related Work
In this section, we present more details about the related work as well as the open problems.
Background. Existing SL datasets and dictionaries are typically limited to 2D, which is ambiguous and insufficient for learners: as introduced in [29], different signs can appear identical in the 2D domain due to depth ambiguity. Hence, 3D avatars and dictionaries are highly desired for efficient learning [37], teaching, and many downstream tasks. However, the creation of 3D avatar annotations for SL is a labor-intensive, entirely manual process conducted by SL experts, and the results are often unnatural [3]. As a result, there is no unified, large-scale, multi-prompt 3D sign language holistic motion dataset with precise hand mesh annotations. The lack of such 3D avatar data is a huge barrier to bringing meaningful applications to the Deaf community, such as 3D sign language production (SLP), 3D sign language recognition (SLR), and many downstream tasks such as digital simultaneous translators between spoken language and sign language in VR/AR.
Open problems. Overall, the chain of open problems is: 1) Current 3D avatar annotation methods for sign language are mostly manual, carried out by SL experts, and labor-intensive. 2) There is a lack of generic automatic 3D expressive avatar annotation methods with detailed hand poses. 3) Due to the lack of a generic annotation method, there is also no unified, large-scale, multi-prompt 3D co-articulated, continuous sign language holistic motion dataset with precise hand mesh annotations. 4) Due to the above constraints, it is difficult to extend sign language applications to highly desired 3D settings such as 3D SLR and 3D SLP, which would enable many downstream applications like virtual simultaneous SL translators, 3D dictionaries, etc.
Following this problem chain, we review the state of the art from three aspects: the 3D holistic mesh annotation pipeline, 3D sign language motion datasets, and 3D SL applications.
3D holistic mesh annotation. There are many prior works on reconstructing the holistic human body from RGB images with parametric models such as SMPL-X [39] and Adam [23]. Among them, TalkSHOW [59] proposes a fitting pipeline
based on SMPLify-X [39] with a photometric loss for facial details. OSX [32] proposes a time-consuming, finetuning-based weak-supervision pipeline to generate pseudo 3D holistic annotations. However, such expressive parametric models have rarely been applied to the SL domain. [27] uses off-the-shelf methods to estimate holistic 3D meshes on the GSLL sign-language dataset [55]. In addition, only a concurrent work [14] reconstructs 3D holistic mesh annotations using linguistic priors, with group labels obtained from a sign classifier trained on the Corpus-based Dictionary of Polish Sign Language (CDPSL) [33], which is annotated with HamNoSys. It therefore relies on an existing sentence segmentation method [42] to generalize to multi-sign videos. These methods cannot deal with challenging self-occlusions and hand-hand and hand-body interactions, which makes them insufficient for complex interacting-hand scenarios such as sign language. There is not yet a generic annotation pipeline that can handle complex interacting-hand cases in continuous and co-articulated SL videos.
Sign language datasets. While there are many well-organized continuous SL motion datasets [1, 2, 7, 9, 17, 20] with 2D videos or 2D keypoint annotations, the only existing 3D SL motion dataset with 3D holistic mesh annotations is [14], which is purely isolated-sign based and therefore insufficient for real-world applications in natural-language scenarios. There is not yet a unified, large-scale, multi-prompt 3D SL holistic motion dataset with continuous and co-articulated signs and precise hand mesh annotations.
SL applications. Regarding SL applications, especially sign language production (SLP), [4] generates 2D motion sequences from HamNoSys, while [47] and [49] generate 3D keypoint sequences from glosses. Apart from these, there are also early avatar approaches [5, 10, 11, 15, 64] with pre-defined protocols and characters; such hand-crafted avatar approaches often produce robotic and unnatural movements.
E.2 Licensing
Fig. 13: Our 3D holistic human mesh reconstruction methods on in-the-wild cases.
(Zoom in for a better view)
[Fig. 15 (qualitative comparison): inputs are German spoken-language prompts, e.g., "am montag mal sonne mal wolken mit nur wenigen schauern." ("On Monday, some sun, some clouds, with only a few showers."); rows show PT, Ours, and the ground-truth video.]