Linguistically Motivated Sign Language Segmentation
Abstract

Sign language segmentation is a crucial task in sign language processing systems. It enables downstream tasks such as sign recognition, transcription, and machine translation. In this work, we consider two kinds of segmentation: segmentation into individual signs and segmentation into phrases, larger units comprising several signs. We propose a novel approach to jointly model these two tasks.

Our method is motivated by linguistic cues observed in sign language corpora. We replace the predominant IO tagging scheme with BIO tagging to account for continuous signing. Given that prosody plays a significant role in phrase boundaries, we explore the use of optical flow features. We also provide an extensive analysis of hand shapes and 3D hand normalization.

We find that introducing BIO tagging is necessary to model sign boundaries. Explicitly encoding prosody by optical flow improves segmentation in shallow models, but its contribution is negligible in deeper models. Careful tuning of the decoding algorithm atop the models further improves the segmentation quality.

We demonstrate that our final models generalize to out-of-domain video content in a different signed language, even under a zero-shot setting. We observe that including optical flow and 3D hand normalization enhances the robustness of the model in this context.

Figure 1: Per-frame classification of a sign language utterance following a BIO tagging scheme. Each box represents a single frame of a video. We propose a joint model to segment signs (top) and phrases (bottom) at the same time. B=beginning, I=inside, O=outside. The figure illustrates continuous signing where signs often follow each other without an O frame between them.

1 Introduction

Signed languages are natural languages used by deaf and hard-of-hearing individuals to communicate through a combination of manual and non-manual elements (Sandler and Lillo-Martin, 2006). Like spoken languages, signed languages have their own distinctive grammar and vocabulary, which have evolved through natural processes of language development (Sandler, 2010).

Sign language transcription and translation systems rely on the accurate temporal segmentation of sign language videos into meaningful units such as signs (Santemiz et al., 2009; Renz et al., 2021a) or signing sequences corresponding to subtitle units1 (Bull et al., 2020b). However, sign language segmentation remains a challenging task due to the difficulties in defining meaningful units in signed languages (De Sisto et al., 2021). Our approach is the first to consider two kinds of units in one model. We simultaneously segment single signs and phrases (larger units) in a unified framework.

Previous work typically approached segmentation as a binary classification task (including segmentation tasks in audio signal processing and computer vision), where each frame/pixel is predicted to be either part of a segment or not. However, this approach neglects the intricate nuances of continuous signing, where segment boundaries are not strictly binary and often blend in reality. One sign or phrase can immediately follow another, transitioning smoothly, without a frame between them being distinctly outside (Figure 1 and §3.1).

We propose incorporating linguistically motivated cues to address these challenges and improve sign language segmentation. To cope with contin-

* Equal contribution authors.

1 Subtitles may not always correspond directly to sentences. They frequently split within a sentence and could be temporally offset from the corresponding signing segments.
Figure 2: The annotation of the first phrase in a video from the test set (dgskorpus_goe_02), in yellow, signing:
“Why do you smoke?” through the use of three signs: WHY (+mouthed), TO-SMOKE, and a gesture (+mouthed)
towards the other signer. At the top, our phrase segmentation model predicts a single phrase that initiates with a B
tag (in green) above the B-threshold (green dashed line), followed by an I (in light blue), and continues until falling
below a certain threshold. At the bottom, our sign segmentation model accurately segments the three signs.
Figure 3: Optical flow (rows: body, face, left, right) for a conversation between two signers (signer 1 top, signer 2 bottom). The x-axis is the progression across 30 seconds. The yellow marks the annotated phrase spans. (Source: Moryossef et al. (2020))
the model to extract the hand shape. We expand on this process and show examples in Appendix C.

4 Experimental Setup

In this section, we describe the experimental setup used to evaluate our linguistically motivated approach for sign language segmentation. This includes a description of the Public DGS Corpus dataset used in the study, the methodology employed to perform sign and phrase segmentation, and the evaluation metrics used to measure the performance of the proposed approach.

4.1 Dataset

The Public DGS Corpus (Prillwitz et al., 2008; Hanke et al., 2020) is a distinctive sign language dataset that includes both accurate sign-level annotation from continuous signing, and well-aligned phrase-level translation in spoken language.

The corpus comprises 404 documents / 714 videos4 with an average duration of 7.55 minutes, featuring either one signer or two signers, at 50 fps. Most of these videos feature gloss transcriptions and spoken language translations (German and English), except for the ones in the "Joke" category, which are not annotated and thus excluded from our model5. The translations are comprised of full spoken language paragraphs, sentences, or phrases (i.e., independent/main clauses).

Each gloss span is considered a gold sign segment, following a tight annotation scheme (Hanke et al., 2012). Phrase segments are identified by examining every translation, with the segment assumed to span from the start of its first sign to the end of its last sign, correcting imprecise annotation.

This corpus is enriched with full-body pose estimations from OpenPose (Cao et al., 2019; Schulder and Hanke, 2019) and Mediapipe Holistic (Grishchenko and Bazarevsky, 2020). We use the 3.0.0-uzh-document split from Zhang et al. (2023). After filtering the unannotated data, we are left with 296 documents / 583 videos for training, 6 / 12 for validation, and 9 / 17 for testing. The mean number of signs and phrases in a video from the training set is 613 and 111, respectively.

4 The number of videos is nearly double the number of documents because each document typically includes two signers, each of whom produces one video for segmentation.

5 We also exclude documents with missing annotations: id ∈ {1289910, 1245887, 1289868, 1246064, 1584617}.

4.2 Methodology

Our proposed approach for sign language segmentation is based on the following steps:

1. Pose Estimation Given a video, we first adjust it to 25 fps and estimate body poses using the MediaPipe Holistic pose estimation system. We do not use OpenPose because it lacks a Z-axis, which prevents the 3D rotation used for hand normalization. The shape of a pose is represented as (frames × keypoints × axes).

2. Pose Normalization To generalize over video resolution and distance from the camera, we normalize each of these poses such that the mean distance between the shoulders of each person equals 1, and the mid-point is at (0, 0) (Celebi et al., 2013). We also remove the legs since they are less relevant to signing. (A minimal code sketch of steps 2, 3, 5, and 6 is given after this list.)

3. Optical Flow We follow the equation in Moryossef et al. (2020, Equation 1).

4. 3D Hand Normalization We rotate and scale each hand to ensure that the same hand shape is represented in a consistent manner across different frames. We rotate the 21 XYZ keypoints of the hand so that the back of the hand lies on the XY plane, we then rotate the hand so that the metacarpal bone of the middle finger lies on the Y-axis, and finally, we scale the hand such that the bone is of constant length. Visualizations are presented in Appendix C.

5. Sequence Encoder For every frame, the pose is first flattened and projected into a standard dimension (256), then fed through an LSTM encoder (Hochreiter and Schmidhuber, 1997).

6. BIO Tagging On top of the encoder, we place two BIO classification heads for sign and phrase independently. B denotes the beginning of a sign or phrase, I denotes the middle of a sign or phrase, and O denotes being outside a sign or phrase. Our cross-entropy loss is proportionally weighted in favor of B, as it is a rare label (B:I:O is about 1:5:18 for signs and 1:58:77 for phrases) compared to I and O.
7. Greedy Segment Decoding To decode the frame-level BIO predictions into sign/phrase segments, we define a segment to start with the first frame possessing a B probability greater than a predetermined threshold (defaulted at 0.5). The segment concludes with the first frame among the subsequent frames having either a B or O probability exceeding the threshold. We provide the pseudocode of the decoding algorithm in Appendix B.
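To make the pipeline concrete, below is a minimal PyTorch sketch of steps 2, 3, 5, and 6 (pose normalization, optical flow, the sequence encoder, and the two BIO heads with a class-weighted loss). The feature sizes and class weights are illustrative assumptions rather than the exact configuration used here, and the flow function only follows the spirit of Moryossef et al. (2020, Equation 1); step 4 is illustrated separately in Appendix C.

```python
import torch
import torch.nn as nn

L_SHOULDER, R_SHOULDER = 11, 12  # MediaPipe pose landmark indices of the shoulders

def normalize_pose(pose: torch.Tensor) -> torch.Tensor:
    """pose: (frames, keypoints, axes). Scale so that the mean shoulder distance
    equals 1 and shift so that the shoulder mid-point sits at the origin."""
    shoulders = pose[:, [L_SHOULDER, R_SHOULDER], :]
    scale = (shoulders[:, 0] - shoulders[:, 1]).norm(dim=-1).mean().clamp(min=1e-6)
    center = shoulders.mean(dim=1).mean(dim=0)  # (axes,)
    return (pose - center) / scale

def optical_flow(pose: torch.Tensor, fps: float = 25.0) -> torch.Tensor:
    """Per-keypoint flow: norm of the displacement between consecutive frames,
    scaled by the frame rate (in the spirit of Moryossef et al., 2020, Eq. 1)."""
    flow = (pose[1:] - pose[:-1]).norm(dim=-1) * fps  # (frames - 1, keypoints)
    return torch.cat([torch.zeros_like(flow[:1]), flow])  # zero-pad the first frame

class JointSegmentationModel(nn.Module):
    def __init__(self, in_dim: int, hidden: int = 256, depth: int = 1):
        super().__init__()
        self.proj = nn.Linear(in_dim, 256)           # flatten-and-project step
        self.encoder = nn.LSTM(256, hidden, num_layers=depth,
                               batch_first=True, bidirectional=True)
        self.sign_head = nn.Linear(2 * hidden, 3)    # B, I, O for signs
        self.phrase_head = nn.Linear(2 * hidden, 3)  # B, I, O for phrases

    def forward(self, features: torch.Tensor):
        """features: (batch, frames, in_dim) flattened pose (+ flow) features."""
        encoded, _ = self.encoder(self.proj(features))
        return self.sign_head(encoded), self.phrase_head(encoded)

# Illustrative inverse-frequency weighting favoring the rare B tag
# (B:I:O is about 1:5:18 for signs); the exact weighting scheme may differ.
sign_criterion = nn.CrossEntropyLoss(weight=torch.tensor([18.0, 3.6, 1.0]))
```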
4.3 Experiments

Our approach is evaluated through a series of six sets of experiments. Each set is repeated three times with varying random seeds. Preliminary experiments were conducted to inform the selection of hyperparameters and features, the details of which can be found in Table 3 in Appendix A. Model selection is based on validation metrics.
4. E3: Adding Optical Flow At every time step t, we append the optical flow between t and t−1 to the current pose frame as an additional dimension after the XYZ axes.

5. E4: Adding 3D Hand Normalization At every time step, we normalize the hand poses and concatenate them to the current pose.

6. E5: Autoregressive Encoder We replace the encoder with the one proposed by Jiang et al. (2023) for the detection and classification of great ape calls from raw audio signals. Specifically, we add autoregressive connections between time steps to encourage consistent output labels. The logits at time step t are concatenated to the input of the next time step, t + 1. This modification is implemented bidirectionally by stacking two autoregressive encoders and summing their outputs before the Softmax operation (a simplified sketch follows). However, this approach is inherently slow, as we have to fully wait for the previous time step predictions before we can feed them to the next time step.
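For concreteness, the autoregressive connection can be sketched as follows; this is a simplified reconstruction of the idea (previous logits concatenated to the next input, run in both directions, logits summed before the softmax), not the exact implementation of Jiang et al. (2023):

```python
import torch
import torch.nn as nn

class AutoregressiveTagger(nn.Module):
    """Bidirectional autoregressive encoder sketch: each direction feeds its
    previous logits back as extra input features for the next time step."""

    def __init__(self, in_dim: int, hidden: int = 256, num_classes: int = 3):
        super().__init__()
        self.num_classes = num_classes
        self.fwd, self.fwd_out = nn.LSTMCell(in_dim + num_classes, hidden), nn.Linear(hidden, num_classes)
        self.bwd, self.bwd_out = nn.LSTMCell(in_dim + num_classes, hidden), nn.Linear(hidden, num_classes)

    def _run(self, cell, out, steps):
        batch = steps[0].size(0)
        h = steps[0].new_zeros(batch, cell.hidden_size)
        c = steps[0].new_zeros(batch, cell.hidden_size)
        prev = steps[0].new_zeros(batch, self.num_classes)
        logits = []
        for x in steps:  # strictly sequential: this is why the approach is slow
            h, c = cell(torch.cat([x, prev], dim=-1), (h, c))
            prev = out(h)
            logits.append(prev)
        return logits

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: (batch, frames, in_dim) -> logits of shape (batch, frames, num_classes)."""
        steps = list(x.unbind(dim=1))
        fwd = self._run(self.fwd, self.fwd_out, steps)
        bwd = self._run(self.bwd, self.bwd_out, steps[::-1])[::-1]
        return torch.stack([f + b for f, b in zip(fwd, bwd)], dim=1)
```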
• Percentage of Segments (%) To complement IoU, we introduce the percentage of segments to compare the number of segments predicted by the model with the ground truth annotations. It is computed as #predicted segments / #ground truth segments. The optimal value is 1.

• Efficiency We measure the efficiency of each model by the number of parameters and the training time of the model on a Tesla V100-SXM2-32GB GPU for 100 epochs9.

7 The initial implementation uses OpenPose (Cao et al., 2019), at 50 fps. Preliminary experiments reveal that these differences do not significantly influence the results.

8 We reduce the dense FACE_LANDMARKS in Mediapipe Holistic to the contour keypoints according to the variable mediapipe.solutions.holistic.FACEMESH_CONTOURS.

9 Exceptionally, the autoregressive models in E5 were trained on an NVIDIA A100-SXM4-80GB GPU, which doubles the training speed of the V100; still, the training is slow.
Sign Phrase Efficiency
Experiment F1 IoU % F1 IoU % #Params Time
E0 Moryossef et al. (2020) — 0.46 1.09 — 0.70 1.00 102K 0:50:17
E1 Baseline 0.56 0.66 0.91 0.59 0.80 2.50 454K 1:01:50
E2 E1 + Face 0.53 0.58 0.64 0.57 0.76 1.87 552K 1:50:31
E3 E1 + Optical Flow 0.58 0.62 1.12 0.60 0.82 3.19 473K 1:20:17
E4 E3 + Hand Norm 0.56 0.61 1.07 0.60 0.80 3.24 516K 1:30:59
E1s E1 + Depth=4 0.63 0.69 1.11 0.65 0.82 1.63 1.6M 4:08:48
E2s E2 + Depth=4 0.62 0.69 1.07 0.63 0.84 2.68 1.7M 3:14:03
E3s E3 + Depth=4 0.60 0.63 1.13 0.64 0.80 1.53 1.7M 4:08:30
E4s E4 + Depth=4 0.59 0.63 1.13 0.62 0.79 1.43 1.7M 4:35:29
E1s* E1s + Tuned Decoding — 0.69 1.03 — 0.85 1.02 — —
E4s* E4s + Tuned Decoding — 0.63 1.06 — 0.79 1.12 — —
E5 E4s + Autoregressive 0.45 0.47 0.88 0.52 0.63 2.72 1.3M ~3 days
Table 1: Mean test evaluation metrics for our experiments. The best score of each column is in bold and a star (*)
denotes further optimization of the decoding algorithm without changing the model (only affects IoU and %). Table
4 in Appendix A contains a complete report including validation metrics and standard deviation of all experiments.
pose estimation system. The evaluation metrics we propose in Appendix C could help identify better pose estimation models for this use case.
5.2 Tuning the Segment Decoding Algorithm

We selected E1s and E4s to further explore the segment decoding algorithm. As detailed in §4.2 and Appendix B, the decoding algorithm has two tunable parameters, threshold_b and threshold_o. We conducted a grid search with these parameters, using values from 10 to 90 in increments of 10 (a sketch of this search is given at the end of this subsection). We additionally experimented with a variation of the algorithm that conditions on the most likely class by argmax instead of fixed threshold values, which turned out similar to the default version.

We only measured the results using IoU and the percentage of segments at validation time since the F1 scores remain consistent in this case. For sign segmentation, we found that using threshold_b = 60 and threshold_o = 40/50/60 yields slightly better results than the default setting (50 for both). For phrase segmentation, we identified that higher threshold values (threshold_b = 90, threshold_o = 90 for E1s and threshold_b = 80, threshold_o = 80/90 for E4s) improve on the default significantly, especially on the percentage metric. We report the test results under E1s* and E4s*, respectively.

Despite formulating a single model, we underline a separate sign/phrase model selection process to achieve the best segmentation results. Figure 6 illustrates how higher threshold values reduce the number of predicted segments and skew the distribution of predicted phrase segments towards longer ones in E1s/E1s*. As Bull et al. (2020b) suggest, advanced priors could also be integrated into the decoding algorithm.
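The search itself is a simple nested loop; the sketch below illustrates it with the hypothetical helpers decode_segments (a version is sketched in Appendix B) and segment_iou standing in for our decoding and evaluation code, and with one possible selection criterion:

```python
def tune_thresholds(probs, gold_segments):
    """Grid search over threshold_b and threshold_o (values 10-90, i.e. 0.1-0.9)."""
    best, best_score = None, None
    for threshold_b in range(10, 100, 10):
        for threshold_o in range(10, 100, 10):
            segments = decode_segments(probs, threshold_b / 100, threshold_o / 100)
            iou = segment_iou(segments, gold_segments)        # hypothetical helper
            pct = len(segments) / max(len(gold_segments), 1)  # percentage of segments
            score = (iou, -abs(1 - pct))  # prefer high IoU, tie-break on segment count
            if best_score is None or score > best_score:
                best, best_score = (threshold_b, threshold_o), score
    return best
```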
Figure 6: Probability density of phrase segment lengths.
5.3 Comparison to Previous Work

We re-implemented and re-purposed the sign language detection model introduced in Moryossef et al. (2020) for our segmentation task as a baseline, since their work is the state of the art and the only comparable model designed for the Public DGS Corpus dataset. As a result, we show the necessity of replacing IO tagging with BIO tagging to tackle the subtle differences between the two tasks.

For phrase segmentation, we compare to Bull et al. (2020b). We note that our definition of sign language phrases (spanning from the start of its first sign to the end of its last sign) is tighter than the subtitle units used in their paper, and that we use different training datasets of different languages and domains. Nevertheless, we implemented some of their frame-level metrics and show the results in Table 2 on both the Public DGS Corpus and the MEDIAPI-SKEL dataset (Bull et al., 2020a) in French Sign Language (LSF). We report both zero-shot out-of-domain results11 and the results of our models trained specifically on their dataset without the spatio-temporal graph convolutional network (ST-GCN) (Yan et al., 2018) used in their work for pose encoding.

Data  Model                   ROC-AUC  F1-M
LSF   full (theirs)           0.87     —
LSF   body (theirs)           0.87     —
LSF   E1s (ours, zero-shot)   0.71     0.41
LSF   E4s (ours, zero-shot)   0.76     0.44
LSF   E1s (ours, trained)     0.87     0.49
LSF   E4s (ours, trained)     0.87     0.51
DGS   E1s (ours)              0.91     0.65
DGS   E4s (ours)              0.90     0.62

Table 2: Evaluation metrics used in Bull et al. (2020b). ROC-AUC is applied exclusively on the O-tag. For comparison, F1-M denotes the macro-averaged per-class F1 used in this work across all tags. The first two rows are the best results taken from Table 1 in their paper. The next four rows represent how our models perform on their data in a zero-shot setting and in a supervised setting, and the last two rows represent how our models perform on our data.

11 The zero-shot results are not directly comparable to theirs due to different datasets and labeling approaches.
For sign segmentation, we do not compare to Renz et al. (2021a,b) due to different datasets and the difficulty in reproducing their segment-level evaluation metrics. The latter depends on the decoding algorithm and a way to match the gold and predicted segments, both of which are variable.

6 Conclusions

This work focuses on the automatic segmentation of signed languages. We are the first to formulate the segmentation of individual signs and larger sign phrases as a joint problem.

We propose a series of improvements over previous work, linguistically motivated by careful analyses of sign language corpora. Recognizing that sign language utterances are typically continuous with minimal pauses, we opted for a BIO tagging scheme over IO tagging. Furthermore, leveraging the fact that phrase boundaries are marked by prosodic cues, we introduce optical flow features as a proxy for prosodic processes. Finally, since signs typically employ a limited number of hand shapes, we attempt 3D hand normalization to make it easier for the model to understand hand shapes.

Our experiments conducted on the Public DGS Corpus confirmed the efficacy of these modifications for segmentation quality. By comparing to previous work in a zero-shot setting, we demonstrate that our models generalize across signed languages and domains, and that including linguistically motivated cues leads to a more robust model in this context.

Finally, we envision that the proposed model has applications in real-world data collection for signed languages. Furthermore, a similar segmentation approach could be leveraged in various other fields such as co-speech gesture recognition (Moryossef, 2023) and action segmentation (Tang et al., 2019).

Limitations

Pose Estimation

In this work, we employ the MediaPipe Holistic pose estimation system (Grishchenko and Bazarevsky, 2020). There is a possibility that this system exhibits bias towards certain protected classes (such as gender or race), underperforming in instances with specific skin tones or lower video quality. Thus, we cannot attest to how our system would perform under real-world conditions, given that the videos utilized in our research are generated in a controlled studio environment, primarily featuring white participants.

Encoding of Long Sequences

In this study, we encode sequences of frames that are significantly longer than the typical 512 frames often seen in models employing Transformers (Vaswani et al., 2017). Numerous techniques, ranging from basic temporal pooling/downsampling to more advanced methods such as a video/pose encoder that aggregates local frames into higher-level 'tokens' (Renz et al., 2021a), graph convolutional networks (Bull et al., 2020b), and self-supervised representations (Baevski et al., 2020), can alleviate length constraints, facilitate the use of Transformers, and potentially improve the outcomes. Moreover, a hierarchical method like the Swin Transformer (Liu et al., 2021) could be applicable.

Limitations of Autoregressive LSTMs

In this paper, we replicated the autoregressive LSTM implementation originally proposed by Jiang et al. (2023). Our experiments revealed that this implementation exhibits significant slowness, which prevented us from performing further experimentation. In contrast, other LSTM implementations employed in this project have undergone extensive optimization (Appleyard, 2016), including techniques like combining general matrix multiplication operations (GEMMs), parallelizing independent operations, fusing kernels, rearranging matrices, and implementing various optimizations for models with multiple layers (which are not necessarily applicable here).

A comparison of CPU-based performance demonstrates that our implementation is 6.4 times slower. Theoretically, the number of operations performed by the autoregressive LSTM is equivalent to that of a regular LSTM. However, while the normal LSTM benefits from concurrency based on the number of layers, we do not have that luxury. The optimization of recurrent neural networks (RNNs) (Que et al., 2020, 2021, 2022) remains an ongoing area of research. If proven effective in other domains, we strongly advocate for efforts to optimize the performance of this type of network.

Interference Between Sign and Phrase Models

In our model, we share the encoder for both the sign and phrase segmentation models, with a shallow linear layer for the BIO tag prediction associated with each task. It remains uncertain whether these two tasks interfere with or enhance each other. An
ablation study (not presented in this work) involving separate modeling is necessary to obtain greater insight into this matter.

Noisy Training Objective

Although the annotations utilized in this study are of expert level, the determination of precise sign (Hanke et al., 2012) and phrase boundaries remains a challenging task, even for experts. Training the model on these annotated boundaries might introduce excessive noise. A similar issue was observed in classification-based pose estimation (Cao et al., 2019). The task of annotating the exact anatomical centers of joints proves to be nearly impossible, leading to a high degree of noise when predicting joint position as a 1-hot classification task. The solution proposed in this previous work was to distribute a Gaussian around the annotated location of each joint. This approach allows the joint's center to overlap with some probability mass, thereby reducing the noise for the model. A similar solution could be applied in our context. Instead of predicting a strict 0 or 1 class probability, we could distribute a Gaussian around the boundary.
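A possible realization of this idea, with an assumed Gaussian width, is to replace the hard per-frame boundary target with a soft target that peaks at the annotated boundary frames and can be trained with a soft cross-entropy or binary cross-entropy loss:

```python
import numpy as np

def soft_boundary_targets(num_frames: int, boundaries: list[int], sigma: float = 2.0) -> np.ndarray:
    """Per-frame targets in [0, 1]: a Gaussian bump around each annotated boundary."""
    frames = np.arange(num_frames, dtype=np.float32)
    target = np.zeros(num_frames, dtype=np.float32)
    for b in boundaries:
        target = np.maximum(target, np.exp(-0.5 * ((frames - b) / sigma) ** 2))
    return target

# Example: boundaries annotated at frames 10 and 30 of a 50-frame clip.
# soft_boundary_targets(50, [10, 30]) peaks at 1.0 around frames 10 and 30.
```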
Naive Segment Decoding

We recognize that the frame-level greedy decoding strategy implemented in our study may not be optimal. Previous research in audio segmentation (Venkatesh et al., 2022) employed a You Only Look Once (YOLO; Redmon et al. (2015)) decoding scheme to predict segment boundaries and classes. We propose using a similar prediction atop a given representation, such as the LSTM output or classification logits of an already trained network. Differing from traditional object detection tasks, this process is simplified due to the absence of a Y axis and non-overlapping segments. In this scenario, the network predicts the segment boundaries using regression, thereby avoiding the class imbalance issue of the BIO tagging. We anticipate this to yield more accurate sign language segmentation.

Lack of Transcription

Speech segmentation is a close task to our sign language segmentation task on videos. In addition to relying on prosodic cues from audio, the former could benefit from automatic speech transcription systems, either in terms of surrogating the task to text-level segmentation and punctuation (Cho et al., 2015), or gaining additional training data from automatic speech recognition / spoken language translation (Tsiamas et al., 2022).

However, for signed languages, there is neither a standardized and widely used written form nor a reliable transcription procedure into some potential writing systems like SignWriting (Sutton, 1990), HamNoSys (Prillwitz and Zienert, 1990), and glosses (Johnston, 2008). Transcription/recognition and segmentation tasks need to be solved simultaneously, so we envision that a multi-task setting helps. Sign spotting, the localization of a specific sign in continuous signing, is a simplification of the segmentation and recognition problem in a closed-vocabulary setting (Wong et al., 2022; Varol et al., 2022). It can be used to find candidate boundaries for some signs, but not all.

Acknowledgements

This work was funded by the EU Horizon 2020 project EASIER (grant agreement no. 101016982), the Swiss Innovation Agency (Innosuisse) flagship IICT (PFFS-21-47), and the EU Horizon 2020 project iEXTRACT (grant agreement no. 802774). We also thank Rico Sennrich and Chantal Amrhein for their suggestions.

References

Jeremy Appleyard. 2016. Optimizing recurrent neural networks in cuDNN 5. Accessed: 2023-06-09.

Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in neural information processing systems, 33:12449–12460.

Mark Borg and Kenneth P Camilleri. 2019. Sign language detection "in the wild" with recurrent neural networks. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1637–1641. IEEE.

Hannah Bull, Triantafyllos Afouras, Gül Varol, Samuel Albanie, Liliane Momeni, and Andrew Zisserman. 2021. Aligning subtitles in sign language videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11552–11561.

Hannah Bull, Annelies Braffort, and Michèle Gouiffès. 2020a. MEDIAPI-SKEL - a 2D-skeleton video database of French Sign Language with aligned French subtitles. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 6063–6068, Marseille, France. European Language Resources Association.
Hannah Bull, Michèle Gouiffès, and Annelies Braffort. 2020b. Automatic segmentation of sign language into subtitle-units. In European Conference on Computer Vision, pages 186–198. Springer.

Z. Cao, G. Hidalgo Martinez, T. Simon, S. Wei, and Y. A. Sheikh. 2019. OpenPose: Realtime multi-person 2D pose estimation using part affinity fields. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Sait Celebi, Ali Selman Aydin, Talha Tarik Temiz, and Tarik Arici. 2013. Gesture recognition using skeleton data with weighted dynamic time warping. In International Conference on Computer Vision Theory and Applications.

Eunah Cho, Jan Niehues, Kevin Kilgour, and Alex Waibel. 2015. Punctuation insertion for real-time spoken language translation. In Proceedings of the 12th International Workshop on Spoken Language Translation: Papers, pages 173–179.

Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar. Association for Computational Linguistics.

Mathieu De Coster, Ellen Rushe, Ruth Holmes, Anthony Ventresque, and Joni Dambre. 2023. Towards the extraction of robust sign embeddings for low resource sign language recognition. arXiv preprint arXiv:2306.17558.

Mirella De Sisto, Dimitar Shterionov, Irene Murtagh, Myriam Vermeerbergen, and Lorraine Leeson. 2021. Defining meaningful units. Challenges in sign segmentation and segment-meaning mapping (short paper). In Proceedings of the 1st International Workshop on Automatic Translation for Signed and Spoken Languages (AT4SSL), pages 98–103, Virtual. Association for Machine Translation in the Americas.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Iva Farag and Heike Brock. 2019. Learning motion disfluencies for automatic sign language segmentation. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7360–7364. IEEE.

Adam Frost and Valerie Sutton. 2022. SignWriting Hand Symbols. SignWriting Press.

Ivan Grishchenko and Valentin Bazarevsky. 2020. Mediapipe holistic.

Thomas Hanke, Silke Matthes, Anja Regen, and Satu Worseck. 2012. Where does a sign start and end? Segmentation of continuous signing. In Language Resources and Evaluation Conference.

Thomas Hanke, Marc Schulder, Reiner Konrad, and Elena Jahn. 2020. Extending the Public DGS Corpus in size and depth. In Proceedings of the LREC2020 9th Workshop on the Representation and Processing of Sign Languages: Sign Language Resources in the Service of the Language Community, Technological Challenges and Application Perspectives, pages 75–82, Marseille, France. European Language Resources Association (ELRA).

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.

Zifan Jiang, Adrian Soldati, Isaac Schamberg, Adriano R Lameira, and Steven Moran. 2023. Automatic sound event detection and classification of great ape calls using neural networks. arXiv preprint arXiv:2301.02214.

Trevor Alexander Johnston. 2008. From archive to corpus: transcription and annotation in the creation of signed language corpora. In Pacific Asia Conference on Language, Information and Computation.

Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022.

Mark Alan Mandel. 1981. Phonotactics and morphophonology in American Sign Language. University of California, Berkeley.

Amit Moryossef. 2023. Addressing the blind spots in spoken language processing. arXiv preprint arXiv:2309.06572.

Amit Moryossef, Ioannis Tsochantaridis, Roee Aharoni, Sarah Ebling, and Srini Narayanan. 2020. Real-time sign-language detection using human pose estimation. In Computer Vision–ECCV 2020 Workshops: Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, SLRTP 2020: The Sign Language Recognition, Translation and Production Workshop, pages 237–248.

Carol Neidle, Ashwin Thangali, and Stan Sclaroff. 2012. Challenges in development of the American Sign Language lexicon video dataset (ASSLVD) corpus. In 5th Workshop on the Representation and Processing of Sign Languages: Interactions between Corpus and Lexicon, LREC. Citeseer.

Ellen Ormel and Onno Crasborn. 2012. Prosodic correlates of sentences in signed languages: A literature review and suggestions for new types of studies. Sign Language Studies, 12(2):279–315.

Abhilash Pal, Stephan Huber, Cyrine Chaabani, Alessandro Manzotti, and Oscar Koller. 2023. On the importance of signer overlap for sign language detection. arXiv preprint arXiv:2303.10782.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc.

Siegmund Prillwitz, Thomas Hanke, Susanne König, Reiner Konrad, Gabriele Langer, and Arvid Schwarz. 2008. DGS corpus project – development of a corpus based electronic dictionary German Sign Language / German. In sign-lang at LREC 2008, pages 159–164. European Language Resources Association (ELRA).

Siegmund Prillwitz and Heiko Zienert. 1990. Hamburg notation system for sign language: Development of a sign writing with computer application. In Current Trends in European Sign Language Research. Proceedings of the 3rd European Congress on Sign Language Research, pages 355–379.

Zhiqiang Que, Hiroki Nakahara, Hongxiang Fan, He Li, Jiuxi Meng, Kuen Hung Tsoi, Xinyu Niu, Eriko Nurvitadhi, and Wayne W. C. Luk. 2022. Remarn: A reconfigurable multi-threaded multi-core accelerator for recurrent neural networks. ACM Transactions on Reconfigurable Technology and Systems, 16:1–26.

Zhiqiang Que, Hiroki Nakahara, Eriko Nurvitadhi, Hongxiang Fan, Chenglong Zeng, Jiuxi Meng, Xinyu Niu, and Wayne W. C. Luk. 2020. Optimizing reconfigurable recurrent neural networks. 2020 IEEE 28th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), pages 10–18.

Zhiqiang Que, Erwei Wang, Umar Marikar, Eric Moreno, Jennifer Ngadiuba, Hamza Javed, Bartłomiej Borzyszkowski, Thea Klaeboe Aarrestad, Vladimir Loncar, Sioni Paris Summers, Maurizio Pierini, Peter Y. K. Cheung, and Wayne W. C. Luk. 2021. Accelerating recurrent neural networks for gravitational wave experiments. 2021 IEEE 32nd International Conference on Application-specific Systems, Architectures and Processors (ASAP), pages 117–124.

Javier Ramírez, José C Segura, Carmen Benítez, Angel De La Torre, and Antonio Rubio. 2004. Efficient voice activity detection algorithms using long-term speech information. Speech communication, 42(3-4):271–287.

Lance Ramshaw and Mitch Marcus. 1995. Text chunking using transformation-based learning. In Third Workshop on Very Large Corpora.

Joseph Redmon, Santosh Kumar Divvala, Ross B. Girshick, and Ali Farhadi. 2015. You only look once: Unified, real-time object detection. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 779–788.

Katrin Renz, Nicolaj C Stache, Samuel Albanie, and Gül Varol. 2021a. Sign language segmentation with temporal convolutional networks. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2135–2139. IEEE.

Katrin Renz, Nicolaj C Stache, Neil Fox, Gül Varol, and Samuel Albanie. 2021b. Sign segmentation with changepoint-modulated pseudo-labelling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3403–3412.

Wendy Sandler. 2010. Prosody and syntax in sign languages. Transactions of the philological society, 108(3):298–328.

Wendy Sandler and Diane Lillo-Martin. 2006. Sign language and linguistic universals. Cambridge University Press.

Wendy Sandler, Peter Macneilage, Barbara Davis, Kristine Zajdo, New York, and Taylor Francis. 2008. The syllable in sign language: Considering the other natural language modality.

Pinar Santemiz, Oya Aran, Murat Saraclar, and Lale Akarun. 2009. Automatic sign segmentation from continuous signing via multiple sequence alignment. In 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops, pages 2001–2008. IEEE.

Marc Schulder and Thomas Hanke. 2019. OpenPose in the Public DGS Corpus. Project Note AP06-2019-01, DGS-Korpus project, IDGS, Hamburg University, Hamburg, Germany.

Zed Sevcikova Sehyr, Naomi Caselli, Ariel M Cohen-Goldberg, and Karen Emmorey. 2021. The ASL-LEX 2.0 project: A database of lexical and phonological properties for 2,723 signs in American Sign Language. The Journal of Deaf Studies and Deaf Education, 26(2):263–277.

Karen Simonyan and Andrew Zisserman. 2015. Very deep convolutional networks for large-scale image recognition. CoRR.

Jongseo Sohn, Nam Soo Kim, and Wonyong Sung. 1999. A statistical model-based voice activity detection. IEEE signal processing letters, 6(1):1–3.

William C Stokoe Jr. 1960. Sign Language Structure: An Outline of the Visual Communication Systems of the American Deaf. The Journal of Deaf Studies and Deaf Education, 10(1):3–37.

Valerie Sutton. 1990. Lessons in sign writing. SignWriting.

Yansong Tang, Dajun Ding, Yongming Rao, Yu Zheng, Danyang Zhang, Lili Zhao, Jiwen Lu, and Jie Zhou. 2019. COIN: A large-scale dataset for comprehensive instructional video analysis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Ioannis Tsiamas, Gerard I. Gállego, José A. R. Fonollosa, and Marta R. Costa-jussà. 2022. SHAS: Approaching optimal segmentation for end-to-end speech translation. In Proc. Interspeech 2022, pages 106–110.

Gül Varol, Liliane Momeni, Samuel Albanie, Triantafyllos Afouras, and Andrew Zisserman. 2022. Scaling up sign spotting through sign language dictionaries. International Journal of Computer Vision, 130:1416–1439.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems, 30.

Satvik Venkatesh, David Moffat, and Eduardo Reck Miranda. 2022. You only hear once: a YOLO-like algorithm for audio segmentation and sound event detection. Applied Sciences, 12(7):3293.

Ryan Wong, Necati Cihan Camgoz, and R. Bowden. 2022. Hierarchical I3D for sign spotting. In ECCV Workshops.

Sijie Yan, Yuanjun Xiong, and Dahua Lin. 2018. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of the AAAI conference on artificial intelligence, volume 32.

Ting Yu, Haoteng Yin, and Zhanxing Zhu. 2017. Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting. In International Joint Conference on Artificial Intelligence.

Biao Zhang, Mathias Müller, and Rico Sennrich. 2023. SLTUNET: A simple unified model for sign language translation. In The Eleventh International Conference on Learning Representations, Kigali, Rwanda.
A Extended Experimental Results
We conducted some preliminary experiments (starting with P0) on training a sign language segmentation model to gain insights into hyperparameters and feature choices. The results are shown in Table 3 (see footnote 12 below the table). We found the optimal hyperparameters in P1.3.2 and repeated them with different feature choices.
Sign Phrase
Experiment F1 IoU % F1 IoU %
P0 Moryossef et al. (2020) test — 0.4 1.45 — 0.65 0.82
dev — 0.35 1.36 — 0.6 0.77
P0.1 P0 + Holistic 25fps test — 0.39 0.86 — 0.64 0.5
dev — 0.32 0.81 — 0.58 0.52
P1 P1 baseline test 0.55 0.49 0.83 0.6 0.67 2.63
dev 0.56 0.43 0.75 0.58 0.62 2.61
P1.1 P1 - encoder_bidirectional test 0.48 0.45 0.68 0.5 0.64 2.68
dev 0.46 0.41 0.64 0.51 0.61 2.56
P1.2.1 P1 + hidden_dim=512 test 0.47 0.42 0.44 0.52 0.63 1.7
dev 0.46 0.4 0.43 0.52 0.61 1.69
P1.2.2 P1 + hidden_dim=1024 test 0.48 0.45 0.42 0.58 0.65 1.53
dev 0.46 0.41 0.36 0.53 0.61 1.49
P1.3.1 P1 + encoder_depth=2 test 0.55 0.48 0.76 0.58 0.67 2.56
dev 0.56 0.43 0.69 0.58 0.62 2.52
P1.3.2 P1 + encoder_depth=4 test 0.63 0.51 0.91 0.66 0.67 1.41
dev 0.61 0.47 0.84 0.64 0.6 1.39
P1.4.1 P1 + hidden_dim=128 + encoder_depth=2 test 0.58 0.48 0.8 0.6 0.67 2.0
dev 0.55 0.43 0.75 0.54 0.62 2.03
P1.4.2 P1 + hidden_dim=128 + encoder_depth=4 test 0.62 0.51 0.91 0.64 0.68 2.43
dev 0.6 0.47 0.83 0.6 0.62 2.57
P1.4.3 P1 + hidden_dim=128 + encoder_depth=8 test 0.59 0.52 0.91 0.63 0.68 3.04
dev 0.6 0.47 0.84 0.6 0.62 3.02
P1.5.1 P1 + hidden_dim=64 + encoder_depth=4 test 0.57 0.5 0.8 0.6 0.68 2.41
dev 0.58 0.45 0.75 0.59 0.62 2.39
P1.5.2 P1 + hidden_dim=64 + encoder_depth=8 test 0.62 0.51 0.85 0.64 0.68 2.53
dev 0.6 0.46 0.79 0.6 0.62 2.53
P2 P1 + optical_flow test 0.58 0.5 0.95 0.63 0.68 3.17
dev 0.59 0.45 0.84 0.59 0.61 3.08
P2.1 P1.3.2 + optical_flow test 0.63 0.51 0.92 0.66 0.67 1.51
dev 0.62 0.46 0.81 0.62 0.6 1.53
P3 P1 + hand_normalization test 0.55 0.48 0.77 0.58 0.67 2.79
dev 0.55 0.42 0.71 0.57 0.62 2.73
P3.1 P1.3.2 + hand_normalization test 0.63 0.51 0.91 0.66 0.66 1.43
dev 0.61 0.46 0.82 0.64 0.61 1.46
P4 P2.1 + P3.1 test 0.56 0.51 0.92 0.61 0.66 1.45
dev 0.61 0.46 0.81 0.63 0.6 1.41
P4.1 P4 + encoder_depth=8 test 0.6 0.51 0.95 0.62 0.67 1.08
dev 0.61 0.47 0.86 0.62 0.6 1.12
P5 P1.3.2 + reduced_face test 0.63 0.51 0.94 0.64 0.66 1.16
dev 0.61 0.47 0.86 0.64 0.58 1.14
P5.1 P1.3.2 + full_face test 0.54 0.49 0.8 0.6 0.68 2.29
dev 0.57 0.45 0.7 0.59 0.62 2.29
12 Note that due to an implementation issue on edge cases (which we fixed later), the IoU and % values in Table 3 are lower than those in Table 1 and Table 4, and thus not comparable across tables. The comparison between different experiments within Table 3 remains meaningful. In addition, the results in Table 3 are based on only one run instead of three random runs.
We selected some promising models from our preliminary experiments and reran them three times using
different random seeds to make the final conclusion reliable and robust. Table 4 includes the standard
deviation and the validation results (where we performed the model selection) for readers to scrutinize.
Table 4: Mean evaluation metrics for our main experiments. A complete version of Table 1.
B Greedy Decoding Algorithm
We provide our exact decoding algorithm in Algorithm 1. We opt to employ adjustable thresholds rather
than argmax prediction, as our empirical findings demonstrate superior performance with this approach
(§5.2).
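Algorithm 1 itself is not reproduced here; the following Python sketch is consistent with the description in §4.2 and exposes the two thresholds tuned in §5.2, though the exact edge-case handling of Algorithm 1 may differ:

```python
def decode_segments(probs, threshold_b=0.5, threshold_o=0.5):
    """probs: per-frame (p_b, p_i, p_o) probabilities. Returns [(start, end), ...]."""
    segments, start = [], None
    for t, (p_b, p_i, p_o) in enumerate(probs):
        if start is None:
            if p_b > threshold_b:
                start = t                             # open a segment on a confident B
        elif p_b > threshold_b or p_o > threshold_o:
            segments.append((start, t))               # close on the next confident B or O
            start = t if p_b > threshold_b else None  # a B frame immediately opens the next segment
    if start is not None:
        segments.append((start, len(probs)))          # close a segment still open at the end
    return segments
```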
Figure 7: Anatomy of a human hand. ©American Society for Surgery of the Hand
13 https://aslfont.github.io/Symbol-Font-For-ASL/asl/handshapes.html
14 https://bsl.surrey.ac.uk/principles/i-hand-shapes
Understanding the different characteristics of hands and their implications in signed languages is crucial
for the extraction and classification of hand shapes. These characteristics are based on the SignWriting
definitions of the five major axes of hand variation: handedness, plane, rotation, view, and shape.
Handedness is the distinction between the right and left hands. Signed languages make a distinction
between the dominant hand and the non-dominant hand. For right-handed individuals, the right hand is
considered dominant, and vice-versa. The dominant hand is used for fingerspelling and all one-handed
signs, while the non-dominant hand is used for support and two-handed signs. Using 3D pose estimation,
the handedness analysis is trivial, as the pose estimation platform predicts which hand is which.
Plane refers to whether the hand is parallel to the wall or the floor. The variation in the plane can,
but does not have to, create a distinction between two signs. For example, in ASL the signs for “date”
and “dessert” exhibit the same hand shape, view, rotation, and movement, but differ by plane. The plane
of a hand can be estimated by comparing the positions of the wrist and middle finger metacarpal bone (M_MCP).
Rotation refers to the angle of the hand in relation to the body. SignWriting groups the hand rotation
into eight equal categories, each spanning 45 degrees. The rotation of a hand can be calculated by finding
the angle of the line created by the wrist and the middle finger metacarpal bone.
View refers to the side of the hand as observed by the signer, and is grouped into four categories: front,
back, sideways, and other-sideways. The view of a hand can be estimated by analyzing the normal of
the plane created by the palm of the hand (between the wrist, index finger metacarpal bone, and pinky
metacarpal bone).
Shape refers to the configuration of the fingers and thumb. This characteristic of the hand is the most
complex to analyze due to the vast array of possible shapes the human hand can form. The shape of a
hand is determined by the state of each finger and thumb, specifically whether they are straight, curved,
or bent, and their position relative to each other. Shape analysis can be accomplished by examining the
bend and rotation of each finger joint. More advanced models may also take into consideration the spread
between the fingers and other nuanced characteristics. 3D pose estimation can be used to extract these
features for a machine learning model, which can then classify the hand shape.
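As an illustration, the plane, rotation, and view characteristics described above can be estimated from 21 MediaPipe-style 3D hand keypoints roughly as follows; the landmark indices follow the MediaPipe hand model, while the heuristics and thresholds are assumptions:

```python
import numpy as np

WRIST, INDEX_MCP, MIDDLE_MCP, PINKY_MCP = 0, 5, 9, 17  # MediaPipe hand landmark indices

def hand_plane(hand: np.ndarray) -> str:
    """'wall' vs 'floor': is the wrist -> middle-MCP direction mostly vertical (Y)
    or mostly depth-wise (Z)? (Illustrative heuristic.)"""
    v = hand[MIDDLE_MCP] - hand[WRIST]
    return "wall" if abs(v[1]) >= abs(v[2]) else "floor"

def hand_rotation_deg(hand: np.ndarray) -> float:
    """Angle of the wrist -> middle-MCP line in the image (XY) plane; SignWriting
    would further bucket this angle into eight 45-degree categories."""
    v = hand[MIDDLE_MCP, :2] - hand[WRIST, :2]
    return float(np.degrees(np.arctan2(v[1], v[0])))

def hand_view(hand: np.ndarray) -> str:
    """Coarse front/back/sideways view from the Z component of the palm-plane normal
    (plane spanned by wrist, index MCP, and pinky MCP); the sign convention differs
    between left and right hands."""
    n = np.cross(hand[INDEX_MCP] - hand[WRIST], hand[PINKY_MCP] - hand[WRIST])
    z = n[2] / (np.linalg.norm(n) + 1e-9)
    if z > 0.5:
        return "front"
    if z < -0.5:
        return "back"
    return "sideways"
```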
1. Pose Estimation Initially, the 3D pose of the hand is estimated from the hand image crop (Figure 8).
Figure 8: Pictures of six hands all performing the same hand shape (v-shape) taken from six different orientations.
Mediapipe fails at estimating the pose of the bottom-middle image.
2. 3D Rotation The pose is then rotated in 3D space such that the normal of the back of the hand aligns
with the Z-axis. As a result, the palm plane now resides within the XY plane (Figure 9).
Figure 9: Hand poses after 3D rotation. The scale difference between the hands demonstrates a limitation of the 3D
pose estimation system used.
3. 2D Orientation Subsequently, the pose is rotated in 2D such that the metacarpal bone of the middle finger aligns with the Y-axis (Figure 10).
4. Scale The hand is scaled such that the metacarpal bone of the middle finger attains a constant length
(which we typically set to 200, Figure 11).
5. Translation Lastly, the wrist joint is translated to the origin of the coordinate system (0, 0, 0). Figure 12 shows that, when overlaid, all hands produce the same shape, except for one outlier.
Figure 12: Normalized hand poses overlaid after being translated to the same position. The positions of the wrist and the metacarpal bone of the middle finger are fixed.
By conducting these normalization steps, a hand model can be standardized, reducing the complexity of
subsequent steps such as feature extraction and hand shape classification. This standardization simplifies
the recognition process and can contribute to improving the overall accuracy of the system.
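A compact numpy sketch of normalization steps 2–5 follows (with the wrist translation applied first for convenience). The landmark indices and the target bone length of 200 match the description above; the exact definition of the back-of-hand normal and the handling of degenerate rotations are simplifying assumptions:

```python
import numpy as np

WRIST, INDEX_MCP, MIDDLE_MCP, PINKY_MCP = 0, 5, 9, 17

def rotation_between(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Rotation matrix sending unit vector a onto unit vector b (Rodrigues formula).
    The degenerate antiparallel case is ignored in this sketch."""
    v, c = np.cross(a, b), float(np.dot(a, b))
    vx = np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])
    return np.eye(3) + vx + vx @ vx / (1.0 + c)

def normalize_hand(hand: np.ndarray, bone_length: float = 200.0) -> np.ndarray:
    """hand: (21, 3) keypoints -> normalized (21, 3) keypoints."""
    hand = hand - hand[WRIST]                             # 5. wrist at the origin
    normal = np.cross(hand[INDEX_MCP], hand[PINKY_MCP])   # back-of-hand normal (assumed definition)
    normal /= np.linalg.norm(normal) + 1e-9
    hand = hand @ rotation_between(normal, np.array([0.0, 0.0, 1.0])).T  # 2. normal -> Z-axis
    x, y = hand[MIDDLE_MCP, 0], hand[MIDDLE_MCP, 1]
    angle = np.arctan2(x, y)                              # 3. middle-finger MCP onto the Y-axis
    rot_z = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                      [np.sin(angle),  np.cos(angle), 0.0],
                      [0.0, 0.0, 1.0]])
    hand = hand @ rot_z.T
    return hand * (bone_length / (np.linalg.norm(hand[MIDDLE_MCP]) + 1e-9))  # 4. constant bone length
```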
Multi Angle Consistency Error (MACE) evaluates the consistency of the pose estimation system
across the different orientations. We perform 3D hand normalization and overlay the hands. The MACE score is the average standard deviation of all pose landmarks across all views. A high MACE score
indicates a problem in the pose estimation system’s ability to maintain consistency across different
orientations. This could adversely affect the model’s performance when analyzing hand shapes in sign
languages, as signs can significantly vary with hand rotation.
Figure 13: Visualizations of 10 hand shapes, each with 6 orientations, 3D-normalized and overlaid.
Figure 13 shows that our 3D normalization does work to some extent using Mediapipe. We can identify
differences across hand shapes, but still note high variance within each hand shape.
Crop Consistency Error (CCE) gauges the pose estimation system’s consistency across different crop
sizes. We do not perform 3D normalization, but still overlay all the estimated hands, shifting the wrist
point of each estimated hand to the origin (0, 0, 0). The CCE score is the average standard
deviation of all pose landmarks across crops. A high CCE score indicates that the pose estimation system
is sensitive to the size of the input crop, which is a significant drawback as the system should be invariant
to the size of the input image.
Figure 14 shows that for some poses, Mediapipe is very resilient to crop size differences (e.g. the first
and last hand shapes). However, it is concerning that for some hand shapes, it exhibits very high variance,
and possibly even wrong predictions.
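Both scores reduce to the same computation once the hands are aligned: the mean, over landmarks and coordinates, of the standard deviation across conditions (orientations for MACE, crop sizes for CCE). A minimal sketch with an assumed array layout:

```python
import numpy as np

def consistency_error(hands: np.ndarray) -> float:
    """hands: (num_conditions, 21, 3) estimates of the same hand shape, already
    3D-normalized (MACE) or wrist-aligned per crop (CCE)."""
    return float(np.std(hands, axis=0).mean())
```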
C.5 Conclusion
Our normalization process appears to work reasonably well when applied to different views within the
same crop size. It succeeds in simplifying the hand shape, which in turn, can aid in improving the accuracy
of hand shape classification systems.
However, it is crucial to note that while this method may seem to perform well on a static image, its
consistency and reliability in a dynamic context, such as a video, may be quite different. In a video, the
crop size can change between frames, introducing additional complexity and variance. This dynamic
nature coupled with the inherently noisy nature of the estimation process can pose challenges for a model
that aims to consistently estimate hand shapes.
In light of these findings, it is clear that there is a need for the developers of 3D pose estimation
systems to consider these evaluation methods and strive to make their systems more robust to changes in
hand crops. The Multi Angle Consistency Error (MACE) and the Crop Consistency Error (CCE) can be
valuable tools in this regard.
MACE could potentially be incorporated as a loss function for 3D pose estimation, thereby driving
the model to maintain consistency across different orientations. Alternatively, MACE could be used
as an indicator to identify hand shapes that require more training data. It is apparent from our study
that the performance varies greatly across hand shapes and orientations, and this approach could help in
prioritizing the allocation of training resources.
Ultimately, the goal of improving 3D hand pose estimation is to enhance the ability to encode signed
languages accurately. The insights gathered from this study can guide future research and development
efforts in this direction, paving the way for more robust and reliable sign language technology.
The benchmark, metrics, and visualizations are available at https://github.com/sign-language-processing/3d-hands-benchmark/.