
Linguistically Motivated Sign Language Segmentation

Amit Moryossef*1,2, Zifan Jiang*2, Mathias Müller2, Sarah Ebling2, Yoav Goldberg1
1 Bar-Ilan University, 2 University of Zurich
amitmoryossef@gmail.com, jiang@cl.uzh.ch

arXiv:2310.13960v2 [cs.CL] 30 Oct 2023

Abstract

Sign language segmentation is a crucial task in sign language processing systems. It enables downstream tasks such as sign recognition, transcription, and machine translation. In this work, we consider two kinds of segmentation: segmentation into individual signs and segmentation into phrases, larger units comprising several signs. We propose a novel approach to jointly model these two tasks.

Our method is motivated by linguistic cues observed in sign language corpora. We replace the predominant IO tagging scheme with BIO tagging to account for continuous signing. Given that prosody plays a significant role in phrase boundaries, we explore the use of optical flow features. We also provide an extensive analysis of hand shapes and 3D hand normalization.

We find that introducing BIO tagging is necessary to model sign boundaries. Explicitly encoding prosody by optical flow improves segmentation in shallow models, but its contribution is negligible in deeper models. Careful tuning of the decoding algorithm atop the models further improves the segmentation quality.

We demonstrate that our final models generalize to out-of-domain video content in a different signed language, even under a zero-shot setting. We observe that including optical flow and 3D hand normalization enhances the robustness of the model in this context.

Sign-level tags: O O B I I B I I I O B I I I O O (sign 1, sign 2, sign 3); phrase-level tags: O O B I I I I I I I I I I I O O (phrase)
Figure 1: Per-frame classification of a sign language utterance following a BIO tagging scheme. Each box represents a single frame of a video. We propose a joint model to segment signs (top) and phrases (bottom) at the same time. B=beginning, I=inside, O=outside. The figure illustrates continuous signing where signs often follow each other without an O frame between them.

1 Introduction

Signed languages are natural languages used by deaf and hard-of-hearing individuals to communicate through a combination of manual and non-manual elements (Sandler and Lillo-Martin, 2006). Like spoken languages, signed languages have their own distinctive grammar and vocabulary, which have evolved through natural processes of language development (Sandler, 2010).

Sign language transcription and translation systems rely on the accurate temporal segmentation of sign language videos into meaningful units such as signs (Santemiz et al., 2009; Renz et al., 2021a) or signing sequences corresponding to subtitle units¹ (Bull et al., 2020b). However, sign language segmentation remains a challenging task due to the difficulties in defining meaningful units in signed languages (De Sisto et al., 2021). Our approach is the first to consider two kinds of units in one model. We simultaneously segment single signs and phrases (larger units) in a unified framework.

Previous work typically approached segmentation as a binary classification task (including segmentation tasks in audio signal processing and computer vision), where each frame/pixel is predicted to be either part of a segment or not. However, this approach neglects the intricate nuances of continuous signing, where segment boundaries are not strictly binary and often blend in reality. One sign or phrase can immediately follow another, transitioning smoothly, without a frame between them being distinctly outside (Figure 1 and §3.1).

* Equal contribution.
¹ Subtitles may not always correspond directly to sentences. They frequently split within a sentence and could be temporally offset from the corresponding signing segments.

Figure 2: The annotation of the first phrase in a video from the test set (dgskorpus_goe_02), in yellow, signing: "Why do you smoke?" through the use of three signs: WHY (+mouthed), TO-SMOKE, and a gesture (+mouthed) towards the other signer. At the top, our phrase segmentation model predicts a single phrase that initiates with a B tag (in green) above the B-threshold (green dashed line), followed by an I (in light blue), and continues until falling below a certain threshold. At the bottom, our sign segmentation model accurately segments the three signs.

We propose incorporating linguistically motivated cues to address these challenges and improve sign language segmentation. To cope with continuous signing, we adopt a BIO-tagging approach (Ramshaw and Marcus, 1995), where in addition to predicting a frame to be in or out of a segment, we also classify the beginning of the segment, as shown in Figure 2. Since phrase segmentation is primarily marked with prosodic cues (i.e., pauses, extended sign duration, facial expressions) (Sandler, 2010; Ormel and Crasborn, 2012), we explore using optical flow to explicitly model them (§3.2). Since signs employ a limited number of hand shapes, we additionally perform 3D hand normalization (§3.3).

Evaluating on the Public DGS Corpus (Prillwitz et al., 2008; Hanke et al., 2020) (DGS stands for German Sign Language), we report enhancements in model performance following specific modifications. We compare our final models after hyperparameter optimization, including parameters for the decoding algorithm, and find that our best architecture using only the poses is comparable to the one that uses optical flow and hand normalization.

Reassuringly, we find that our model generalizes when evaluated on additional data from different signed languages in a zero-shot approach. We obtain segmentation scores that are competitive with previous work and observe that incorporating optical flow and hand normalization makes the model more robust for out-of-domain data.

Lastly, we conduct an extensive analysis of pose-based hand manipulations for signed languages (Appendix C). Despite not improving our segmentation model due to noise from current 3D pose estimation models, we emphasize its potential value for future work involving skeletal hand poses. Based on this analysis, we propose several measurable directions for improving 3D pose estimation.

Our code and models are available at https://github.com/sign-language-processing/transcription.

2 Related Work

2.1 Sign Language Detection

Sign language detection (Borg and Camilleri, 2019; Moryossef et al., 2020; Pal et al., 2023) is the task of determining whether signing activity is present in a given video frame. A similar task in spoken languages is voice activity detection (VAD) (Sohn et al., 1999; Ramírez et al., 2004), the detection of when human voice is used in an audio signal. As VAD methods often rely on speech-specific representations such as spectrograms, they are not necessarily applicable to videos.

Borg and Camilleri (2019) introduced the classification of frames taken from YouTube videos as either signing or not signing. They took a spatial and temporal approach based on a VGG-16 (Simonyan and Zisserman, 2015) CNN to encode each frame and used a Gated Recurrent Unit (GRU) (Cho et al., 2014) to encode the sequence of frames in a window of 20 frames at 5fps. In addition to the raw frame, they either encoded optical-flow history, aggregated motion history, or frame difference.

Moryossef et al. (2020) improved upon their method by performing sign language detection in real time. They identified that sign language use involves movement of the body and, as such, designed a model that works on top of estimated human poses rather than directly on the video signal. They calculated the optical flow norm of every joint detected on the body and applied a shallow yet effective contextualized model to predict for every frame whether the person is signing or not.

While these recent detection models achieve high performance, we need well-annotated data including interference and non-signing distractions for proper real-world evaluation. Pal et al. (2023) conducted a detailed analysis of the impact of signer overlap between the training and test sets on two sign detection benchmark datasets (Signing in the Wild (Borg and Camilleri, 2019) and the DGS Corpus (Hanke et al., 2020)) used by Borg and Camilleri (2019) and Moryossef et al. (2020). By comparing the accuracy with and without overlap, they noticed a relative decrease in performance for signers not present during training. As a result, they suggested new dataset partitions that eliminate overlap between train and test sets and facilitate a more accurate evaluation of performance.
2.2 Sign Language Segmentation

Segmentation consists of detecting the frame boundaries for signs or phrases in videos to divide them into meaningful units. While the most canonical way of dividing a spoken language text is into a linear sequence of words, due to the simultaneity of sign language, the notion of a sign language "word" is ill-defined, and sign language cannot be fully linearly modeled.

Current methods resort to segmenting units loosely mapped to signed language units (Santemiz et al., 2009; Farag and Brock, 2019; Bull et al., 2020b; Renz et al., 2021a,b; Bull et al., 2021) and do not explicitly leverage reliable linguistic predictors of sentence boundaries such as prosody in signed languages (i.e., pauses, extended sign duration, facial expressions) (Sandler, 2010; Ormel and Crasborn, 2012). De Sisto et al. (2021) call for a better understanding of sign language structure, which they believe is the necessary ground for the design and development of sign language recognition and segmentation methodologies.

Santemiz et al. (2009) automatically extracted isolated signs from continuous signing by aligning the sequences obtained via speech recognition, modeled by Dynamic Time Warping (DTW) and Hidden Markov Models (HMMs) approaches.

Farag and Brock (2019) used a random forest classifier to distinguish frames containing signs in Japanese Sign Language based on the composition of spatio-temporal angular and distance features between domain-specific pairs of joint segments.

Bull et al. (2020b) segmented French Sign Language into segments corresponding to subtitle units by relying on the alignment between subtitles and sign language videos, leveraging a spatio-temporal graph convolutional network (STGCN; Yu et al. (2017)) with a BiLSTM on 2D skeleton data.

Renz et al. (2021a) located temporal boundaries between signs in continuous sign language videos by employing 3D convolutional neural network representations with iterative temporal segment refinement to resolve ambiguities between sign boundary cues. Renz et al. (2021b) further proposed the Changepoint-Modulated Pseudo-Labelling (CMPL) algorithm to solve the problem of source-free domain adaptation.

Bull et al. (2021) presented a Transformer-based approach to segment sign language videos and align them with subtitles simultaneously, encoding subtitles by BERT (Devlin et al., 2019) and videos by CNN video representations.

3 Motivating Observations

To motivate our proposed approach, we make a series of observations regarding the intrinsic nature of sign language expressions. Specifically, we highlight the unique challenges posed by the continuous flow of sign language expressions (§3.1), the role of prosody in determining phrase boundaries (§3.2), and the influence of hand shape changes in indicating sign boundaries (§3.3).

3.1 Boundary Modeling

When examining the nature of sign language expressions, we note that the utterances are typically signed in a continuous flow, with minimal to no pauses between individual signs. This continuity is particularly evident when dealing with lower frame rates. This continuous nature presents a significant difference from text, where specific punctuation marks serve as indicators of phrase boundaries, and a semi-closed set of tokens represents the words.

Given these characteristics, directly applying conventional segmentation or sign language detection models (that is, utilizing IO tagging in a manner similar to image or audio segmentation models) may not yield the optimal solution, particularly at the sign level. Such models often fail to precisely identify the boundaries between signs.

A promising alternative is Beginning-Inside-Outside (BIO) tagging (Ramshaw and Marcus, 1995). BIO tagging was originally used for named entity recognition, but its application extends to any sequence chunking task beyond the text modality. In the context of sign language, BIO tagging provides a more refined model for discerning boundaries between signs and phrases, thus significantly improving segmentation performance (Figure 1).

To test the viability of the BIO tagging approach in comparison with the IO tagging, we conducted an experiment on the Public DGS Corpus. The entire corpus was transformed to various frame rates, and the sign segments were converted to frames using either BIO or IO tagging, then decoded back into sign segments. Figure 4 illustrates the results of this comparison. Note that the IO tagging was unable to reproduce the same number of segments as the BIO tagging on the gold data. This underscores the importance of BIO tagging in identifying sign and phrase boundaries.

Figure 4: Reproduced sign segments in the Public DGS Corpus by BIO and IO tagging at various frame rates. The y-axis shows the percentage of segments reproduced (92-100%); the x-axis shows the frame rate (10-50 fps). 99.7% of segments are reproduced at 25fps by BIO tagging.
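To make the round-trip check behind this experiment concrete, the following minimal sketch (our reading of the described procedure, not the authors' code, and without the frame-rate resampling step) converts gold sign spans into frame-level tags and decodes them back. With IO tagging, two adjacent signs with no O frame between them collapse into a single segment; with BIO tagging they survive the round trip.

```python
def segments_to_tags(segments, num_frames, scheme="BIO"):
    """Convert gold (start, end) frame spans (inclusive) into per-frame tags."""
    tags = ["O"] * num_frames
    for start, end in segments:
        for i in range(start, end + 1):
            tags[i] = "I"
        if scheme == "BIO":
            tags[start] = "B"
    return tags

def tags_to_segments(tags):
    """Decode per-frame tags back into (start, end) spans."""
    segments, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B" or (tag == "I" and start is None):
            if start is not None:          # a new B closes the previous segment
                segments.append((start, i - 1))
            start = i
        elif tag == "O" and start is not None:
            segments.append((start, i - 1))
            start = None
    if start is not None:
        segments.append((start, len(tags) - 1))
    return segments

# Two adjacent signs with no O frame in between:
gold = [(0, 4), (5, 9)]
print(tags_to_segments(segments_to_tags(gold, 12, "IO")))   # [(0, 9)]  -> merged
print(tags_to_segments(segments_to_tags(gold, 12, "BIO")))  # [(0, 4), (5, 9)]
```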
3.2 Phrase Boundaries

Linguistic research has shown that prosody is a reliable predictor of phrase boundaries in signed languages (Sandler, 2010; Ormel and Crasborn, 2012). We observe that this is also the case in the Public DGS Corpus dataset used in our experiments. To illustrate this, we model pauses and movement using optical flow directly on the poses, as proposed by Moryossef et al. (2020). Figure 3 demonstrates that a change in motion signifies the presence of a pause, which, in turn, indicates a phrase boundary.

Figure 3: Optical flow for a conversation between two signers (signer 1 top, signer 2 bottom), shown separately for the body, face, left hand, and right hand keypoints. The x-axis is the progression across 30 seconds. The yellow marks the annotated phrase spans. (Source: Moryossef et al. (2020))
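As a rough sketch of this pause signal, the code below computes a per-frame motion score from pose keypoints in the spirit of the optical flow norm of Moryossef et al. (2020). The exact formulation (Equation 1 in their paper) may differ, so treat this as an assumption-laden simplification rather than their implementation.

```python
import numpy as np

def pose_optical_flow(poses: np.ndarray, fps: float) -> np.ndarray:
    """poses: (frames, keypoints, axes) array of normalized pose coordinates.
    Returns a (frames - 1, keypoints) array with the motion norm of each keypoint
    between consecutive frames, scaled by fps so values are comparable across
    frame rates."""
    deltas = poses[1:] - poses[:-1]               # (frames - 1, keypoints, axes)
    return np.linalg.norm(deltas, axis=-1) * fps  # (frames - 1, keypoints)

# A pause shows up as a stretch of near-zero flow averaged over keypoints, e.g.:
# flow = pose_optical_flow(poses, fps=25)
# pause_mask = flow.mean(axis=1) < some_threshold
```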

3.3 Sign Boundaries

We observe that signs generally utilize a limited number of hand shapes, with the majority of signs utilizing a maximum of two hand shapes. For example, linguistically annotated datasets, such as ASL-LEX (Sehyr et al., 2021) and ASLLVD (Neidle et al., 2012), only record one initial hand shape per hand and one final hand shape. Mandel (1981, p. 87) argued that there can only be one set of selected fingers per sign, constraining the number of handshapes in signs. This limitation is referred to as the Selected Fingers Constraint. And indeed, Sandler et al. (2008) find that the optimal form of a sign is monosyllabic, and that handshape change is organized by the syllable unit.

To illustrate this constraint empirically, we show a histogram of hand shapes per sign in SignBank² for 705,151 signs in Figure 5.

Figure 5: Number of hand shapes per sign in SignBank.

Additionally, we found that a change in the dominant hand shape often signals the presence of a sign boundary. Specifically, out of 27,658 sentences, comprising 354,955 pairs of consecutive signs, only 17.38% of consecutive signs share the same base hand shape³. Based on these observations, we propose using 3D hand normalization as an indicative cue for hand shapes to assist in detecting sign boundaries. We hypothesize that performing 3D hand normalization makes it easier for the model to extract the hand shape. We expand on this process and show examples in Appendix C.

² https://signbank.org/signpuddle2.0/
³ It is important to note that this percentage is inflated, as it may encompass overlaps across the dominant and non-dominant hands, which were not separated for this analysis.
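Since the normalization is later described (§4.2, step 4) as rotating the hand so that its back lies on the XY plane and the middle-finger metacarpal lies on the Y-axis, then scaling that bone to a constant length, here is a minimal numpy sketch of that idea. The MediaPipe-style landmark indices and the exact choice of palm plane are our assumptions, not the authors' implementation.

```python
import numpy as np

WRIST, INDEX_MCP, MIDDLE_MCP, PINKY_MCP = 0, 5, 9, 17  # assumed MediaPipe hand indices

def normalize_hand_3d(hand: np.ndarray) -> np.ndarray:
    """hand: (21, 3) XYZ keypoints of one hand.
    Returns the keypoints rotated so the back of the hand lies on the XY plane and
    the middle-finger metacarpal lies on the Y-axis, scaled to unit bone length."""
    points = hand - hand[WRIST]                               # wrist at the origin
    y = points[MIDDLE_MCP] / np.linalg.norm(points[MIDDLE_MCP])
    normal = np.cross(points[INDEX_MCP], points[PINKY_MCP])   # palm-plane normal
    z = normal / np.linalg.norm(normal)
    x = np.cross(y, z)
    x /= np.linalg.norm(x)
    z = np.cross(x, y)                                        # re-orthogonalize
    rotation = np.stack([x, y, z])                            # rows = new basis
    rotated = points @ rotation.T
    return rotated / np.linalg.norm(rotated[MIDDLE_MCP])      # constant bone length
```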
4 Experimental Setup

In this section, we describe the experimental setup used to evaluate our linguistically motivated approach for sign language segmentation. This includes a description of the Public DGS Corpus dataset used in the study, the methodology employed to perform sign and phrase segmentation, and the evaluation metrics used to measure the performance of the proposed approach.

4.1 Dataset

The Public DGS Corpus (Prillwitz et al., 2008; Hanke et al., 2020) is a distinctive sign language dataset that includes both accurate sign-level annotation from continuous signing and well-aligned phrase-level translation in spoken language.

The corpus comprises 404 documents / 714 videos⁴ with an average duration of 7.55 minutes, featuring either one signer or two signers, at 50 fps. Most of these videos feature gloss transcriptions and spoken language translations (German and English), except for the ones in the "Joke" category, which are not annotated and thus excluded from our model⁵. The translations are comprised of full spoken language paragraphs, sentences, or phrases (i.e., independent/main clauses).

Each gloss span is considered a gold sign segment, following a tight annotation scheme (Hanke et al., 2012). Phrase segments are identified by examining every translation, with the segment assumed to span from the start of its first sign to the end of its last sign, correcting imprecise annotation (a minimal sketch of this derivation follows this subsection).

This corpus is enriched with full-body pose estimations from OpenPose (Cao et al., 2019; Schulder and Hanke, 2019) and MediaPipe Holistic (Grishchenko and Bazarevsky, 2020). We use the 3.0.0-uzh-document split from Zhang et al. (2023). After filtering the unannotated data, we are left with 296 documents / 583 videos for training, 6 / 12 for validation, and 9 / 17 for testing. The mean number of signs and phrases in a video from the training set is 613 and 111, respectively.

⁴ The number of videos is nearly double the number of documents because each document typically includes two signers, each of whom produces one video for segmentation.
⁵ We also exclude documents with missing annotations: id ∈ {1289910, 1245887, 1289868, 1246064, 1584617}.
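As referenced above, a minimal sketch (our reading of the described procedure, not the authors' code) of deriving a gold phrase segment from the sign segments aligned to one spoken-language translation:

```python
def phrase_segment(sign_segments):
    """sign_segments: list of (start_frame, end_frame) spans for the signs aligned
    to one translation. The phrase is assumed to span from the start of its first
    sign to the end of its last sign."""
    starts, ends = zip(*sign_segments)
    return min(starts), max(ends)

# phrase_segment([(120, 140), (141, 170), (171, 190)])  ->  (120, 190)
```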
4.2 Methodology

Our proposed approach for sign language segmentation is based on the following steps:

1. Pose Estimation Given a video, we first adjust it to 25 fps and estimate body poses using the MediaPipe Holistic pose estimation system. We do not use OpenPose because it lacks a Z-axis, which prevents the 3D rotation used for hand normalization. The shape of a pose is represented as (frames × keypoints × axes).

2. Pose Normalization To generalize over video resolution and distance from the camera, we normalize each of these poses such that the mean distance between the shoulders of each person equals 1, and the mid-point is at (0, 0) (Celebi et al., 2013). We also remove the legs since they are less relevant to signing.

3. Optical Flow We follow the equation in Moryossef et al. (2020, Equation 1).

4. 3D Hand Normalization We rotate and scale each hand to ensure that the same hand shape is represented in a consistent manner across different frames. We rotate the 21 XYZ keypoints of the hand so that the back of the hand lies on the XY plane, we then rotate the hand so that the metacarpal bone of the middle finger lies on the Y-axis, and finally, we scale the hand such that the bone is of constant length. Visualizations are presented in Appendix C.

5. Sequence Encoder For every frame, the pose is first flattened and projected into a standard dimension (256), then fed through an LSTM encoder (Hochreiter and Schmidhuber, 1997).

6. BIO Tagging On top of the encoder, we place two BIO classification heads for sign and phrase independently. B denotes the beginning of a sign or phrase, I denotes the middle of a sign or phrase, and O denotes being outside a sign or phrase. Our cross-entropy loss is proportionally weighted in favor of B as it is a rare label⁶ compared to I and O (a sketch of steps 5-6 follows this list).

7. Greedy Segment Decoding To decode the frame-level BIO predictions into sign/phrase segments, we define a segment to start with the first frame possessing a B probability greater than a predetermined threshold (defaulted at 0.5). The segment concludes with the first frame among the subsequent frames having either a B or O probability exceeding the threshold. We provide the pseudocode of the decoding algorithm in Appendix B.

⁶ B:I:O is about 1:5:18 for signs and 1:58:77 for phrases.
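To make steps 5-6 concrete, here is a minimal PyTorch sketch of the joint encoder with two BIO heads and a class-weighted cross-entropy loss. The projection size, bidirectionality, and depth follow the paper's description, but the input dimension, the exact class weights, and all other details are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn

class JointSegmentationModel(nn.Module):
    def __init__(self, pose_dim: int, hidden: int = 256, layers: int = 4):
        super().__init__()
        self.project = nn.Linear(pose_dim, hidden)       # flatten + project each frame
        self.encoder = nn.LSTM(hidden, hidden, num_layers=layers,
                               bidirectional=True, batch_first=True)
        self.sign_head = nn.Linear(2 * hidden, 3)        # B, I, O logits for signs
        self.phrase_head = nn.Linear(2 * hidden, 3)      # B, I, O logits for phrases

    def forward(self, poses: torch.Tensor):
        # poses: (batch, frames, pose_dim) of flattened keypoints (+ optional features)
        encoded, _ = self.encoder(self.project(poses))
        return self.sign_head(encoded), self.phrase_head(encoded)

# Cross-entropy weighted in favor of the rare B tag; these weights are illustrative,
# loosely inspired by the reported B:I:O ratios, not the authors' exact values.
sign_loss = nn.CrossEntropyLoss(weight=torch.tensor([18.0, 4.0, 1.0]))    # B, I, O
phrase_loss = nn.CrossEntropyLoss(weight=torch.tensor([77.0, 1.3, 1.0]))  # B, I, O

model = JointSegmentationModel(pose_dim=75 * 3)          # e.g., 75 keypoints x XYZ
sign_logits, phrase_logits = model(torch.randn(1, 500, 75 * 3))
# loss = sign_loss(sign_logits.transpose(1, 2), sign_targets) + \
#        phrase_loss(phrase_logits.transpose(1, 2), phrase_targets)
```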
4.3 Experiments

Our approach is evaluated through a series of six sets of experiments. Each set is repeated three times with varying random seeds. Preliminary experiments were conducted to inform the selection of hyperparameters and features, the details of which can be found in Table 3 in Appendix A. Model selection is based on validation metrics.

1. E0: IO Tagger We re-implemented and reproduced⁷ the sign language detection model proposed by Moryossef et al. (2020), in PyTorch (Paszke et al., 2019) as a naive baseline. This model processes optical flow as input and outputs I (is signing) and O (not signing) tags.

2. E1: Bidirectional BIO Tagger We replace the IO tagging heads in E0 with BIO heads to form our baseline. Our preliminary experiments indicate that inputting only the 75 hand and body keypoints and making the LSTM layer bidirectional yields optimal results.

3. E2: Adding Reduced Face Keypoints Although the 75 hand and body keypoints serve as an efficient minimal set for sign language detection/segmentation models, we investigate the impact of other nonmanual sign language articulators, namely, the face. We introduce a reduced set of 128 face keypoints that signify the signer's face contour⁸.

4. E3: Adding Optical Flow At every time step t we append the optical flow between t and t−1 to the current pose frame as an additional dimension after the XYZ axes.

5. E4: Adding 3D Hand Normalization At every time step, we normalize the hand poses and concatenate them to the current pose.

6. E5: Autoregressive Encoder We replace the encoder with the one proposed by Jiang et al. (2023) for the detection and classification of great ape calls from raw audio signals. Specifically, we add autoregressive connections between time steps to encourage consistent output labels. The logits at time step t are concatenated to the input of the next time step, t+1. This modification is implemented bidirectionally by stacking two autoregressive encoders and adding their outputs before the Softmax operation. However, this approach is inherently slow, as we have to fully wait for the previous time step predictions before we can feed them to the next time step.

4.4 Evaluation Metrics

We evaluate the performance of our proposed approach for sign and phrase segmentation using the following metrics:

• Frame-level F1 Score For each frame, we apply the argmax operation to make a local prediction of the BIO class and calculate the macro-averaged per-class F1 score against the ground truth label. We use this frame-level metric during validation as the primary metric for model selection and early stopping, due to its independence from a potentially variable segment decoding algorithm (§5.2).

• Intersection over Union (IoU) We compute the IoU between the ground truth segments and the predicted segments to measure the degree of overlap. Note that we do not perform a one-to-one mapping between the two using techniques like DTW. Instead, we calculate the total IoU based on all segments in a video.

• Percentage of Segments (%) To complement IoU, we introduce the percentage of segments to compare the number of segments predicted by the model with the ground truth annotations. It is computed as #predicted segments / #ground truth segments. The optimal value is 1. (Both IoU and this metric are sketched in code after this list.)

• Efficiency We measure the efficiency of each model by the number of parameters and the training time of the model on a Tesla V100-SXM2-32GB GPU for 100 epochs⁹.

⁷ The initial implementation uses OpenPose (Cao et al., 2019), at 50 fps. Preliminary experiments reveal that these differences do not significantly influence the results.
⁸ We reduce the dense FACE_LANDMARKS in MediaPipe Holistic to the contour keypoints according to the variable mediapipe.solutions.holistic.FACEMESH_CONTOURS.
⁹ Exceptionally, the autoregressive models in E5 were trained on an NVIDIA A100-SXM4-80GB GPU, which doubles the training speed of the V100; still, the training is slow.
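A minimal sketch of the two segment-level metrics as we read their description. The "total IoU" here is computed from the union of frames covered by gold versus predicted segments, which is one plausible reading of "the total IoU based on all segments in a video", not necessarily the authors' exact implementation.

```python
def frames_covered(segments):
    """Set of frame indices covered by a list of (start, end) segments, inclusive."""
    covered = set()
    for start, end in segments:
        covered.update(range(start, end + 1))
    return covered

def total_iou(gold_segments, predicted_segments):
    gold, pred = frames_covered(gold_segments), frames_covered(predicted_segments)
    return len(gold & pred) / len(gold | pred) if gold | pred else 1.0

def percentage_of_segments(gold_segments, predicted_segments):
    """Optimal value is 1: the model predicts as many segments as are annotated."""
    return len(predicted_segments) / len(gold_segments)

# total_iou([(0, 9)], [(0, 4), (6, 9)])                 -> 0.9
# percentage_of_segments([(0, 9)], [(0, 4), (6, 9)])    -> 2.0
```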
Sign Phrase Efficiency
Experiment F1 IoU % F1 IoU % #Params Time
E0 Moryossef et al. (2020) — 0.46 1.09 — 0.70 1.00 102K 0:50:17
E1 Baseline 0.56 0.66 0.91 0.59 0.80 2.50 454K 1:01:50
E2 E1 + Face 0.53 0.58 0.64 0.57 0.76 1.87 552K 1:50:31
E3 E1 + Optical Flow 0.58 0.62 1.12 0.60 0.82 3.19 473K 1:20:17
E4 E3 + Hand Norm 0.56 0.61 1.07 0.60 0.80 3.24 516K 1:30:59
E1s E1 + Depth=4 0.63 0.69 1.11 0.65 0.82 1.63 1.6M 4:08:48
E2s E2 + Depth=4 0.62 0.69 1.07 0.63 0.84 2.68 1.7M 3:14:03
E3s E3 + Depth=4 0.60 0.63 1.13 0.64 0.80 1.53 1.7M 4:08:30
E4s E4 + Depth=4 0.59 0.63 1.13 0.62 0.79 1.43 1.7M 4:35:29
E1s* E1s + Tuned Decoding — 0.69 1.03 — 0.85 1.02 — —
E4s* E4s + Tuned Decoding — 0.63 1.06 — 0.79 1.12 — —
E5 E4s + Autoregressive 0.45 0.47 0.88 0.52 0.63 2.72 1.3M ~3 days

Table 1: Mean test evaluation metrics for our experiments. The best score of each column is in bold and a star (*)
denotes further optimization of the decoding algorithm without changing the model (only affects IoU and %). Table
4 in Appendix A contains a complete report including validation metrics and standard deviation of all experiments.

5 Results and Discussion

We report the mean test evaluation metrics for our experiments in Table 1. We do not report the F1 score for E0 since it has a different number of classes and is thus incomparable. Comparing E1 to E0, we note that the model's bidirectionality, the use of poses, and BIO tagging indeed help outperform the model from previous work, where only optical flow and IO tagging are used. While E1 predicts an excessive number of phrase segments, the IoUs for signs and phrases are both higher.

Adding face keypoints (E2) makes the model worse, while including optical flow (E3) improves the F1 scores. For phrase segmentation, including optical flow increases IoU, but over-segments phrases by more than 300%, which further exaggerates the issue in E1. Including hand normalization (E4) on top of E3 slightly deteriorates the quality of both sign and phrase segmentation.

From the non-exhaustive hyperparameter search in the preliminary experiments (Table 3), we examined different hidden state sizes (64, 128, 256, 512, 1024) and a range of 1 to 8 LSTM layers, and conclude that a hidden size of 256 and 4 layers with a 1e−3 learning rate are optimal for E1, which leads to E1s. We repeat the setup of E2, E3, and E4 with these refined hyperparameters, and show that all of them outperform their counterparts, notably in that they ease the phrase over-segmentation problem.

In E2s, we reaffirmed that adding face keypoints does not yield beneficial results, so we exclude the face in future experiments. Although the face is an essential component to understanding sign language expressions and does play some role in sign and phrase level segmentation, we believe that the 128 face contour points are too dense for the model to learn useful information compared to the 75 body points, and may instead confuse the model.

In addition, the benefits of explicitly including optical flow (E3s) fade away with the increased model depth, and we speculate that now the model might be able to learn the optical flow features by itself. Surprisingly, while adding hand normalization (E4s) still slightly worsens the overall results, it has the best phrase percentage.

From E4s we proceeded with the training of E5, an autoregressive model. Unexpectedly, counter to our intuition and previous work, E5 underachieves our baseline across all evaluation metrics¹⁰.

5.1 Challenges with 3D Hand Normalization

While the use of 3D hand normalization is well-justified in §3.3, we believe it does not help the model due to poor depth estimation quality, as further corroborated by recent research from De Coster et al. (2023). Therefore, we consider it a negative result, showing the deficiencies in the 3D pose estimation system. The evaluation metrics we propose in Appendix C could help identify better pose estimation models for this use case.

¹⁰ E5 should have more parameters than E4s, but because of an implementation bug, each LSTM layer has half the parameters. Based on the current results, we assume that autoregressive connections (even with more parameters) will not improve our models.
5.2 Tuning the Segment Decoding Algorithm

We selected E1s and E4s to further explore the segment decoding algorithm. As detailed in §4.2 and Appendix B, the decoding algorithm has two tunable parameters, threshold_b and threshold_o. We conducted a grid search with these parameters, using values from 10 to 90 in increments of 10. We additionally experimented with a variation of the algorithm that conditions on the most likely class by argmax instead of fixed threshold values, which turned out similar to the default version.

We only measured the results using IoU and the percentage of segments at validation time since the F1 scores remain consistent in this case. For sign segmentation, we found that using threshold_b = 60 and threshold_o = 40/50/60 yields slightly better results than the default setting (50 for both). For phrase segmentation, we identified that higher threshold values (threshold_b = 90, threshold_o = 90 for E1s and threshold_b = 80, threshold_o = 80/90 for E4s) improve on the default significantly, especially on the percentage metric. We report the test results under E1s* and E4s*, respectively.

Despite formulating a single model, we underline a separate sign/phrase model selection process to achieve the best segmentation results. Figure 6 illustrates how higher threshold values reduce the number of predicted segments and skew the distribution of predicted phrase segments towards longer ones in E1s/E1s*. As Bull et al. (2020b) suggest, advanced priors could also be integrated into the decoding algorithm.

Figure 6: Probability density of phrase segment lengths (in seconds) for gold segments, E1s segments, and E1s* segments.
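Our reading of the greedy decoding algorithm (§4.2, step 7) with its two tunable thresholds is sketched below. The authoritative pseudocode is in the paper's Appendix B; in particular, whether the terminating frame is included in the segment is ambiguous in the prose, so this sketch reflects one assumption.

```python
import numpy as np

def decode_segments(probs: np.ndarray, threshold_b: float = 0.5,
                    threshold_o: float = 0.5):
    """probs: (frames, 3) per-frame probabilities for the classes [B, I, O].
    A segment starts at the first frame whose B probability exceeds threshold_b and
    is ended by the first subsequent frame whose B or O probability exceeds
    threshold_o (a high-B frame immediately starts the next segment)."""
    segments, start = [], None
    for t, (b, _, o) in enumerate(probs):
        if start is None:
            if b > threshold_b:
                start = t
        elif b > threshold_o or o > threshold_o:
            segments.append((start, t - 1))
            start = t if b > threshold_b else None
    if start is not None:
        segments.append((start, len(probs) - 1))
    return segments

# Grid search over threshold values as in §5.2 (here on a 0-1 scale):
# for tb in np.arange(0.1, 1.0, 0.1):
#     for to in np.arange(0.1, 1.0, 0.1):
#         evaluate(decode_segments(probs, tb, to))
```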
5.3 Comparison to Previous Work

We re-implemented and re-purposed the sign language detection model introduced in Moryossef et al. (2020) for our segmentation task as a baseline since their work is the state-of-the-art and the only comparable model designed for the Public DGS Corpus dataset. As a result, we show the necessity of replacing IO tagging with BIO tagging to tackle the subtle differences between the two tasks.

For phrase segmentation, we compare to Bull et al. (2020b). We note that our definition of sign language phrases (spanning from the start of its first sign to the end of its last sign) is tighter than the subtitle units used in their paper and that we use different training datasets of different languages and domains. Nevertheless, we implemented some of their frame-level metrics and show the results in Table 2 on both the Public DGS Corpus and the MEDIAPI-SKEL dataset (Bull et al., 2020a) in French Sign Language (LSF). We report both zero-shot out-of-domain results¹¹ and the results of our models trained specifically on their dataset without the spatio-temporal graph convolutional network (ST-GCN) (Yan et al., 2018) used in their work for pose encoding.

Data  Model                   ROC-AUC  F1-M
LSF   full (theirs)           0.87     —
      body (theirs)           0.87     —
      E1s (ours, zero-shot)   0.71     0.41
      E4s (ours, zero-shot)   0.76     0.44
      E1s (ours, trained)     0.87     0.49
      E4s (ours, trained)     0.87     0.51
DGS   E1s (ours)              0.91     0.65
      E4s (ours)              0.90     0.62

Table 2: Evaluation metrics used in Bull et al. (2020b). ROC-AUC is applied exclusively on the O-tag. For comparison, F1-M denotes the macro-averaged per-class F1 used in this work across all tags. The first two rows are the best results taken from Table 1 in their paper. The next four rows represent how our models perform on their data in a zero-shot setting and in a supervised setting, and the last two rows represent how our models perform on our data.

¹¹ The zero-shot results are not directly comparable to theirs due to different datasets and labeling approaches.
For sign segmentation, we do not compare to Renz et al. (2021a,b) due to different datasets and the difficulty in reproducing their segment-level evaluation metrics. The latter depends on the decoding algorithm and a way to match the gold and predicted segments, both of which are variable.

6 Conclusions

This work focuses on the automatic segmentation of signed languages. We are the first to formulate the segmentation of individual signs and larger sign phrases as a joint problem.

We propose a series of improvements over previous work, linguistically motivated by careful analyses of sign language corpora. Recognizing that sign language utterances are typically continuous with minimal pauses, we opted for a BIO tagging scheme over IO tagging. Furthermore, leveraging the fact that phrase boundaries are marked by prosodic cues, we introduce optical flow features as a proxy for prosodic processes. Finally, since signs typically employ a limited number of hand shapes, to make it easier for the model to understand handshapes, we attempt 3D hand normalization.

Our experiments conducted on the Public DGS Corpus confirmed the efficacy of these modifications for segmentation quality. By comparing to previous work in a zero-shot setting, we demonstrate that our models generalize across signed languages and domains and that including linguistically motivated cues leads to a more robust model in this context.

Finally, we envision that the proposed model has applications in real-world data collection for signed languages. Furthermore, a similar segmentation approach could be leveraged in various other fields such as co-speech gesture recognition (Moryossef, 2023) and action segmentation (Tang et al., 2019).
Limitations

Pose Estimation

In this work, we employ the MediaPipe Holistic pose estimation system (Grishchenko and Bazarevsky, 2020). There is a possibility that this system exhibits bias towards certain protected classes (such as gender or race), underperforming in instances with specific skin tones or lower video quality. Thus, we cannot attest to how our system would perform under real-world conditions, given that the videos utilized in our research are generated in a controlled studio environment, primarily featuring white participants.

Encoding of Long Sequences

In this study, we encode sequences of frames that are significantly longer than the typical 512 frames often seen in models employing Transformers (Vaswani et al., 2017). Numerous techniques, ranging from basic temporal pooling/downsampling to more advanced methods such as a video/pose encoder that aggregates local frames into higher-level 'tokens' (Renz et al., 2021a), graph convolutional networks (Bull et al., 2020b), and self-supervised representations (Baevski et al., 2020), can alleviate length constraints, facilitate the use of Transformers, and potentially improve the outcomes. Moreover, a hierarchical method like the Swin Transformer (Liu et al., 2021) could be applicable.

Limitations of Autoregressive LSTMs

In this paper, we replicated the autoregressive LSTM implementation originally proposed by Jiang et al. (2023). Our experiments revealed that this implementation exhibits significant slowness, which prevented us from performing further experimentation. In contrast, other LSTM implementations employed in this project have undergone extensive optimization (Appleyard, 2016), including techniques like combining general matrix multiplication operations (GEMMs), parallelizing independent operations, fusing kernels, rearranging matrices, and implementing various optimizations for models with multiple layers (which are not necessarily applicable here).

A comparison of CPU-based performance demonstrates that our implementation is 6.4 times slower. Theoretically, the number of operations performed by the autoregressive LSTM is equivalent to that of a regular LSTM. However, while the normal LSTM benefits from concurrency based on the number of layers, we do not have that luxury. The optimization of recurrent neural networks (RNNs) (Que et al., 2020, 2021, 2022) remains an ongoing area of research. If proven effective in other domains, we strongly advocate for efforts to optimize the performance of this type of network.

Interference Between Sign and Phrase Models

In our model, we share the encoder for both the sign and phrase segmentation models, with a shallow linear layer for the BIO tag prediction associated with each task. It remains uncertain whether these two tasks interfere with or enhance each other. An ablation study (not presented in this work) involving separate modeling is necessary to obtain greater insight into this matter.

Noisy Training Objective

Although the annotations utilized in this study are of expert level, the determination of precise sign (Hanke et al., 2012) and phrase boundaries remains a challenging task, even for experts. Training the model on these annotated boundaries might introduce excessive noise. A similar issue was observed in classification-based pose estimation (Cao et al., 2019). The task of annotating the exact anatomical centers of joints proves to be nearly impossible, leading to a high degree of noise when predicting joint position as a 1-hot classification task. The solution proposed in this previous work was to distribute a Gaussian around the annotated location of each joint. This approach allows the joint's center to overlap with some probability mass, thereby reducing the noise for the model. A similar solution could be applied in our context. Instead of predicting a strict 0 or 1 class probability, we could distribute a Gaussian around the boundary.
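A minimal sketch of this idea (our illustration, not something implemented in the paper): soften the hard boundary targets by spreading a Gaussian over neighboring frames, so that slightly misplaced boundary annotations are penalized less. Such soft targets could replace the hard B labels in a binary-cross-entropy-style objective.

```python
import numpy as np

def soft_boundary_targets(boundary_frames, num_frames, sigma=2.0):
    """Return per-frame soft targets in [0, 1] for the B class: 1.0 at each
    annotated boundary frame, decaying with a Gaussian over its neighbors."""
    frames = np.arange(num_frames)[:, None]          # (num_frames, 1)
    centers = np.asarray(boundary_frames)[None, :]   # (1, num_boundaries)
    gaussians = np.exp(-0.5 * ((frames - centers) / sigma) ** 2)
    return gaussians.max(axis=1)                     # (num_frames,)

# soft_boundary_targets([10], 21, sigma=2.0)[8:13]
# -> roughly [0.61, 0.88, 1.00, 0.88, 0.61]
```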
Naive Segment Decoding

We recognize that the frame-level greedy decoding strategy implemented in our study may not be optimal. Previous research in audio segmentation (Venkatesh et al., 2022) employed a You Only Look Once (YOLO; Redmon et al. (2015)) decoding scheme to predict segment boundaries and classes. We propose using a similar prediction atop a given representation, such as the LSTM output or classification logits of an already trained network. Differing from traditional object detection tasks, this process is simplified due to the absence of a Y axis and non-overlapping segments. In this scenario, the network predicts the segment boundaries using regression, thereby avoiding the class imbalance issue of the BIO tagging. We anticipate this to yield more accurate sign language segmentation.

Lack of Transcription

Speech segmentation is a close task to our sign language segmentation task on videos. In addition to relying on prosodic cues from audio, the former could benefit from automatic speech transcription systems, either in terms of surrogating the task to text-level segmentation and punctuation (Cho et al., 2015), or gaining additional training data from automatic speech recognition / spoken language translation (Tsiamas et al., 2022).

However, for signed languages, there is neither a standardized and widely used written form nor a reliable transcription procedure into some potential writing systems like SignWriting (Sutton, 1990), HamNoSys (Prillwitz and Zienert, 1990), and glosses (Johnston, 2008). Transcription/recognition and segmentation tasks need to be solved simultaneously, so we envision that a multi-task setting helps. Sign spotting, the localization of a specific sign in continuous signing, is a simplification of the segmentation and recognition problem in a closed-vocabulary setting (Wong et al., 2022; Varol et al., 2022). It can be used to find candidate boundaries for some signs, but not all.

Acknowledgements

This work was funded by the EU Horizon 2020 project EASIER (grant agreement no. 101016982), the Swiss Innovation Agency (Innosuisse) flagship IICT (PFFS-21-47) and the EU Horizon 2020 project iEXTRACT (grant agreement no. 802774). We also thank Rico Sennrich and Chantal Amrhein for their suggestions.
References

Jeremy Appleyard. 2016. Optimizing recurrent neural networks in cuDNN 5. Accessed: 2023-06-09.
Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in neural information processing systems, 33:12449–12460.
Mark Borg and Kenneth P Camilleri. 2019. Sign language detection "in the wild" with recurrent neural networks. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1637–1641. IEEE.
Hannah Bull, Triantafyllos Afouras, Gül Varol, Samuel Albanie, Liliane Momeni, and Andrew Zisserman. 2021. Aligning subtitles in sign language videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11552–11561.
Hannah Bull, Annelies Braffort, and Michèle Gouiffès. 2020a. MEDIAPI-SKEL - a 2D-skeleton video database of French Sign Language with aligned French subtitles. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 6063–6068, Marseille, France. European Language Resources Association.
Hannah Bull, Michèle Gouiffès, and Annelies Braffort. 2020b. Automatic segmentation of sign language into subtitle-units. In European Conference on Computer Vision, pages 186–198. Springer.
Z. Cao, G. Hidalgo Martinez, T. Simon, S. Wei, and Y. A. Sheikh. 2019. OpenPose: Realtime multi-person 2D pose estimation using part affinity fields. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Sait Celebi, Ali Selman Aydin, Talha Tarik Temiz, and Tarik Arici. 2013. Gesture recognition using skeleton data with weighted dynamic time warping. In International Conference on Computer Vision Theory and Applications.
Eunah Cho, Jan Niehues, Kevin Kilgour, and Alex Waibel. 2015. Punctuation insertion for real-time spoken language translation. In Proceedings of the 12th International Workshop on Spoken Language Translation: Papers, pages 173–179.
Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar. Association for Computational Linguistics.
Mathieu De Coster, Ellen Rushe, Ruth Holmes, Anthony Ventresque, and Joni Dambre. 2023. Towards the extraction of robust sign embeddings for low resource sign language recognition. arXiv preprint arXiv:2306.17558.
Mirella De Sisto, Dimitar Shterionov, Irene Murtagh, Myriam Vermeerbergen, and Lorraine Leeson. 2021. Defining meaningful units. Challenges in sign segmentation and segment-meaning mapping (short paper). In Proceedings of the 1st International Workshop on Automatic Translation for Signed and Spoken Languages (AT4SSL), pages 98–103, Virtual. Association for Machine Translation in the Americas.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
Iva Farag and Heike Brock. 2019. Learning motion disfluencies for automatic sign language segmentation. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7360–7364. IEEE.
Adam Frost and Valerie Sutton. 2022. SignWriting Hand Symbols. SignWriting Press.
Ivan Grishchenko and Valentin Bazarevsky. 2020. MediaPipe Holistic.
Thomas Hanke, Silke Matthes, Anja Regen, and Satu Worseck. 2012. Where does a sign start and end? Segmentation of continuous signing. In Language Resources and Evaluation Conference.
Thomas Hanke, Marc Schulder, Reiner Konrad, and Elena Jahn. 2020. Extending the Public DGS Corpus in size and depth. In Proceedings of the LREC2020 9th Workshop on the Representation and Processing of Sign Languages: Sign Language Resources in the Service of the Language Community, Technological Challenges and Application Perspectives, pages 75–82, Marseille, France. European Language Resources Association (ELRA).
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.
Zifan Jiang, Adrian Soldati, Isaac Schamberg, Adriano R Lameira, and Steven Moran. 2023. Automatic sound event detection and classification of great ape calls using neural networks. arXiv preprint arXiv:2301.02214.
Trevor Alexander Johnston. 2008. From archive to corpus: transcription and annotation in the creation of signed language corpora. In Pacific Asia Conference on Language, Information and Computation.
Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022.
Mark Alan Mandel. 1981. Phonotactics and morphophonology in American Sign Language. University of California, Berkeley.
Amit Moryossef. 2023. Addressing the blind spots in spoken language processing. arXiv preprint arXiv:2309.06572.
Amit Moryossef, Ioannis Tsochantaridis, Roee Aharoni, Sarah Ebling, and Srini Narayanan. 2020. Real-time sign-language detection using human pose estimation. In Computer Vision – ECCV 2020 Workshops: Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, SLRTP 2020: The Sign Language Recognition, Translation and Production Workshop, pages 237–248.
Carol Neidle, Ashwin Thangali, and Stan Sclaroff. 2012. Challenges in development of the American Sign Language lexicon video dataset (ASLLVD) corpus. In 5th Workshop on the Representation and Processing of Sign Languages: Interactions between Corpus and Lexicon, LREC. Citeseer.
Ellen Ormel and Onno Crasborn. 2012. Prosodic correlates of sentences in signed languages: A literature review and suggestions for new types of studies. Sign Language Studies, 12(2):279–315.
Abhilash Pal, Stephan Huber, Cyrine Chaabani, Alessandro Manzotti, and Oscar Koller. 2023. On the importance of signer overlap for sign language detection. arXiv preprint arXiv:2303.10782.
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc.
Siegmund Prillwitz, Thomas Hanke, Susanne König, Reiner Konrad, Gabriele Langer, and Arvid Schwarz. 2008. DGS corpus project – development of a corpus based electronic dictionary German Sign Language/German. In sign-lang at LREC 2008, pages 159–164. European Language Resources Association (ELRA).
Siegmund Prillwitz and Heiko Zienert. 1990. Hamburg notation system for sign language: Development of a sign writing with computer application. In Current Trends in European Sign Language Research. Proceedings of the 3rd European Congress on Sign Language Research, pages 355–379.
Zhiqiang Que, Hiroki Nakahara, Hongxiang Fan, He Li, Jiuxi Meng, Kuen Hung Tsoi, Xinyu Niu, Eriko Nurvitadhi, and Wayne W. C. Luk. 2022. Remarn: A reconfigurable multi-threaded multi-core accelerator for recurrent neural networks. ACM Transactions on Reconfigurable Technology and Systems, 16:1–26.
Zhiqiang Que, Hiroki Nakahara, Eriko Nurvitadhi, Hongxiang Fan, Chenglong Zeng, Jiuxi Meng, Xinyu Niu, and Wayne W. C. Luk. 2020. Optimizing reconfigurable recurrent neural networks. 2020 IEEE 28th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), pages 10–18.
Zhiqiang Que, Erwei Wang, Umar Marikar, Eric Moreno, Jennifer Ngadiuba, Hamza Javed, Bartłomiej Borzyszkowski, Thea Klaeboe Aarrestad, Vladimir Loncar, Sioni Paris Summers, Maurizio Pierini, Peter Y. K. Cheung, and Wayne W. C. Luk. 2021. Accelerating recurrent neural networks for gravitational wave experiments. 2021 IEEE 32nd International Conference on Application-specific Systems, Architectures and Processors (ASAP), pages 117–124.
Javier Ramírez, José C Segura, Carmen Benítez, Angel De La Torre, and Antonio Rubio. 2004. Efficient voice activity detection algorithms using long-term speech information. Speech Communication, 42(3-4):271–287.
Lance Ramshaw and Mitch Marcus. 1995. Text chunking using transformation-based learning. In Third Workshop on Very Large Corpora.
Joseph Redmon, Santosh Kumar Divvala, Ross B. Girshick, and Ali Farhadi. 2015. You only look once: Unified, real-time object detection. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 779–788.
Katrin Renz, Nicolaj C Stache, Samuel Albanie, and Gül Varol. 2021a. Sign language segmentation with temporal convolutional networks. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2135–2139. IEEE.
Katrin Renz, Nicolaj C Stache, Neil Fox, Gül Varol, and Samuel Albanie. 2021b. Sign segmentation with changepoint-modulated pseudo-labelling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3403–3412.
Wendy Sandler. 2010. Prosody and syntax in sign languages. Transactions of the Philological Society, 108(3):298–328.
Wendy Sandler and Diane Lillo-Martin. 2006. Sign language and linguistic universals. Cambridge University Press.
Wendy Sandler, Peter Macneilage, Barbara Davis, Kristine Zajdo, New York, and Taylor Francis. 2008. The syllable in sign language: Considering the other natural language modality.
Pinar Santemiz, Oya Aran, Murat Saraclar, and Lale Akarun. 2009. Automatic sign segmentation from continuous signing via multiple sequence alignment. In 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops, pages 2001–2008. IEEE.
Marc Schulder and Thomas Hanke. 2019. OpenPose in the Public DGS Corpus. Project Note AP06-2019-01, DGS-Korpus project, IDGS, Hamburg University, Hamburg, Germany.
Zed Sevcikova Sehyr, Naomi Caselli, Ariel M Cohen-Goldberg, and Karen Emmorey. 2021. The ASL-LEX 2.0 project: A database of lexical and phonological properties for 2,723 signs in American Sign Language. The Journal of Deaf Studies and Deaf Education, 26(2):263–277.
Karen Simonyan and Andrew Zisserman. 2015. Very deep convolutional networks for large-scale image recognition. CoRR.
Jongseo Sohn, Nam Soo Kim, and Wonyong Sung. 1999. A statistical model-based voice activity detection. IEEE Signal Processing Letters, 6(1):1–3.
William C Stokoe Jr. 1960. Sign Language Structure:
An Outline of the Visual Communication Systems of
the American Deaf. The Journal of Deaf Studies and
Deaf Education, 10(1):3–37.
Valerie Sutton. 1990. Lessons in sign writing. Sign-
Writing.
Yansong Tang, Dajun Ding, Yongming Rao, Yu Zheng,
Danyang Zhang, Lili Zhao, Jiwen Lu, and Jie Zhou.
2019. COIN: A large-scale dataset for comprehen-
sive instructional video analysis. In IEEE Confer-
ence on Computer Vision and Pattern Recognition
(CVPR).
Ioannis Tsiamas, Gerard I. Gállego, José A. R. Fonol-
losa, and Marta R. Costa-jussà. 2022. SHAS:
Approaching optimal segmentation for end-to-end
speech translation. In Proc. Interspeech 2022, pages
106–110.
Gül Varol, Liliane Momeni, Samuel Albanie, Triantafyl-
los Afouras, and Andrew Zisserman. 2022. Scaling
up sign spotting through sign language dictionaries.
International Journal of Computer Vision, 130:1416
– 1439.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob
Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz
Kaiser, and Illia Polosukhin. 2017. Attention is all
you need. Advances in neural information processing
systems, 30.
Satvik Venkatesh, David Moffat, and Eduardo Reck
Miranda. 2022. You only hear once: a YOLO-like
algorithm for audio segmentation and sound event
detection. Applied Sciences, 12(7):3293.
Ryan Wong, Necati Cihan Camgoz, and R. Bowden.
2022. Hierarchical I3D for sign spotting. In ECCV
Workshops.
Sijie Yan, Yuanjun Xiong, and Dahua Lin. 2018. Spatial
temporal graph convolutional networks for skeleton-
based action recognition. In Proceedings of the AAAI
conference on artificial intelligence, volume 32.
Ting Yu, Haoteng Yin, and Zhanxing Zhu. 2017. Spatio-
temporal graph convolutional networks: A deep learn-
ing framework for traffic forecasting. In Interna-
tional Joint Conference on Artificial Intelligence.
Biao Zhang, Mathias Müller, and Rico Sennrich. 2023.
SLTUNET: A simple unified model for sign language
translation. In The Eleventh International Confer-
ence on Learning Representations, Kigali, Rwanda.
A Extended Experimental Results
We conducted some preliminary experiments (starting with P0) on training a sign language segmentation model to gain insights into hyperparameters and feature choices. The results are shown in Table 3¹². We found the optimal hyperparameters in P1.3.2 and repeated them with different feature choices.

Sign Phrase
Experiment F1 IoU % F1 IoU %
P0 Moryossef et al. (2020) test — 0.4 1.45 — 0.65 0.82
dev — 0.35 1.36 — 0.6 0.77
P0.1 P0 + Holistic 25fps test — 0.39 0.86 — 0.64 0.5
dev — 0.32 0.81 — 0.58 0.52
P1 P1 baseline test 0.55 0.49 0.83 0.6 0.67 2.63
dev 0.56 0.43 0.75 0.58 0.62 2.61
P1.1 P1 - encoder_bidirectional test 0.48 0.45 0.68 0.5 0.64 2.68
dev 0.46 0.41 0.64 0.51 0.61 2.56
P1.2.1 P1 + hidden_dim=512 test 0.47 0.42 0.44 0.52 0.63 1.7
dev 0.46 0.4 0.43 0.52 0.61 1.69
P1.2.2 P1 + hidden_dim=1024 test 0.48 0.45 0.42 0.58 0.65 1.53
dev 0.46 0.41 0.36 0.53 0.61 1.49
P1.3.1 P1 + encoder_depth=2 test 0.55 0.48 0.76 0.58 0.67 2.56
dev 0.56 0.43 0.69 0.58 0.62 2.52
P1.3.2 P1 + encoder_depth=4 test 0.63 0.51 0.91 0.66 0.67 1.41
dev 0.61 0.47 0.84 0.64 0.6 1.39
P1.4.1 P1 + hidden_dim=128 + encoder_depth=2 test 0.58 0.48 0.8 0.6 0.67 2.0
dev 0.55 0.43 0.75 0.54 0.62 2.03
P1.4.2 P1 + hidden_dim=128 + encoder_depth=4 test 0.62 0.51 0.91 0.64 0.68 2.43
dev 0.6 0.47 0.83 0.6 0.62 2.57
P1.4.3 P1 + hidden_dim=128 + encoder_depth=8 test 0.59 0.52 0.91 0.63 0.68 3.04
dev 0.6 0.47 0.84 0.6 0.62 3.02
P1.5.1 P1 + hidden_dim=64 + encoder_depth=4 test 0.57 0.5 0.8 0.6 0.68 2.41
dev 0.58 0.45 0.75 0.59 0.62 2.39
P1.5.2 P1 + hidden_dim=64 + encoder_depth=8 test 0.62 0.51 0.85 0.64 0.68 2.53
dev 0.6 0.46 0.79 0.6 0.62 2.53
P2 P1 + optical_flow test 0.58 0.5 0.95 0.63 0.68 3.17
dev 0.59 0.45 0.84 0.59 0.61 3.08
P2.1 P1.3.2 + optical_flow test 0.63 0.51 0.92 0.66 0.67 1.51
dev 0.62 0.46 0.81 0.62 0.6 1.53
P3 P1 + hand_normalization test 0.55 0.48 0.77 0.58 0.67 2.79
dev 0.55 0.42 0.71 0.57 0.62 2.73
P3.1 P1.3.2 + hand_normalization test 0.63 0.51 0.91 0.66 0.66 1.43
dev 0.61 0.46 0.82 0.64 0.61 1.46
P4 P2.1 + P3.1 test 0.56 0.51 0.92 0.61 0.66 1.45
dev 0.61 0.46 0.81 0.63 0.6 1.41
P4.1 P4 + encoder_depth=8 test 0.6 0.51 0.95 0.62 0.67 1.08
dev 0.61 0.47 0.86 0.62 0.6 1.12
P5 P1.3.2 + reduced_face test 0.63 0.51 0.94 0.64 0.66 1.16
dev 0.61 0.47 0.86 0.64 0.58 1.14
P5.1 P1.3.2 + full_face test 0.54 0.49 0.8 0.6 0.68 2.29
dev 0.57 0.45 0.7 0.59 0.62 2.29

Table 3: Results of the preliminary experiments.

¹² Note that due to an implementation issue on edge cases (which we fixed later), the IoU and % values in Table 3 are lower than the ones in Table 1 and Table 4, and thus not comparable across tables. The comparison between different experiments within Table 3 remains meaningful. In addition, the results in Table 3 are based on only one run instead of three random runs.
We selected some promising models from our preliminary experiments and reran them three times using different random seeds to make the final conclusions reliable and robust. Table 4 reports the standard deviations and the validation results (on which we performed model selection) for readers to scrutinize.

Experiment | Split | Sign: F1, IoU, % | Phrase: F1, IoU, % | #Params | Time
E0 Moryossef et al. (2020) test — 0.46 ± 0.03 1.09 ± 0.41 — 0.70 ± 0.01 1.00 ± 0.06 102K 0:50:17
dev — 0.42 ± 0.05 1.21 ± 0.59 — 0.61 ± 0.06 2.47 ± 0.85 102K 0:50:17
E1 Baseline test 0.56 ± 0.03 0.66 ± 0.01 0.91 ± 0.05 0.59 ± 0.02 0.80 ± 0.03 2.50 ± 0.13 454K 1:01:50
dev 0.55 ± 0.01 0.59 ± 0.00 1.12 ± 0.11 0.56 ± 0.02 0.75 ± 0.05 2.94 ± 0.08 454K 1:01:50
E2 E1 + Face test 0.53 ± 0.05 0.58 ± 0.07 0.64 ± 0.30 0.57 ± 0.02 0.76 ± 0.03 1.87 ± 0.83 552K 1:50:31
dev 0.50 ± 0.07 0.53 ± 0.11 0.90 ± 0.19 0.53 ± 0.05 0.71 ± 0.07 2.43 ± 1.02 552K 1:50:31
E3 E1 + Optical Flow test 0.58 ± 0.01 0.62 ± 0.00 1.12 ± 0.05 0.60 ± 0.03 0.82 ± 0.03 3.19 ± 0.11 473K 1:20:17
dev 0.58 ± 0.00 0.62 ± 0.00 1.50 ± 0.19 0.59 ± 0.01 0.79 ± 0.00 3.94 ± 0.14 473K 1:20:17
E4 E3 + Hand Norm test 0.56 ± 0.02 0.61 ± 0.00 1.07 ± 0.05 0.60 ± 0.00 0.80 ± 0.00 3.24 ± 0.17 516K 1:30:59
dev 0.57 ± 0.01 0.61 ± 0.01 1.50 ± 0.07 0.58 ± 0.00 0.79 ± 0.00 4.04 ± 0.31 516K 1:30:59
E1s E1 + Depth=4 test 0.63 ± 0.01 0.69 ± 0.00 1.11 ± 0.01 0.65 ± 0.02 0.82 ± 0.04 1.63 ± 0.10 1.6M 4:08:48
dev 0.61 ± 0.00 0.63 ± 0.00 1.27 ± 0.01 0.63 ± 0.01 0.77 ± 0.01 2.17 ± 0.18 1.6M 4:08:48
E2s E2 + Depth=4 test 0.62 ± 0.02 0.69 ± 0.00 1.07 ± 0.03 0.63 ± 0.01 0.84 ± 0.03 2.68 ± 0.53 1.7M 3:14:03
dev 0.60 ± 0.01 0.63 ± 0.01 1.20 ± 0.12 0.59 ± 0.02 0.76 ± 0.05 3.30 ± 0.62 1.7M 3:14:03
E3s E3 + Depth=4 test 0.60 ± 0.01 0.63 ± 0.00 1.13 ± 0.01 0.64 ± 0.03 0.80 ± 0.03 1.53 ± 0.18 1.7M 4:08:30
dev 0.62 ± 0.00 0.63 ± 0.00 1.63 ± 0.05 0.63 ± 0.00 0.76 ± 0.00 2.14 ± 0.09 1.7M 4:08:30
E4s E4 + Depth=4 test 0.59 ± 0.00 0.63 ± 0.00 1.13 ± 0.03 0.62 ± 0.00 0.79 ± 0.00 1.43 ± 0.10 1.7M 4:35:29
dev 0.61 ± 0.00 0.63 ± 0.00 1.56 ± 0.04 0.63 ± 0.00 0.77 ± 0.01 1.89 ± 0.07 1.7M 4:35:29
E4ba E4s + Autoregressive test 0.45 ± 0.03 0.47 ± 0.05 0.88 ± 0.08 0.52 ± 0.02 0.63 ± 0.10 2.72 ± 1.33 1.3M 2 days, 21:28:42
dev 0.40 ± 0.01 0.40 ± 0.01 2.02 ± 0.73 0.47 ± 0.00 0.57 ± 0.04 4.26 ± 1.26 1.3M 2 days, 21:28:42

Table 4: Mean evaluation metrics for our main experiments. A complete version of Table 1.
B Greedy Decoding Algorithm
We provide our exact decoding algorithm in Algorithm 1. We opt to employ adjustable thresholds rather
than argmax prediction, as our empirical findings demonstrate superior performance with this approach
(§5.2).

Algorithm 1 Probabilities to Segments Conversion.


Require: probs, a list of (B, I, O) probabilities from 0 to 100
1: threshold_b ← 50.0
2: threshold_o ← 50.0
3:
4: start ← None
5: did_pass_start ← False
6:
7: for i = 0 to len(probs) − 1 do
8:     b, _, o ← probs[i] // the I probability is not used
9:
10:    if start = None then
11:        if b > threshold_b then
12:            start ← i
13:        end if
14:    else
15:        if did_pass_start then
16:            if b > threshold_b or o > threshold_o then
17:                yield (start, i − 1)
18:                start ← None
19:                did_pass_start ← False
20:            end if
21:        else
22:            if b < threshold_b then
23:                did_pass_start ← True
24:            end if
25:        end if
26:    end if
27: end for
28:
29: if start ≠ None then
30:     yield (start, len(probs) − 1)
31: end if
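For readers who prefer runnable code, the following is a minimal Python sketch of Algorithm 1. It assumes probs is a sequence of (B, I, O) probability triples on a 0–100 scale, mirroring the pseudocode; the function name probs_to_segments and the default thresholds are ours for illustration, not the exact released implementation.

from typing import Iterable, Iterator, Tuple


def probs_to_segments(
    probs: Iterable[Tuple[float, float, float]],
    threshold_b: float = 50.0,
    threshold_o: float = 50.0,
) -> Iterator[Tuple[int, int]]:
    """Convert per-frame (B, I, O) probabilities (0-100) into inclusive (start, end) segments."""
    probs = list(probs)
    start = None
    did_pass_start = False

    for i, (b, _, o) in enumerate(probs):
        if start is None:
            if b > threshold_b:
                start = i  # a segment begins at this frame
        elif did_pass_start:
            if b > threshold_b or o > threshold_o:
                yield (start, i - 1)  # close the segment before the current frame
                start = None
                did_pass_start = False
        elif b < threshold_b:
            did_pass_start = True  # we moved past the segment-initial B peak

    if start is not None:
        yield (start, len(probs) - 1)  # close a segment left open at the end

For example, list(probs_to_segments([(90, 5, 5), (10, 85, 5), (5, 10, 85)])) yields [(0, 1)].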
C Pose Based Hand Shape Analysis
C.1 Introduction to Hand Shapes in Sign Language
The most prominent feature of signed languages is their use of the hands. In fact, the hands play an
integral role in the phonetics of signs, and a slight variation in hand shape can convey differences in
meaning (Stokoe Jr, 1960). In sign languages such as American Sign Language (ASL) and British Sign
Language (BSL), different hand shapes contribute to the vocabulary of the language, similar to how
different sounds contribute to the vocabulary of spoken languages. ASL is estimated to use between 30 and 80 hand shapes¹³, while BSL is limited to approximately 40 hand shapes¹⁴. SignWriting (Sutton, 1990), a
system of notation used for sign languages, specifies a superset of 261 distinct hand shapes (Frost and
Sutton, 2022). Each sign language uses a subset of these hand shapes.
Despite the fundamental role of hand shapes in sign languages, accurately recognizing and classifying
them is a challenging task. In this section, we explore rule-based hand shape analysis using 3D hand normalization, which transforms any given hand into a fixed orientation. This makes it easier for a model to extract the hand shape and hence improves the recognition and classification of hand shapes in sign languages.

C.2 Characteristics of the Human Hand


The human hand consists of 27 bones and can be divided into three main sections: the wrist (carpals), the
palm (metacarpals), and the fingers (phalanges). Each finger consists of three bones, except for the thumb,
which has two. The bones are connected by joints, which allow for the complex movements and shapes
that the hand can form.

Figure 7: Anatomy of a human hand. ©American Society for Surgery of the Hand
¹³ https://aslfont.github.io/Symbol-Font-For-ASL/asl/handshapes.html
¹⁴ https://bsl.surrey.ac.uk/principles/i-hand-shapes
Understanding the different characteristics of hands and their implications in signed languages is crucial
for the extraction and classification of hand shapes. These characteristics are based on the SignWriting
definitions of the five major axes of hand variation: handedness, plane, rotation, view, and shape.
Handedness is the distinction between the right and left hands. Signed languages make a distinction
between the dominant hand and the non-dominant hand. For right-handed individuals, the right hand is
considered dominant, and vice versa. The dominant hand is used for fingerspelling and all one-handed
signs, while the non-dominant hand is used for support and two-handed signs. Using 3D pose estimation,
the handedness analysis is trivial, as the pose estimation platform predicts which hand is which.
Plane refers to whether the hand is parallel to the wall or the floor. The variation in the plane can,
but does not have to, create a distinction between two signs. For example, in ASL the signs for “date”
and “dessert” exhibit the same hand shape, view, rotation, and movement, but differ by plane. The plane
of a hand can be estimated by comparing the positions of the wrist and middle finger metacarpal bone
(M_MCP).

Algorithm 2 Hand Plane Estimation


1: y ← |M_MCP.y − WRIST.y| × 1.5 // add bias to y
2: z ← |M_MCP.z − WRIST.z|
3: return y > z ? ‘wall’ : ‘floor’

Rotation refers to the angle of the hand in relation to the body. SignWriting groups the hand rotation
into eight equal categories, each spanning 45 degrees. The rotation of a hand can be calculated by finding
the angle of the line created by the wrist and the middle finger metacarpal bone.
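As a concrete illustration, the sketch below buckets this angle into SignWriting's eight 45-degree rotation categories. It assumes landmarks with .x and .y attributes (as MediaPipe provides); the function name, the atan2 convention, and the half-bucket centering are our own illustrative choices.

import math


def hand_rotation_category(wrist, m_mcp, num_buckets: int = 8) -> int:
    """Bucket the wrist -> middle finger MCP direction into 45-degree rotation classes."""
    # Angle of the line from the wrist to the middle finger metacarpal bone, in degrees.
    angle = math.degrees(math.atan2(m_mcp.y - wrist.y, m_mcp.x - wrist.x)) % 360
    bucket_size = 360 / num_buckets  # 45 degrees for the default eight categories
    # Center each bucket on its canonical angle before flooring.
    return int(((angle + bucket_size / 2) % 360) // bucket_size)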
View refers to the side of the hand as observed by the signer, and is grouped into four categories: front,
back, sideways, and other-sideways. The view of a hand can be estimated by analyzing the normal of
the plane created by the palm of the hand (between the wrist, index finger metacarpal bone, and pinky
metacarpal bone).

Algorithm 3 Hand View Estimation


1: normal ← math.normal(WRIST, I_MCP, P_MCP)
2: plane ← get_plane(WRIST, M_MCP)
3: if plane = ‘wall’ then
4: angle ← ∠(normal.z, normal.x)
5: return angle > 210 ? ‘front’ : (angle > 150 ? ‘sideways’ : ‘back’)
6: else
7: angle ← ∠(normal.y, normal.x)
8: return angle > 0 ? ‘front’ : (angle > −60 ? ‘sideways’ : ‘back’)
9: end if
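The pseudocode in Algorithms 2 and 3 might translate into Python roughly as follows. The sketch assumes 3D landmarks with .x/.y/.z attributes (as MediaPipe Holistic returns); the atan2 argument order and the degree ranges are our reading of the angle notation above, so treat this as an illustrative approximation rather than the exact implementation.

import math
import numpy as np


def _vec(landmark) -> np.ndarray:
    return np.array([landmark.x, landmark.y, landmark.z])


def hand_plane(wrist, m_mcp) -> str:
    """Algorithm 2: is the hand parallel to the wall or to the floor?"""
    y = abs(m_mcp.y - wrist.y) * 1.5  # add a bias towards the vertical difference
    z = abs(m_mcp.z - wrist.z)
    return "wall" if y > z else "floor"


def hand_view(wrist, i_mcp, m_mcp, p_mcp) -> str:
    """Algorithm 3: front / sideways / back, based on the normal of the palm plane."""
    normal = np.cross(_vec(i_mcp) - _vec(wrist), _vec(p_mcp) - _vec(wrist))
    if hand_plane(wrist, m_mcp) == "wall":
        angle = math.degrees(math.atan2(normal[2], normal[0])) % 360
        return "front" if angle > 210 else ("sideways" if angle > 150 else "back")
    angle = math.degrees(math.atan2(normal[1], normal[0]))
    return "front" if angle > 0 else ("sideways" if angle > -60 else "back")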

Shape refers to the configuration of the fingers and thumb. This characteristic of the hand is the most
complex to analyze due to the vast array of possible shapes the human hand can form. The shape of a
hand is determined by the state of each finger and thumb, specifically whether they are straight, curved,
or bent, and their position relative to each other. Shape analysis can be accomplished by examining the
bend and rotation of each finger joint. More advanced models may also take into consideration the spread
between the fingers and other nuanced characteristics. 3D pose estimation can be used to extract these
features for a machine learning model, which can then classify the hand shape.
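To make the last point more concrete, the following is one possible sketch of extracting per-joint bend angles as hand shape features for a downstream classifier. The 21-landmark index layout follows MediaPipe's hand model, but the feature choice (three bend angles per finger) and the function names are ours.

import numpy as np

# MediaPipe hand landmark indices per finger: MCP, PIP, DIP, TIP (thumb: CMC, MCP, IP, TIP).
FINGERS = {
    "thumb": [1, 2, 3, 4],
    "index": [5, 6, 7, 8],
    "middle": [9, 10, 11, 12],
    "ring": [13, 14, 15, 16],
    "pinky": [17, 18, 19, 20],
}
WRIST = 0


def joint_angle(a: np.ndarray, b: np.ndarray, c: np.ndarray) -> float:
    """Bend angle (degrees) at joint b, formed by the segments b->a and b->c."""
    v1, v2 = a - b, c - b
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))


def hand_shape_features(landmarks: np.ndarray) -> np.ndarray:
    """15 joint bend angles (3 per finger) from a (21, 3) array of hand landmarks."""
    features = []
    for joints in FINGERS.values():
        chain = [WRIST] + joints
        for a, b, c in zip(chain, chain[1:], chain[2:]):
            features.append(joint_angle(landmarks[a], landmarks[b], landmarks[c]))
    return np.array(features)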

C.3 3D Hand Normalization


3D hand normalization is an attempt at standardizing the orientation and position of the hand, thereby
enabling models to effectively classify various hand shapes. The normalization process involves several
key steps, as illustrated below:

1. Pose Estimation Initially, the 3D pose of the hand is estimated from the hand image crop (Figure 8).

Figure 8: Pictures of six hands all performing the same hand shape (v-shape) taken from six different orientations.
Mediapipe fails at estimating the pose of the bottom-middle image.

2. 3D Rotation The pose is then rotated in 3D space such that the normal of the back of the hand aligns
with the Z-axis. As a result, the palm plane now resides within the XY plane (Figure 9).

Figure 9: Hand poses after 3D rotation. The scale difference between the hands demonstrates a limitation of the 3D
pose estimation system used.
3. 2D Orientation Subsequently, the pose is rotated in 2D such that the metacarpal bone of the middle
finger aligns with the Y-axis (Figure 10).

Figure 10: Hand poses after being rotated.

4. Scale The hand is scaled such that the metacarpal bone of the middle finger attains a constant length
(which we typically set to 200, Figure 11).

Figure 11: Hand poses after being scaled.

5. Translation Lastly, the wrist joint is translated to the origin of the coordinate system (0, 0, 0). Figure 12 demonstrates that, when the normalized hands are overlaid, they all produce the same shape, except for one outlier.
Figure 12: Normalized hand poses overlaid after being translated to the same position. The positions of the wrist and the metacarpal bone of the middle finger are fixed.

By conducting these normalization steps, a hand model can be standardized, reducing the complexity of
subsequent steps such as feature extraction and hand shape classification. This standardization simplifies
the recognition process and can contribute to improving the overall accuracy of the system.
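As a sketch of how these five steps could be chained for a (21, 3) array of MediaPipe hand landmarks, consider the following; the Rodrigues-style rotation helper, the use of the wrist/index-MCP/pinky-MCP plane for the normal, and the default bone length of 200 follow the description above, while the exact linear algebra is our own illustrative choice.

import numpy as np

WRIST, I_MCP, M_MCP, P_MCP = 0, 5, 9, 17  # MediaPipe hand landmark indices


def _rotation_to(v: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Rotation matrix mapping direction v onto direction target (Rodrigues' formula)."""
    v = v / np.linalg.norm(v)
    target = target / np.linalg.norm(target)
    axis = np.cross(v, target)
    s, c = np.linalg.norm(axis), float(np.dot(v, target))
    if s < 1e-8:  # already (anti-)aligned; skipped for simplicity in this sketch
        return np.eye(3)
    k = np.array([[0, -axis[2], axis[1]],
                  [axis[2], 0, -axis[0]],
                  [-axis[1], axis[0], 0]])
    return np.eye(3) + k + k @ k * ((1 - c) / s**2)


def normalize_hand(landmarks: np.ndarray, bone_length: float = 200.0) -> np.ndarray:
    hand = landmarks - landmarks[WRIST]  # step 5 (translation); doing it first commutes with rotation
    # Step 2: rotate so the normal of the hand plane aligns with the Z axis.
    normal = np.cross(hand[I_MCP], hand[P_MCP])
    hand = hand @ _rotation_to(normal, np.array([0.0, 0.0, 1.0])).T
    # Step 3: rotate around Z so the middle finger metacarpal aligns with the Y axis.
    angle = np.arctan2(hand[M_MCP, 0], hand[M_MCP, 1])
    c, s = np.cos(angle), np.sin(angle)
    hand = hand @ np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]).T
    # Step 4: scale so the middle finger metacarpal bone has a fixed length.
    return hand * (bone_length / (np.linalg.norm(hand[M_MCP]) + 1e-8))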

C.4 3D Hand Pose Evaluation


In order to assess the performance of our 3D hand pose estimation and normalization, we introduce two
metrics that gauge the consistency of the pose estimation across orientations and crops.
Our dataset is extracted from the SignWriting Hand Symbols Manual (Frost and Sutton, 2022), and
includes images of 261 different hand shapes, from 6 different angles. All images are of the same hand, of
an adult white man.

Multi Angle Consistency Error (MACE) evaluates the consistency of the pose estimation system
across the different orientations. We perform 3D hand normalization, and overlay the hands. The MACE
score is the average standard deviation of all pose landmarks, between all views. A high MACE score
indicates a problem in the pose estimation system’s ability to maintain consistency across different
orientations. This could adversely affect the model’s performance when analyzing hand shapes in sign
languages, as signs can significantly vary with hand rotation.

Figure 13: Visualizations of 10 hand shapes, each with 6 orientations, 3D-normalized and overlaid.

Figure 13 shows that our 3D normalization does work to some extent using Mediapipe. We can identify
differences across hand shapes, but still note high variance within each hand shape.
Crop Consistency Error (CCE) gauges the pose estimation system’s consistency across different crop
sizes. We do not perform 3D normalization, but still overlay all the estimated hands, shifting the wrist
point of each estimated hand to the origin (0, 0, 0). The CCE score is the average standard
deviation of all pose landmarks across crops. A high CCE score indicates that the pose estimation system
is sensitive to the size of the input crop, which is a significant drawback as the system should be invariant
to the size of the input image.
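Both metrics reduce to the same computation, namely the average per-landmark standard deviation over a set of estimates of the same hand. A minimal sketch is given below, where hands is assumed to be a (num_estimates, 21, 3) array of landmarks that have already been 3D-normalized (for MACE) or merely shifted so the wrist is at the origin (for CCE); the function name and the reuse of the normalize_hand sketch above are ours.

import numpy as np


def consistency_error(hands: np.ndarray) -> float:
    """Average standard deviation of every landmark coordinate across estimates.

    For MACE, `hands` stacks 3D-normalized poses of the same hand shape seen from
    different orientations; for CCE, it stacks wrist-centered (otherwise raw) poses
    of the same image estimated from different crop sizes.
    """
    return float(np.std(hands, axis=0).mean())


# Hypothetical usage, reusing the normalization sketch above:
# mace = consistency_error(np.stack([normalize_hand(h) for h in view_poses]))
# cce = consistency_error(np.stack([h - h[0] for h in crop_poses]))  # wrist (index 0) at origin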

Figure 14: Visualizations of 10 hand shapes, each with 48 crops overlaid.

Figure 14 shows that for some poses, Mediapipe is very resilient to differences in crop size (e.g., the first and last hand shapes). However, it is concerning that for other hand shapes it exhibits very high variance and possibly even produces wrong predictions.

C.5 Conclusion
Our normalization process appears to work reasonably well when applied to different views within the
same crop size. It succeeds in simplifying the hand shape, which in turn, can aid in improving the accuracy
of hand shape classification systems.
However, it is crucial to note that while this method may seem to perform well on a static image, its
consistency and reliability in a dynamic context, such as a video, may be quite different. In a video, the
crop size can change between frames, introducing additional complexity and variance. This dynamic
nature coupled with the inherently noisy nature of the estimation process can pose challenges for a model
that aims to consistently estimate hand shapes.
In light of these findings, it is clear that there is a need for the developers of 3D pose estimation
systems to consider these evaluation methods and strive to make their systems more robust to changes in
hand crops. The Multi Angle Consistency Error (MACE) and the Crop Consistency Error (CCE) can be
valuable tools in this regard.
MACE could potentially be incorporated as a loss function for 3D pose estimation, thereby driving
the model to maintain consistency across different orientations. Alternatively, MACE could be used
as an indicator to identify hand shapes that require more training data. It is apparent from our study
that the performance varies greatly across hand shapes and orientations, and this approach could help in
prioritizing the allocation of training resources.
Ultimately, the goal of improving 3D hand pose estimation is to enhance the ability to encode signed
languages accurately. The insights gathered from this study can guide future research and development
efforts in this direction, paving the way for more robust and reliable sign language technology.
The benchmark, metrics, and visualizations are available at https://github.com/sign-language-processing/3d-hands-benchmark/.
