Continuous Recognition of Teachers’ Hand Signals for Students with Attention Deficits
Figure 1. Landmarks of the MediaPipe BlazePose model [35].
Figure 2. Examples of the three kinds of hand signals. (a) Pointing to left. (b) Pointing to right. (c) Non-pointing.
Figure 3. Illustration of pointing signal detection. (a) Y_wrist ≥ Y_shoulder. (b) Y_wrist < Y_shoulder.
Figure 4. Flowchart of the recognition algorithm based on skeletal landmarks.
Figure 5. An example of the continuous recognition of hand signals. (a) D/L. (b) D_x/L. (c) D/L and D_x/L.
Figure 6. Confusion matrices for the video sequences. (a) Video 1. (b) Video 2. (c) Video 3. (d) Video 4. (e) Video 5.
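The wrist-versus-shoulder comparison illustrated in Figure 3 can be sketched as a simple rule. This is a minimal illustration, not the paper's exact algorithm: it assumes MediaPipe's normalized image coordinates (where y increases downward, so a raised wrist has a smaller y than its shoulder), and the mapping from the raised wrist to the pointing direction is likewise an assumption for illustration.

```python
def classify_hand_signal(lw_y, ls_y, rw_y, rs_y):
    """Rough three-way classification of a pointing signal from pose
    landmark y-coordinates (normalized, y grows downward as in MediaPipe).

    lw_y/ls_y: left wrist / left shoulder y; rw_y/rs_y: right side.
    A wrist counts as "raised" when it is above its own shoulder.
    """
    left_raised = lw_y < ls_y    # left wrist above left shoulder
    right_raised = rw_y < rs_y   # right wrist above right shoulder
    if left_raised and not right_raised:
        return "pointing to left"
    if right_raised and not left_raised:
        return "pointing to right"
    return "non-pointing"

print(classify_hand_signal(0.30, 0.45, 0.60, 0.45))  # left wrist raised
```

In practice the y-values would come from the wrist and shoulder landmarks (indices 15/16 and 11/12) of a BlazePose detection result.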
Abstract
1. Introduction
2. Proposed Method
3. Experimental Results
4. Discussion
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Hurwitz, S.; Watson, L.R. Joint attention revisited: Finding strengths among children with autism. Autism 2016, 20, 538–550. [Google Scholar] [CrossRef] [PubMed]
- Lai, Y.H.; Chang, Y.C.; Ma, Y.W.; Huang, S.Y.; Chao, H.C. Improvement of ADHD Behaviors with AI Perception Technology. In Proceedings of the International Cognitive Cities Conference, Kyoto, Japan, 3–6 September 2019. [Google Scholar]
- Al Hazmi, A.N.; Ahmad, A.C. Universal design for learning to support access to the general education curriculum for students with intellectual disabilities. World J. Educ. 2018, 8, 66–72. [Google Scholar] [CrossRef]
- Lidstone, D.E.; Mostofsky, S.H. Moving toward understanding autism: Visual-motor integration, imitation, and social skill development. Pediatr. Neurol. 2021, 122, 98–105. [Google Scholar] [CrossRef] [PubMed]
- Shkedy, G.; Shkedy, D.; Sandoval-Norton, A.H.; Fantaroni, G.; Castro, J.M.; Sahagun, N.; Christopher, D. Visual communication analysis (VCA): Implementing self-determination theory and research-based practices in special education classrooms. Cogent Psychol. 2021, 8, 1875549. [Google Scholar] [CrossRef]
- Baragash, R.S.; Al-Samarraie, H.; Alzahrani, A.I.; Alfarraj, O. Augmented reality in special education: A meta-analysis of single-subject design studies. Eur. J. Spec. Needs Educ. 2020, 35, 382–397. [Google Scholar] [CrossRef]
- Garzón, J.; Pavón, J.; Baldiris, S. Systematic review and meta-analysis of augmented reality in educational settings. Virtual Real. 2019, 23, 447–459. [Google Scholar] [CrossRef]
- Zhong, D.; Chen, L.; Feng, Y.; Song, R.; Huang, L.; Liu, J.; Zhang, L. Effects of virtual reality cognitive training in individuals with mild cognitive impairment: A systematic review and meta-analysis. Int. J. Geriatr. Psychiatry 2021, 36, 1829–1847. [Google Scholar] [CrossRef] [PubMed]
- Sam, C.; Naicker, N.; Rajkoomar, M. Meta-analysis of artificial intelligence works in ubiquitous learning environments and technologies. Int. J. Adv. Comput. Sci. Appl. 2020, 11, 603–613. [Google Scholar] [CrossRef]
- Cao, Z.; Simon, T.; Wei, S.E.; Sheikh, Y. Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
- Cao, Z.; Hidalgo, G.; Simon, T.; Wei, S.E.; Sheikh, Y. OpenPose: Realtime multi-person 2d pose estimation using part affinity fields. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 172–186. [Google Scholar] [CrossRef]
- Liu, T.; Chen, Z.; Wang, X. Automatic Instructional Pointing Gesture Recognition by Machine Learning in the Intelligent Learning Environment. In Proceedings of the 2019 4th International Conference on Distance Education and Learning, Shanghai, China, 24–27 May 2019. [Google Scholar]
- Wang, J.; Liu, T.; Wang, X. Human hand gesture recognition with convolutional neural networks for K-12 double-teachers instruction mode classroom. Infrared Phys. Tech. 2020, 111, 103464. [Google Scholar] [CrossRef]
- Hernández Correa, J.; Farsani, D.; Araya, R. An Application of Machine Learning and Image Processing to Automatically Detect Teachers’ Gestures. In Proceedings of the International Conference on Computational Collective Intelligence, Da Nang, Vietnam, 30 November–3 December 2020. [Google Scholar]
- Gu, Y.; Hu, J.; Zhou, Y.; Lu, L. Online Teaching Gestures Recognition Model Based on Deep Learning. In Proceedings of the 2020 International Conference on Networking and Network Applications, Haikou, China, 10–13 December 2020. [Google Scholar]
- Araya, R.; Sossa-Rivera, J. Automatic detection of gaze and body orientation in elementary school classrooms. Front. Robot. AI 2021, 8, 729832. [Google Scholar] [CrossRef] [PubMed]
- Yoon, H.Y.; Kang, S.; Kim, S. A non-verbal teaching behaviour analysis for improving pointing out gestures: The case of asynchronous video lecture analysis using deep learning. J. Comput. Assist. Learn. 2024, 40, 1006–1018. [Google Scholar] [CrossRef]
- Liu, H.; Yao, C.; Zhang, Y.; Ban, X. GestureTeach: A gesture guided online teaching interactive model. Comput. Animat. Virtual Worlds 2024, 35, e2218. [Google Scholar] [CrossRef]
- Chen, Z.; Feng, X.; Liu, T.; Wang, C.; Zhang, C. A Computer-Assisted Teaching System with Gesture Recognition Technology and Its Applications. In Proceedings of the International Conference on Digital Technology in Education, Taipei, Taiwan, 6–8 August 2017. [Google Scholar]
- Chiang, H.H.; Chen, W.M.; Chao, H.C.; Tsai, D.L. A virtual tutor movement learning system in eLearning. Multimed. Tools Appl. 2019, 78, 4835–4850. [Google Scholar] [CrossRef]
- Goto, T.; Sakurai, D.; Ooi, S. Proposal of Feedback System Based on Skeletal Analysis in Physical Education Classes. In Proceedings of the 4th International Conference on Education and Multimedia Technology, Kyoto, Japan, 19–22 July 2020. [Google Scholar]
- Amrutha, K.; Prabu, P.; Paulose, J. Human Body Pose Estimation and Applications. In Proceedings of the 2021 Innovations in Power and Advanced Computing Technologies, Kuala Lumpur, Malaysia, 27–29 November 2021. [Google Scholar]
- Farsani, D.; Lange, T.; Meaney, T. Gestures, systemic functional linguistics and mathematics education. Mind Cult. Act. 2022, 29, 75–95. [Google Scholar] [CrossRef]
- Kure, A.E.; Brevik, L.M.; Blikstad-Balas, M. Digital skills critical for education: Video analysis of students' technology use in Norwegian secondary English classrooms. J. Comput. Assist. Learn. 2023, 39, 269–285. [Google Scholar] [CrossRef]
- Kim, Y.; Soyata, T.; Behnagh, R.F. Towards emotionally aware AI smart classroom: Current issues and directions for engineering and education. IEEE Access 2018, 6, 5308–5331. [Google Scholar] [CrossRef]
- Swain, D.; Satapathy, S.; Acharya, B.; Shukla, M.; Gerogiannis, V.C.; Kanavos, A.; Giakovis, D. Deep learning models for yoga pose monitoring. Algorithms 2022, 15, 403. [Google Scholar] [CrossRef]
- Connie, T.; Aderinola, T.B.; Ong, T.S.; Goh, M.K.O.; Erfianto, B.; Purnama, B. Pose-based gait analysis for diagnosis of Parkinson’s disease. Algorithms 2022, 15, 474. [Google Scholar] [CrossRef]
- Gesnouin, J.; Pechberti, S.; Bresson, G.; Stanciulescu, B.; Moutarde, F. Predicting intentions of pedestrians from 2D skeletal pose sequences with a representation-focused multi-branch deep learning network. Algorithms 2020, 13, 331. [Google Scholar] [CrossRef]
- Martinez, J.; Hossain, R.; Romero, J.; Little, J.J. A Simple Yet Effective Baseline for 3D Human Pose Estimation. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017. [Google Scholar]
- Luvizon, D.C.; Picard, D.; Tabia, H. 2D/3D Pose Estimation and Action Recognition using Multitask Deep Learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
- Rogez, G.; Weinzaepfel, P.; Schmid, C. LCR-Net++: Multi-person 2D and 3D pose detection in natural images. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 1146–1161. [Google Scholar] [CrossRef] [PubMed]
- Li, W.; Liu, H.; Ding, R.; Liu, M.; Wang, P.; Yang, W. Exploiting temporal contexts with strided transformer for 3D human pose estimation. IEEE Trans. Multimed. 2023, 25, 1282–1293. [Google Scholar] [CrossRef]
- Zheng, C.; Wu, W.; Chen, C.; Yang, T.; Zhu, S.; Shen, J.; Kehtarnavaz, N.; Shah, M. Deep Learning-based human pose estimation: A survey. ACM Comput. Surv. 2023, 56, 11. [Google Scholar] [CrossRef]
- Bazarevsky, V.; Grishchenko, I.; Raveendran, K.; Zhu, T.; Zhang, F.; Grundmann, M. BlazePose: On-Device Real-Time Body Pose Tracking. In Proceedings of the CVPR Workshop on Computer Vision for Augmented and Virtual Reality, Seattle, WA, USA, 15 June 2020. [Google Scholar]
- MediaPipe. Available online: https://developers.google.com/mediapipe/solutions/vision/pose_landmarker/ (accessed on 2 April 2024).
- Zhang, F.; Bazarevsky, V.; Vakunov, A.; Tkachenka, A.; Sung, G.; Chang, C.L.; Grundmann, M. MediaPipe Hands: On-device Real-Time Hand Tracking. In Proceedings of the CVPR Workshop on Computer Vision for Augmented and Virtual Reality, Seattle, WA, USA, 15 June 2020. [Google Scholar]
- Lugaresi, C.; Tang, J.; Nash, H.; McClanahan, C.; Uboweja, E.; Hays, M.; Zhang, F.; Chang, C.L.; Yong, M.; Lee, J.; et al. MediaPipe: A Framework for Perceiving and Processing Reality. In Proceedings of the CVPR Workshop on Computer Vision for Augmented and Virtual Reality, Long Beach, CA, USA, 17 June 2019. [Google Scholar]
- Grandini, M.; Bagli, E.; Visani, G. Metrics for Multi-Class Classification: An Overview. arXiv 2020, arXiv:2008.05756. [Google Scholar]
- Yu, M.; Kim, N.; Jung, Y.; Lee, S. A frame detection method for real-time hand gesture recognition systems using CW-radar. Sensors 2020, 20, 2321. [Google Scholar] [CrossRef] [PubMed]
- Choi, J.W.; Ryu, S.J.; Kim, J.H. Short-range radar based real-time hand gesture recognition using LSTM encoder. IEEE Access 2019, 7, 33610–33618. [Google Scholar] [CrossRef]
- Ryumin, D.; Ivanko, D.; Ryumina, E. Audio-visual speech and gesture recognition by sensors of mobile devices. Sensors 2023, 23, 2284. [Google Scholar] [CrossRef] [PubMed]
- Sidiya, K.; Alzanbagi, N.; Bensenouci, A. Google Glass and Apple Watch: Will They Become Our Learning Tools? In Proceedings of the 12th Learning and Technology Conference, Jeddah, Saudi Arabia, 12–13 April 2015. [Google Scholar]
- Lai, M.C.; Chiang, M.S.; Shih, C.T.; Shih, C.H. Applying a vibration reminder to ameliorate the hyperactive behavior of students with attention deficit hyperactivity disorder in class. J. Dev. Phys. Disabil. 2018, 30, 835–844. [Google Scholar] [CrossRef]
- Zarraonandia, T.; Díaz, P.; Montero, Á.; Aedo, I.; Onorati, T. Using a google glass-based classroom feedback system to improve students to teacher communication. IEEE Access 2019, 7, 16837–16846. [Google Scholar] [CrossRef]
- Ayearst, L.E.; Brancaccio, R.; Weiss, M.D. An open-label study of a wearable device targeting ADHD, executive function, and academic performance. Brain Sci. 2023, 13, 1728. [Google Scholar] [CrossRef]
- Whitmore, N.; Chan, S.; Zhang, J.; Chwalek, P.; Chin, S.; Maes, P. Improving Attention Using Wearables via Haptic and Multimodal Rhythmic Stimuli. In Proceedings of the CHI Conference on Human Factors in Computing Systems, Honolulu, HI, USA, 11–16 May 2024. [Google Scholar]
Number | Name | Number | Name | Number | Name |
---|---|---|---|---|---|
0 | Nose | 11 | Left shoulder | 22 | Right thumb |
1 | Left eye inner | 12 | Right shoulder | 23 | Left hip |
2 | Left eye | 13 | Left elbow | 24 | Right hip |
3 | Left eye outer | 14 | Right elbow | 25 | Left knee |
4 | Right eye inner | 15 | Left wrist | 26 | Right knee |
5 | Right eye | 16 | Right wrist | 27 | Left ankle |
6 | Right eye outer | 17 | Left pinky | 28 | Right ankle |
7 | Left ear | 18 | Right pinky | 29 | Left heel |
8 | Right ear | 19 | Left index | 30 | Right heel |
9 | Mouth left | 20 | Right index | 31 | Left foot index |
10 | Mouth right | 21 | Left thumb | 32 | Right foot index |
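The table above can be encoded as a simple index-to-name list for lookups (a convenience sketch; in application code one would normally use the `PoseLandmark` enum that ships with the `mediapipe` package instead):

```python
# Name lookup for the 33 BlazePose pose landmarks listed in the table above.
BLAZEPOSE_LANDMARKS = [
    "nose", "left eye inner", "left eye", "left eye outer",
    "right eye inner", "right eye", "right eye outer",
    "left ear", "right ear", "mouth left", "mouth right",
    "left shoulder", "right shoulder", "left elbow", "right elbow",
    "left wrist", "right wrist", "left pinky", "right pinky",
    "left index", "right index", "left thumb", "right thumb",
    "left hip", "right hip", "left knee", "right knee",
    "left ankle", "right ankle", "left heel", "right heel",
    "left foot index", "right foot index",
]

print(BLAZEPOSE_LANDMARKS[16])  # right wrist
```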
Video | Recognized Signal | Accuracy | Sensitivity | Specificity | Precision | F1 Score
---|---|---|---|---|---|---
1 | Pointing to left | 90.16% | 100.00% | 98.10% | 89.47% | 94.44%
 | Pointing to right | | 97.67% | 88.61% | 82.35% | 89.36%
 | Non-pointing | | 82.26% | 98.33% | 98.08% | 89.48%
 | Macro average | - | 93.31% | 95.01% | 89.97% | 91.09%
2 | Pointing to left | 91.30% | 100.00% | 97.64% | 78.57% | 88.00%
 | Pointing to right | | 100.00% | 89.29% | 85.71% | 92.31%
 | Non-pointing | | 83.56% | 100.00% | 100.00% | 91.04%
 | Macro average | - | 94.52% | 95.64% | 88.09% | 90.45%
3 | Pointing to left | 83.66% | 100.00% | 93.79% | 69.44% | 81.96%
 | Pointing to right | | 89.19% | 89.06% | 82.50% | 85.71%
 | Non-pointing | | 75.73% | 91.92% | 90.70% | 82.54%
 | Macro average | - | 88.31% | 91.59% | 80.88% | 83.40%
4 | Pointing to left | 90.18% | 92.93% | 96.00% | 94.85% | 93.88%
 | Pointing to right | | 95.83% | 95.50% | 71.88% | 82.14%
 | Non-pointing | | 86.14% | 93.50% | 91.58% | 88.78%
 | Macro average | - | 91.63% | 95.00% | 86.10% | 88.27%
5 | Pointing to left | 86.24% | 85.11% | 93.55% | 90.91% | 87.91%
 | Pointing to right | | 92.68% | 97.18% | 88.37% | 90.47%
 | Non-pointing | | 84.34% | 87.41% | 80.46% | 82.35%
 | Macro average | - | 87.38% | 92.71% | 86.58% | 86.91%
Average | Pointing to left | 88.31% | 95.61% | 95.82% | 84.65% | 89.24%
 | Pointing to right | | 95.07% | 91.93% | 82.16% | 88.00%
 | Non-pointing | | 82.41% | 94.23% | 92.16% | 86.84%
 | Macro average | - | 91.03% | 93.99% | 86.32% | 88.03%
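The per-class columns in the table above follow the usual one-vs-rest definitions, with the macro average taken as the unweighted mean over the three classes. A minimal sketch of that computation (the confusion matrix values here are hypothetical, chosen only to illustrate the arithmetic):

```python
def per_class_metrics(cm):
    """One-vs-rest sensitivity, specificity, precision, and F1 per class,
    from a square confusion matrix (rows: true class, columns: predicted)."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    out = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                       # missed class-k samples
        fp = sum(cm[i][k] for i in range(n)) - tp  # wrongly assigned to k
        tn = total - tp - fn - fp
        sens = tp / (tp + fn)
        spec = tn / (tn + fp)
        prec = tp / (tp + fp)
        f1 = 2 * prec * sens / (prec + sens)
        out.append({"sensitivity": sens, "specificity": spec,
                    "precision": prec, "f1": f1})
    return out

def macro_average(metrics, key):
    """Unweighted mean over classes (the 'Macro average' rows)."""
    return sum(m[key] for m in metrics) / len(metrics)

# Hypothetical 3-class matrix: pointing left / pointing right / non-pointing.
cm = [[90, 0, 10],
      [0, 85, 15],
      [5, 5, 90]]
m = per_class_metrics(cm)
print(round(macro_average(m, "sensitivity"), 4))
```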
Method | Classification Task | Classes | Keypoint Extraction | Classification | Accuracy |
---|---|---|---|---|---|
[12] | Pointing or not | 2 | OpenPose | Non-linear neural network | 90% |
[13] | Pointing or not | 2 | Convolutional neural network | Convolutional neural network | over 90% |
[14] | Gesticulating or not | 2 | OpenPose | Machine learning | 54–78%
Proposed method | Pointing to left, pointing to right, non-pointing | 3 | MediaPipe | Simple rules | 88.31%
Disclaimer/Publisher's Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Chen, I.D.S.; Yang, C.-M.; Wu, S.-S.; Yang, C.-K.; Chen, M.-J.; Yeh, C.-H.; Lin, Y.-H. Continuous Recognition of Teachers’ Hand Signals for Students with Attention Deficits. Algorithms 2024, 17, 300. https://doi.org/10.3390/a17070300