Learning the Relative Dynamic Features for Word-Level Lipreading
Figure 1. Overview of the Lip Slow-Fast (LSF) model. The proposed model is a two-stream network. First, the lip-region sequence is sampled at different intervals and fed into two CNN streams: the upper one is the slow channel and the lower one is the fast channel. The gray convolution is a shallow 3D CNN; the others are 2D CNNs. The output feature vectors are then passed to the back-end model, a 3-layer Bidirectional Gated Recurrent Unit (BiGRU). In early fusion, the two feature vectors are concatenated first and sequence modeling is performed afterwards; in late fusion, the two feature vectors are sent to the back-end model separately and concatenated at the end. Finally, after a fully connected layer, the model predicts the class of the input lip sequence.
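To make the data flow concrete, here is a minimal PyTorch sketch of the pipeline the caption describes; it is not the authors' implementation. The per-stream encoders are single-layer placeholders for the 3D-head + 2D ResNet trunks, and the feature width (512), BiGRU hidden size (1024), and slow-stream stride (3) are assumptions for illustration. Early fusion is shown; late fusion would instead run a BiGRU per stream and concatenate afterwards.

```python
import torch
import torch.nn as nn

class LipSlowFastSketch(nn.Module):
    """Hedged sketch of the two-stream LSF idea, not the authors' code."""
    def __init__(self, feat_dim=512, num_classes=500, slow_stride=3):
        super().__init__()
        self.slow_stride = slow_stride          # slow stream keeps every 3rd frame (assumed)
        self.slow_cnn = self._make_stream(feat_dim)
        self.fast_cnn = self._make_stream(feat_dim)
        self.backend = nn.GRU(2 * feat_dim, 1024, num_layers=3,
                              bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * 1024, num_classes)

    def _make_stream(self, feat_dim):
        # placeholder per-frame encoder; the paper uses a 3D conv head + 2D ResNet trunk
        return nn.Sequential(nn.Conv2d(1, feat_dim, 7, stride=4, padding=3),
                             nn.AdaptiveAvgPool2d(1), nn.Flatten())

    def _encode(self, cnn, x):
        B, T = x.shape[:2]
        return cnn(x.flatten(0, 1)).view(B, T, -1)   # per-frame features: (B, T, C)

    def forward(self, x):                            # x: (B, T, 1, H, W) grayscale lips
        B, T = x.shape[:2]
        fast = self._encode(self.fast_cnn, x)
        slow = self._encode(self.slow_cnn, x[:, ::self.slow_stride])
        # early fusion: upsample slow features back to T steps, concatenate per frame
        slow = nn.functional.interpolate(slow.transpose(1, 2), size=T,
                                         mode='nearest').transpose(1, 2)
        seq, _ = self.backend(torch.cat([fast, slow], dim=-1))
        return self.fc(seq.mean(dim=1))              # average over time, then classify
```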
Figure 2. The working process of the Temporal Shift Module (TSM). (a) shows part of a feature map along the channel and time dimensions; (b) shows the working process of TSM. First, the feature map is divided into two parts, and only the channels in the first half are shifted while the second half remains unchanged. Within the shifted part, the first half of the channels moves forward in time and the second half moves backward. The emptied positions are padded with zeros, and the parts shifted out of range are discarded.
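The shift itself is a few lines of indexing. Below is a sketch that follows the caption and the original TSM formulation [17]: half of the channels move, split evenly between a forward and a backward shift, with zero padding at the sequence ends. Treating exactly half the channels as the moving part is our reading of the caption.

```python
import torch

def temporal_shift(x: torch.Tensor) -> torch.Tensor:
    """x: (B, T, C, H, W) -> same shape, with half the channels shifted in time."""
    B, T, C, H, W = x.shape
    q = C // 4
    out = torch.zeros_like(x)                     # zero tensor gives the "0" padding
    out[:, 1:, :q] = x[:, :-1, :q]                # 1st quarter of channels: shift forward
    out[:, :-1, q:2 * q] = x[:, 1:, q:2 * q]      # 2nd quarter: shift backward
    out[:, :, 2 * q:] = x[:, :, 2 * q:]           # remaining half: left unchanged
    return out                                    # frames shifted out of range are discarded
```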
Figure 3. The implementation of the residual block in the proposed front-end. In each residual block, the input first passes through TSM to obtain short-term temporal information, then through a conventional convolution path composed of two Conv-BatchNorm-ReLU (CBR) stages, and finally through a Squeeze-and-Excitation (SE) module that reweights the channels.
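Putting Figure 3 together, here is a hedged PyTorch sketch of the residual block, reusing `temporal_shift` from the sketch above: TSM first, two CBR stages, SE channel reweighting, then the residual addition. The layer widths and the SE reduction ratio of 16 are assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class SETSMBlock(nn.Module):
    """Sketch of a TSM + CBR + SE residual block; sizes are illustrative."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.cbr1 = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1, bias=False),
                                  nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.cbr2 = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1, bias=False),
                                  nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.se = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                nn.Conv2d(channels, channels // reduction, 1),
                                nn.ReLU(inplace=True),
                                nn.Conv2d(channels // reduction, channels, 1),
                                nn.Sigmoid())

    def forward(self, x):                            # x: (B, T, C, H, W)
        B, T, C, H, W = x.shape
        y = temporal_shift(x).flatten(0, 1)          # TSM, then fold time into the batch
        y = self.cbr2(self.cbr1(y))                  # two Conv-BN-ReLU stages
        y = y * self.se(y)                           # SE channel reweighting
        return x + y.view(B, T, C, H, W)             # residual connection
```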
Figure 4. A sample of “DEATH” in LRW. The range of obvious lip-motion change spans frames 3 to 16; from frame 17 onward, the change in the speaker’s lip shape is not obvious.
Figure 5. Sparse difference sampling. Given a sequence of 29 frames, we divide it into seven segments of four frames each. Each segment is then subtracted from the next one, e.g., segment 1 from segment 2, segment 2 from segment 3, and so on. Finally, the difference segments are spliced along the temporal dimension to give a 24-frame result.
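This procedure maps directly onto tensor operations. A short sketch under the caption's numbers; dropping the 29th frame so that 7 × 4 = 28 frames divide evenly is our assumption about how the leftover frame is handled:

```python
import torch

def sparse_difference_sampling(frames: torch.Tensor) -> torch.Tensor:
    """frames: (29, H, W) -> (24, H, W) of inter-segment differences."""
    segments = frames[:28].view(7, 4, *frames.shape[1:])   # 7 segments of 4 frames
    diffs = segments[1:] - segments[:-1]                   # segment k+1 minus segment k
    return diffs.flatten(0, 1)                             # splice along time: 6 x 4 = 24 frames
```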
Figure 6. An example sample during data processing. (a) is part of the original data in the Lipreading in the Wild (LRW) dataset. (b) is the result of using the Dlib toolkit to obtain facial landmarks; the green rectangle is the face region and the blue points are the facial landmarks. (c) is the result of cropping and lip alignment. (d,e) are the results of mix-up and cutout during training.
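For steps (b) and (c) of Figure 6, the Dlib toolkit [21] provides the face detector and 68-point landmark predictor. Below is a sketch of a per-frame lip crop; the predictor file name is the standard public 68-landmark model, and the mouth indices 48–67, crop size, and centering heuristic are conventional choices rather than the authors' exact settings:

```python
from typing import Optional

import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
# standard public 68-landmark model file, downloaded separately
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def crop_lips(frame_bgr: np.ndarray, size: int = 96) -> Optional[np.ndarray]:
    """Detect the face, locate mouth landmarks 48-67, crop a square lip region."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    pts = np.array([(shape.part(i).x, shape.part(i).y) for i in range(48, 68)])
    cx, cy = pts.mean(axis=0).astype(int)        # lip center from mouth landmarks
    half = size // 2                             # no bounds checking, for brevity
    return gray[cy - half:cy + half, cx - half:cx + half]
```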
Figure 7. Per-class accuracy statistics on LRW. The blue bars show the baseline model used for comparison; the orange bars show LSF.
Figure 8. Per-class accuracy statistics on LRW-1000. The blue bars show the baseline model used for comparison; the orange bars show LSF.
Abstract
1. Introduction
- Firstly, a new front-end model with two effective CNN streams and a lateral connection is proposed to capture the relative dynamic features of lip motion.
- Secondly, for each component of the front-end model, we explored a more effective convolution structure and achieved an improvement of about 8%.
- Thirdly, given the short duration of word-level lipreading samples, we verified the impact of different sampling methods on the extraction of lip-motion information.
- Finally, we discussed and analyzed the methods for fusing the two-stream front-end model with the back-end model.
2. Methods
2.1. Front-End Model
2.2. Different Structures of Front-End
2.3. Back-End Model
2.4. Sampling Methods
2.4.1. Interval Sampling
2.4.2. Sparse Difference Sampling
2.5. Fusion Methods
2.5.1. Early Fusion
2.5.2. Late Fusion
3. Results
3.1. Datasets
3.2. Implementation Details
3.2.1. Data Processing
3.2.2. Training Details
3.3. Ablation Experiment
3.3.1. The Convolution Structure of Front-End
3.3.2. Sampling Methods
3.3.3. Fusion Methods
3.3.4. Comparison with Related Work
3.3.5. Statistical Results
4. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Luo, M.; Yang, S.; Chen, X. Synchronous bidirectional learning for multilingual lip reading. arXiv 2020, arXiv:2005.03846.
- Assael, Y.M.; Shillingford, B.; Whiteson, S. LipNet: End-to-end sentence-level lipreading. arXiv 2016, arXiv:1611.01599.
- Chung, J.S.; Zisserman, A. Lip reading in the wild. In Proceedings of the Asian Conference on Computer Vision, Taipei, Taiwan, 20–24 November 2016; Springer: Cham, Switzerland, 2016; pp. 87–103.
- Stafylakis, T.; Tzimiropoulos, G. Combining residual networks with LSTMs for lipreading. arXiv 2017, arXiv:1703.04105.
- Martinez, B.; Ma, P.; Petridis, S.; Pantic, M. Lipreading using temporal convolutional networks. In Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 6319–6323.
- Hao, M.; Mamut, M.; Yadikar, N.; Aysa, A.; Ubul, K. How to use time information effectively? Combining with time shift module for lipreading. In Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 7988–7992.
- Stafylakis, T.; Khan, M.H.; Tzimiropoulos, G. Pushing the boundaries of audiovisual word recognition using residual networks and LSTMs. Comput. Vis. Image Underst. 2018, 176, 22–32.
- Weng, X.; Kitani, K. Learning spatio-temporal features with two-stream deep 3D CNNs for lipreading. arXiv 2019, arXiv:1905.02540.
- Voutos, Y.; Drakopoulos, G.; Chrysovitsiotis, G. Multimodal lip-reading for tracheostomy patients in the Greek language. Computers 2022, 11, 34.
- Kumar, L.A.; Renuka, D.K.; Rose, S.L. Deep learning based assistive technology on audio visual speech recognition for hearing impaired. Int. J. Cogn. Comput. Eng. 2022, 3.
- Kim, M.; Hong, J.; Park, S.J.; Ro, Y.M. Multi-modality associative bridging through memory: Speech sound recollected from face video. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 296–306.
- Kim, M.; Yeo, J.H.; Ro, Y.M. Distinguishing homophenes using multi-head visual-audio memory for lip reading. In Proceedings of the 36th AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 22 February–1 March 2022.
- Yang, C.C.; Fan, W.C.; Yang, C.F. Cross-modal mutual learning for audio-visual speech recognition and manipulation. In Proceedings of the 36th AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 22 February–1 March 2022; Available online: https://www.aaai.org/AAAI22Papers/AAAI-6163.YangC.pdf (accessed on 25 March 2022).
- Feichtenhofer, C.; Fan, H.; Malik, J. SlowFast networks for video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 6202–6211.
- Yang, S.; Zhang, Y.; Feng, D.; Yang, M.; Wang, C.; Xiao, J.; Chen, X. LRW-1000: A naturally-distributed large-scale benchmark for lip reading in the wild. In Proceedings of the 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019), Lille, France, 14–18 May 2019; pp. 1–8.
- Kim, D.; Lan, T.; Zou, C.; Xu, N.; Plummer, B.A.; Sclaroff, S.; Medioni, G. MILA: Multi-task learning from videos via efficient inter-frame attention. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 2219–2229.
- Lin, J.; Gan, C.; Han, S. TSM: Temporal shift module for efficient video understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 7083–7093.
- Wang, L.; Tong, Z.; Ji, B.; Wu, G. TDN: Temporal difference networks for efficient action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 1895–1904.
- Wang, L.; Xiong, Y.; Wang, Z.; Qiao, Y.; Lin, D.; Tang, X.; Van Gool, L. Temporal segment networks: Towards good practices for deep action recognition. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland, 2016; pp. 20–36.
- Zhang, Y.; Yang, S.; Xiao, J.; Shan, S.; Chen, X. Can we read speech beyond the lips? Rethinking RoI selection for deep visual speech recognition. In Proceedings of the 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), Buenos Aires, Argentina, 16–20 November 2020; pp. 356–363.
- King, D.E. Dlib-ml: A machine learning toolkit. J. Mach. Learn. Res. 2009, 10, 1755–1758.
- Wang, C. Multi-grained spatio-temporal modeling for lip-reading. arXiv 2019, arXiv:1908.11618.
- Zhao, X.; Yang, S.; Shan, S.; Chen, X. Mutual information maximization for effective lip reading. In Proceedings of the 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), Buenos Aires, Argentina, 16–20 November 2020; pp. 420–427.
- Xiao, J.; Yang, S.; Zhang, Y.; Shan, S.; Chen, X. Deformation flow based two-stream network for lip reading. In Proceedings of the 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), Buenos Aires, Argentina, 16–20 November 2020; pp. 364–370.
- Wiriyathammabhum, P. SpotFast networks with memory augmented lateral transformers for lipreading. In Proceedings of the International Conference on Neural Information Processing; Springer: Cham, Switzerland, 2020; pp. 554–561.
- Feng, D.; Yang, S.; Shan, S. Learn an effective lip reading model without pains. arXiv 2020, arXiv:2011.07557.
| Component | Full 3D LSF | 3D + 2D LSF | Full 2D LSF |
|---|---|---|---|
| Convolution Head | 3D Conv | 3D Conv | 2D Conv |
| Convolution Layer | 3D Conv | 2D Conv | 2D Conv |
| Lateral Connection | 3D Conv | 2D Conv | 2D Conv |

| | LRW [3] | LRW-1000 [15] |
|---|---|---|
| Source | BBC | CCTV |
| Language | English | Chinese |
| Level | Word | Word |
| Speakers | More than 1000 | More than 2000 |
| Classes | 500 | 1000 |
| Resolution | 256 × 256 | Multi |
| Head Angle | Multi | Multi |
| Background | Multi | Multi |
| Duration | 1.16 s | Multi |
| Total Samples | 538,766 | 718,018 |

| Module | Accuracy (Full 2D ResNet-18) | Accuracy (3D ResNet-18) |
|---|---|---|
| Baseline | 80.34% | 83.14% |
| +TSM | 83.83% | 83.52% |
| +SE-TSM | 84.13% | 83.60% |

| Augmentation | Accuracy (Full 2D ResNet-18) | Accuracy (3D + 2D ResNet-18) |
|---|---|---|
| Cutout | 84.13% | 83.60% |
| MixUp | 80.92% | 84.14% |
| CutMix | 75.06% | 79.11% |

| Model Structure | Accuracy |
|---|---|
| Full 3D LSF | 80.66% |
| Full 2D LSF | 88.42% |
| 3D + 2D LSF | 88.52% |

| Sampling Method | Sampling Number (Frames) | Accuracy |
|---|---|---|
| Interval Sampling | 1–2 | 88.47% |
| Interval Sampling | 1–3 | 88.52% |
| Interval Sampling | 1–6 | 87.46% |
| Interval Sampling | 2–6 | 88.14% |
| Interval Sampling | 1–12 | 87.06% |
| Difference Sampling | 4 | 84.37% |

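We read the interval-sampling entries above as pairing a fast-stream interval with a slow-stream interval, e.g., "1–3" keeps every frame in the fast stream and every third frame in the slow stream; this interpretation is an assumption, since the notation is not expanded in this excerpt. Under that reading, interval sampling is plain strided indexing:

```python
import torch

def interval_sample(clip: torch.Tensor, fast: int = 1, slow: int = 3):
    """clip: (T, C, H, W) lip sequence -> (fast stream, slow stream) views."""
    return clip[::fast], clip[::slow]

# e.g., a 29-frame clip under the best "1-3" setting: 29 and 10 frames
fast_stream, slow_stream = interval_sample(torch.randn(29, 1, 88, 88))
```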
| Fusion Method | Accuracy (Full 2D LSF) | Accuracy (3D + 2D LSF) |
|---|---|---|
| Early Fusion | 88.42% | 88.52% |
| Late Fusion | 86.64% | 87.88% |

| Year | Method | LRW Accuracy | LRW-1000 Accuracy |
|---|---|---|---|
| 2019 | Multi-Grained [22] | 83.30% | 36.90% |
| 2019 | I3D [8] | 84.11% | - |
| 2020 | GLMIN [23] | 84.41% | 38.79% |
| 2020 | MS-TCN [5] | 85.30% | 41.40% |
| 2020 | DFN [24] | 84.10% | 41.90% |
| 2020 | CBAM [20] | 85.02% | 45.24% |
| 2020 | SpotFast + Transformer [25] | 84.40% | - |
| 2021 | TSM [6] | 86.23% | 44.60% |
| 2021 | BiGRU + MEM [11] | 85.40% ¹ | 50.82% ¹ |
| 2021 | SE-ResNet [26] | 88.40% | 55.70% |
| 2022 | Yang et al. [13] | 88.50% ¹ | 50.50% ¹ |
| 2022 | MS-TCN + MVM [12] | 88.50% ¹ | 53.82% ¹ |
| 2022 | Ours (Full 2D LSF) | 88.42% | 57.70% |
| 2022 | Ours (3D + 2D LSF) | 88.52% | 58.17% |

| Label | Acc | Label | Acc | Label | Acc | Label | Acc | Label | Acc |
|---|---|---|---|---|---|---|---|---|---|
| ABSOLUTELY | 100.0% | ACCUSED | 100.0% | AGREEMENT | 100.0% | ALLEGATIONS | 100.0% | BEFORE | 100.0% |
| BUSINESSES | 100.0% | CAMERON | 100.0% | EVERYBODY | 100.0% | EVIDENCE | 100.0% | EXAMPLE | 100.0% |
| FAMILY | 100.0% | FOLLOWING | 100.0% | INFLATION | 100.0% | INFORMATION | 100.0% | INQUIRY | 100.0% |
| LEADERSHIP | 100.0% | MILITARY | 100.0% | OBAMA | 100.0% | OFFICIALS | 100.0% | OPERATION | 100.0% |
| PARLIAMENT | 100.0% | PERHAPS | 100.0% | POSSIBLE | 100.0% | POTENTIAL | 100.0% | PRIME | 100.0% |
| PROVIDE | 100.0% | REFERENDUM | 100.0% | RESPONSE | 100.0% | SCOTLAND | 100.0% | SERVICE | 100.0% |
| SIGNIFICANT | 100.0% | TEMPERATURES | 100.0% | THEMSELVES | 100.0% | WEAPONS | 100.0% | WELFARE | 100.0% |
| WESTMINSTER | 100.0% | WOMEN | 100.0% | MEMBERS | 100.0% | PEOPLE | 98.0% | POLITICIANS | 98.0% |
| DIFFICULT | 97.8% | COMMUNITY | 97.8% | CUSTOMERS | 97.7% | EVENING | 97.7% | ECONOMY | 97.7% |
| EDUCATION | 97.7% | FINANCIAL | 97.6% | CHILDREN | 97.6% | CHILDREN | 97.6% | REMEMBER | 97.5% |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).