Abstract
As text-generative models produce increasingly long answers, we tackle the problem of synthesizing long-form text in digital ink. We show that commonly used models for this task fail to generalize to long-form data, and demonstrate how this problem can be solved by augmenting the training data, changing the model architecture, and changing the inference procedure. These methods use a contrastive learning technique, are tailored specifically to the handwriting domain, and can be applied to any encoder-decoder model that works with digital ink. We demonstrate that our method halves the character error rate on long-form English data compared to a baseline RNN, and reduces it by 16% compared to the previous approach addressing the same problem. We show that all three parts of the method improve the recognizability of the generated inks. In addition, we evaluate the synthesized data in a human study and find that people perceive most of the generated data as real.
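To make the three ingredients named above concrete, the following is a minimal sketch, not the authors' code: it illustrates training-data augmentation by concatenating short ink samples into long ones, split generation at inference time, and a simplified SimCLR-style contrastive loss of the kind used for style encoding. The (x, y, pen_up) ink representation, the `synthesize(text, style)` model interface, and the names `gap`, `max_words`, and `ntxent_loss` are assumptions made for illustration only.

```python
import numpy as np

def concat_inks(inks, gap=20.0):
    """Join ink samples left-to-right. Each ink is an (N, 3) array of
    (x, y, pen_up) points; the x-offset and gap are illustrative choices.
    Used here to build long-form training examples from short ones."""
    parts, x_offset = [], 0.0
    for ink in inks:
        shifted = ink.copy()
        shifted[:, 0] += x_offset          # shift strokes to the right
        parts.append(shifted)
        x_offset = shifted[:, 0].max() + gap
    return np.concatenate(parts, axis=0)

def split_generate(synthesize, text, style, max_words=4):
    """Split long text into short chunks, synthesize each chunk with the
    same style embedding, and stitch the results into one long ink."""
    words = text.split()
    chunks = [" ".join(words[i:i + max_words])
              for i in range(0, len(words), max_words)]
    return concat_inks([synthesize(chunk, style) for chunk in chunks])

def ntxent_loss(z1, z2, tau=0.1):
    """Simplified NT-Xent contrastive loss between two batches of style
    embeddings (e.g. two snippets of the same writer's ink as a positive
    pair); matching rows are positives, all other rows negatives."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau               # (B, B) cosine similarities
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(z1))
    return -log_probs[idx, idx].mean()     # maximize the diagonal entries
```

Because every chunk in `split_generate` is conditioned on the same style embedding, the stitched output remains stylistically consistent even though no single decoding step ever exceeds the short sequence lengths the model saw during training.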
A. Timofeev and A. Afonin—Work done as student researchers at Google Research, Zürich, Switzerland.
A. Timofeev and A. Fadeeva—These authors contributed equally to this work and share first authorship.
Notes
1. A notebook to test model inference is available here: https://colab.research.google.com/drive/1SB_vyDcsdSq1CtE9IOD9opBR9IDgG0ly.
2. A notebook with test sets and model inference is available here: https://colab.research.google.com/drive/1SB_vyDcsdSq1CtE9IOD9opBR9IDgG0ly.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Timofeev, A., Fadeeva, A., Afonin, A., Musat, C., Maksai, A. (2023). DSS: Synthesizing Long Digital Ink Using Data Augmentation, Style Encoding and Split Generation. In: Fink, G.A., Jain, R., Kise, K., Zanibbi, R. (eds) Document Analysis and Recognition - ICDAR 2023. ICDAR 2023. Lecture Notes in Computer Science, vol 14190. Springer, Cham. https://doi.org/10.1007/978-3-031-41685-9_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-41684-2
Online ISBN: 978-3-031-41685-9