
DSS: Synthesizing Long Digital Ink Using Data Augmentation, Style Encoding and Split Generation

  • Conference paper
Document Analysis and Recognition - ICDAR 2023 (ICDAR 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14190)


Abstract

As text generative models produce increasingly long answers, we tackle the problem of synthesizing long text in digital ink. We show that the models commonly used for this task fail to generalize to long-form data, and that this problem can be solved by augmenting the training data, changing the model architecture, and changing the inference procedure. These methods use a contrastive learning technique and are tailored specifically to the handwriting domain. They can be applied to any encoder-decoder model that works with digital ink. We demonstrate that our method halves the character error rate on long-form English data compared to a baseline RNN, and reduces it by 16% compared to the previous approach that addresses the same problem. We show that all three parts of the method improve the recognizability of the generated inks. In addition, we evaluate the synthesized data in a human study and find that people perceive most of the generated data as real.
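To make the split-generation idea concrete (synthesizing a long ink in short chunks rather than in one pass), here is a highly simplified sketch. It is not the authors' implementation: `synthesize_ink` is a hypothetical stand-in for a trained encoder-decoder ink model, and the chunking and horizontal-offset logic are illustrative assumptions only.

```python
from typing import List, Tuple

Stroke = List[Tuple[float, float]]  # a stroke as a list of (x, y) points


def synthesize_ink(text: str, style: List[float]) -> List[Stroke]:
    # Hypothetical stand-in for a trained ink model conditioned on a
    # style vector. Here it just emits one dummy stroke per character.
    return [[(float(i), 0.0), (float(i) + 0.5, 1.0)] for i, _ in enumerate(text)]


def split_generate(text: str, style: List[float],
                   words_per_chunk: int = 4) -> List[Stroke]:
    """Generate long ink by synthesizing short word chunks with a fixed
    style vector, shifting each chunk to the right of the previous one."""
    words = text.split()
    ink: List[Stroke] = []
    x_offset = 0.0
    for start in range(0, len(words), words_per_chunk):
        chunk = " ".join(words[start:start + words_per_chunk])
        strokes = synthesize_ink(chunk, style)
        # Translate the chunk so it continues after the ink so far.
        shifted = [[(x + x_offset, y) for (x, y) in s] for s in strokes]
        ink.extend(shifted)
        if shifted:
            x_offset = max(x for s in shifted for (x, _) in s) + 1.0  # gap
    return ink
```

Conditioning every chunk on the same style vector is what keeps the concatenated output looking like a single writer's hand.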

A. Timofeev and A. Afonin—Work done as student researchers at Google Research, Zürich, Switzerland.

A. Timofeev and A. Fadeeva—These authors contributed equally to this work and share first authorship.


Notes

  1. A notebook with test sets and model inference is available here: https://colab.research.google.com/drive/1SB_vyDcsdSq1CtE9IOD9opBR9IDgG0ly.


Author information


Correspondence to Aleksandr Timofeev, Anastasiia Fadeeva, Andrei Afonin, Claudiu Musat or Andrii Maksai.


6 Appendix

Table 9. Recognizer CER on test data, attributable to data noise and the model's mistakes.
Table 10. CER of the curve transformer on the long test set with and without the most horizontal ink.
Table 11. CER for different lengths in the data augmentation procedure, with a candidate pool of 15K and a threshold of 0.5. We compare sets with average lengths of 26, 37, 48, 65, 79, and 100. For RNN models, training with lengths \(> 26\) results in poor performance.
Table 12. CER for different numbers of words in split generation on the synthetic long validation dataset.
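The tables above report results as character error rate (CER). For reference, CER is conventionally the Levenshtein edit distance between the recognized text and the reference, normalized by the reference length; the following is a generic sketch of that metric, not code from the paper:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance / reference length."""
    m, n = len(reference), len(hypothesis)
    # dp[i][j] = edit distance between reference[:i] and hypothesis[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n] / max(m, 1)
```

Note that CER can exceed 1.0 when the hypothesis contains many insertions relative to a short reference.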


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Timofeev, A., Fadeeva, A., Afonin, A., Musat, C., Maksai, A. (2023). DSS: Synthesizing Long Digital Ink Using Data Augmentation, Style Encoding and Split Generation. In: Fink, G.A., Jain, R., Kise, K., Zanibbi, R. (eds) Document Analysis and Recognition - ICDAR 2023. ICDAR 2023. Lecture Notes in Computer Science, vol 14190. Springer, Cham. https://doi.org/10.1007/978-3-031-41685-9_14


  • DOI: https://doi.org/10.1007/978-3-031-41685-9_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-41684-2

  • Online ISBN: 978-3-031-41685-9

  • eBook Packages: Computer Science, Computer Science (R0)
