Abstract
As text-generative models produce increasingly long answers, we tackle the problem of synthesizing long-form text in digital ink. We show that commonly used models for this task fail to generalize to long-form data, and demonstrate how this problem can be solved by augmenting the training data, changing the model architecture, and changing the inference procedure. These methods use a contrastive learning technique, are tailored specifically to the handwriting domain, and can be applied to any encoder-decoder model that works with digital ink. We demonstrate that our method halves the character error rate on long-form English data compared to a baseline RNN, and reduces it by 16% compared to the previous approach addressing the same problem. We show that all three parts of the method improve the recognizability of the generated inks. In addition, we evaluate the synthesized data in a human study and find that people perceive most of the generated data as real.
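To make the three ingredients named above concrete, the following is a minimal sketch, not the authors' code: it illustrates training-data augmentation by concatenating short ink samples into long ones, split generation at inference time, and a simplified SimCLR-style contrastive loss of the kind used for style encoding. The (x, y, pen_up) ink representation, the `synthesize(text, style)` model interface, and the names `gap`, `max_words`, and `ntxent_loss` are assumptions made for illustration only.

```python
import numpy as np

def concat_inks(inks, gap=20.0):
    """Join ink samples left-to-right. Each ink is an (N, 3) array of
    (x, y, pen_up) points; the x-offset and gap are illustrative choices.
    Used here to build long-form training examples from short ones."""
    parts, x_offset = [], 0.0
    for ink in inks:
        shifted = ink.copy()
        shifted[:, 0] += x_offset          # shift strokes to the right
        parts.append(shifted)
        x_offset = shifted[:, 0].max() + gap
    return np.concatenate(parts, axis=0)

def split_generate(synthesize, text, style, max_words=4):
    """Split long text into short chunks, synthesize each chunk with the
    same style embedding, and stitch the results into one long ink."""
    words = text.split()
    chunks = [" ".join(words[i:i + max_words])
              for i in range(0, len(words), max_words)]
    return concat_inks([synthesize(chunk, style) for chunk in chunks])

def ntxent_loss(z1, z2, tau=0.1):
    """Simplified NT-Xent contrastive loss between two batches of style
    embeddings (e.g. two snippets of the same writer's ink as a positive
    pair); matching rows are positives, all other rows negatives."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau               # (B, B) cosine similarities
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(z1))
    return -log_probs[idx, idx].mean()     # maximize the diagonal entries
```

Because every chunk in `split_generate` is conditioned on the same style embedding, the stitched output remains stylistically consistent even though no single decoding step ever exceeds the short sequence lengths the model saw during training.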
A. Timofeev and A. Afonin—Work done as student researchers at Google Research, Zürich, Switzerland.
A. Timofeev and A. Fadeeva—These authors contributed equally to this work and share first authorship.
Notes
1. A notebook to test model inference is available here: https://colab.research.google.com/drive/1SB_vyDcsdSq1CtE9IOD9opBR9IDgG0ly.
2. A notebook with test sets and model inference is available here: https://colab.research.google.com/drive/1SB_vyDcsdSq1CtE9IOD9opBR9IDgG0ly.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Timofeev, A., Fadeeva, A., Afonin, A., Musat, C., Maksai, A. (2023). DSS: Synthesizing Long Digital Ink Using Data Augmentation, Style Encoding and Split Generation. In: Fink, G.A., Jain, R., Kise, K., Zanibbi, R. (eds) Document Analysis and Recognition - ICDAR 2023. ICDAR 2023. Lecture Notes in Computer Science, vol 14190. Springer, Cham. https://doi.org/10.1007/978-3-031-41685-9_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-41684-2
Online ISBN: 978-3-031-41685-9