Cascade 2D attentional decoders with context-enhanced encoder for scene text recognition

  • Original Article
  • Published in Neural Computing and Applications (2024)

Abstract

The sequence decoding framework has dominated the field of scene text recognition, and RNN-based (recurrent neural network) decoders are among its main approaches. The attention mechanism is a key module in the RNN-based decoder: at each decoding step, a character is predicted from an estimated attention map, so the precision of that map is critical to the accuracy of the final output. In practice, we find that the estimated attention map often suffers from attention misalignment. To address this issue, we propose CASTER, cascade 2D attentional decoders with a context-enhanced encoder for scene text recognition. We employ a thin-plate-spline transformation to rectify images containing oriented or curved text and a 31-layer ResNet backbone to extract visual features. We then apply a two-stage decoding mechanism, localization and decoding (coarse decoder) followed by re-localization and re-decoding (refined decoder), to predict the character sequence. We also introduce a context-enhanced encoder built on a 2D contextual fusion module to capture contextual information. CASTER localizes the attention region of each character more accurately than one-stage attention methods and thus improves recognition performance. Extensive experiments show that CASTER achieves state-of-the-art results on several standard benchmarks, obtaining recognition accuracies of 96.1% and 93.3% on the regular-text datasets IIIT5K and SVT, respectively, and 94.4% on the irregular-text dataset CUTE.
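For illustration, the following is a minimal PyTorch sketch of a cascaded (coarse-then-refined) 2D attention decoding step in the spirit of the abstract. The module names, layer sizes, GRU-based recurrence, and character-set size are assumptions made for the example, not details taken from the paper's actual CASTER architecture.

```python
# Minimal sketch of one cascaded 2D attention decoding step: a coarse
# attention pass followed by a refined pass that re-attends conditioned on
# the coarse glimpse. All dimensions and module choices are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Attention2D(nn.Module):
    """Additive 2D attention over a CNN feature map of shape (B, C, H, W)."""

    def __init__(self, feat_dim: int, hidden_dim: int, attn_dim: int = 256):
        super().__init__()
        self.proj_feat = nn.Conv2d(feat_dim, attn_dim, kernel_size=1)
        self.proj_hidden = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Conv2d(attn_dim, 1, kernel_size=1)

    def forward(self, feats, hidden):
        # feats: (B, C, H, W), hidden: (B, hidden_dim)
        b, c, h, w = feats.shape
        energy = torch.tanh(self.proj_feat(feats) +
                            self.proj_hidden(hidden)[:, :, None, None])
        alpha = F.softmax(self.score(energy).view(b, -1), dim=1)      # (B, H*W)
        glimpse = (alpha.view(b, 1, h, w) * feats).sum(dim=(2, 3))    # (B, C)
        return glimpse, alpha.view(b, h, w)


class CascadeDecoderStep(nn.Module):
    """One decoding step: coarse localization/decoding, then re-localization
    and re-decoding conditioned on the coarse glimpse."""

    def __init__(self, feat_dim=512, hidden_dim=256, num_classes=97):
        super().__init__()
        self.coarse_attn = Attention2D(feat_dim, hidden_dim)
        self.refine_attn = Attention2D(feat_dim, hidden_dim + feat_dim)
        self.rnn = nn.GRUCell(feat_dim + num_classes, hidden_dim)
        self.embed = nn.Embedding(num_classes, num_classes)
        self.classifier = nn.Linear(hidden_dim + feat_dim, num_classes)

    def forward(self, feats, hidden, prev_char):
        # Coarse pass: localize and decode from the current hidden state.
        g_coarse, _ = self.coarse_attn(feats, hidden)
        # Refined pass: re-localize using the hidden state plus the coarse
        # glimpse, then re-decode from the refined glimpse.
        g_refine, alpha = self.refine_attn(
            feats, torch.cat([hidden, g_coarse], dim=1))
        hidden = self.rnn(
            torch.cat([g_refine, self.embed(prev_char)], dim=1), hidden)
        logits = self.classifier(torch.cat([hidden, g_refine], dim=1))
        return logits, hidden, alpha


# Usage on dummy encoder features (e.g. from a ResNet-style backbone):
if __name__ == "__main__":
    step = CascadeDecoderStep()
    feats = torch.randn(2, 512, 8, 32)        # (B, C, H, W) feature map
    hidden = torch.zeros(2, 256)              # initial decoder state
    prev = torch.zeros(2, dtype=torch.long)   # <start> token index
    logits, hidden, attn = step(feats, hidden, prev)
    print(logits.shape, attn.shape)           # (2, 97), (2, 8, 32)
```

In a full cascade, such a step would typically run once per output character, with the refined attention map used in place of the coarse one when predicting the character.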



Data availability

The data that support the findings of this study are available from the corresponding author upon reasonable request.


Funding

This work was supported by the Fundamental Research Funds for the Central Universities (Grant number 2662022LXYJ004).

Author information


Corresponding author

Correspondence to Xinran Li.

Ethics declarations

Conflict of interest

The authors have no competing interests relevant to the content of this article to declare.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Chi, H., Cai, J. & Li, X. Cascade 2D attentional decoders with context-enhanced encoder for scene text recognition. Neural Comput & Applic 36, 7817–7827 (2024). https://doi.org/10.1007/s00521-024-09493-5

