Abstract
The sequence decoding framework dominates the field of scene text recognition, and the recurrent neural network (RNN)-based decoder is one of its main approaches. The attention mechanism is a key module of the RNN-based decoder: in the decoding stage, each character is decoded from an estimated attention map, so the precision of that map is extremely important to the accuracy of the final output. In practice, we find that the estimated attention map suffers from attention misalignment. To address this issue, we propose cascade 2D attentional decoders with a context-enhanced encoder for scene text recognition, which we name CASTER. We employ a thin plate spline transformation to rectify original images containing oriented or curved text, and a 31-layer ResNet as the backbone to extract visual features. We then use a two-stage decoding mechanism to predict the character sequence: localization and decoding (coarse decoder), followed by re-localization and re-decoding (refined decoder). We also introduce a novel context-enhanced encoder that uses a 2D contextual fusion module to capture context information. CASTER localizes the attention region of each character more accurately than the one-stage attention method and thus improves the final recognition performance. Extensive experiments show that CASTER achieves state-of-the-art performance on several standard benchmarks. Our method obtains recognition accuracies of 96.1% and 93.3% on the regular text datasets IIIT5K and SVT, respectively, and 94.4% on the irregular text dataset CUTE.
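The two-stage idea of localizing, decoding, then re-localizing can be illustrated with a toy example. The sketch below is not the CASTER architecture, whose details are not given in the abstract. It assumes plain dot-product attention over a small 2D feature grid, and it assumes that the second pass simply adds the coarse glimpse vector to the query. The function names `attend_2d` and `cascade_attend` are hypothetical.

```python
import math


def attend_2d(features, query):
    """Dot-product attention over an H x W grid of C-dim feature vectors.

    Returns (attention_map, glimpse): the map is H x W and sums to 1,
    the glimpse is the attention-weighted average feature vector.
    """
    scores = [[sum(f * q for f, q in zip(cell, query)) for cell in row]
              for row in features]
    peak = max(max(row) for row in scores)  # subtract the max for numerical stability
    exps = [[math.exp(s - peak) for s in row] for row in scores]
    total = sum(sum(row) for row in exps)
    attn = [[e / total for e in row] for row in exps]

    channels = len(query)
    glimpse = [0.0] * channels
    for attn_row, feat_row in zip(attn, features):
        for weight, cell in zip(attn_row, feat_row):
            for c in range(channels):
                glimpse[c] += weight * cell[c]
    return attn, glimpse


def cascade_attend(features, query):
    """Coarse attention pass, then a second pass re-queried with the coarse glimpse."""
    coarse_map, coarse_glimpse = attend_2d(features, query)
    # Assumption for illustration: refine the query by adding the coarse glimpse.
    refined_query = [q + g for q, g in zip(query, coarse_glimpse)]
    refined_map, refined_glimpse = attend_2d(features, refined_query)
    return coarse_map, refined_map, refined_glimpse


# Toy 2 x 3 feature map with 2 channels; the cell at row 0, column 2
# is the one that matches the query best.
features = [
    [[0.1, 0.0], [0.2, 0.1], [1.0, 1.0]],
    [[0.0, 0.2], [0.1, 0.1], [0.3, 0.2]],
]
query = [1.0, 1.0]
coarse_map, refined_map, glimpse = cascade_attend(features, query)
```

On this toy input, both passes peak at the same cell and the second pass places more of its weight there. That is a property of this particular example, not a general guarantee of the sketch or a claim about the published model.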
Data availability
The data that support the findings of this study are available from the corresponding author upon reasonable request.
Funding
This work was supported by the Fundamental Research Funds for the Central Universities (Grant number 2662022LXYJ004).
Ethics declarations
Conflict of interest
The authors have no competing interests to declare that are relevant to the content of this article.
About this article
Cite this article
Chi, H., Cai, J. & Li, X. Cascade 2D attentional decoders with context-enhanced encoder for scene text recognition. Neural Comput & Applic 36, 7817–7827 (2024). https://doi.org/10.1007/s00521-024-09493-5
DOI: https://doi.org/10.1007/s00521-024-09493-5