Look, Read and Ask: Learning to Ask Questions by Reading Text in Images

  • Conference paper
  • First Online:
Document Analysis and Recognition – ICDAR 2021 (ICDAR 2021)

Abstract

We present a novel problem: text-based visual question generation, or TextVQG for short. Given the document image analysis community's growing interest in combining text understanding with conversational artificial intelligence, e.g., text-based visual question answering, TextVQG becomes an important task. TextVQG aims to generate a natural language question for a given input image and an automatically extracted text, also known as an OCR token, such that the OCR token is the answer to the generated question. TextVQG is an essential ability for a conversational agent, but it is challenging: it requires an in-depth understanding of the scene and the ability to semantically bridge the visual content with the text present in the image. To address TextVQG, we present an OCR-consistent visual question generation model that Looks into the visual content, Reads the scene text, and Asks a relevant and meaningful natural language question. We refer to our proposed model as OLRA. We extensively evaluate OLRA on two public benchmarks and compare it against baselines. OLRA automatically generates questions similar to those in public text-based visual question answering datasets that were curated manually. Moreover, it significantly outperforms baseline approaches on the performance measures popularly used in the text generation literature.
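To make the task interface concrete, the sketch below shows one TextVQG example as a data record, paired with a trivial template baseline of the kind the abstract compares against. This is an illustrative sketch only; the field names and the baseline are hypothetical and are not the authors' OLRA model, which conditions jointly on visual features and the OCR token.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TextVQGInstance:
    """One TextVQG example: an image, an OCR token automatically
    extracted from it, and (at training time) a target question
    whose answer is that OCR token."""
    image_path: str
    ocr_token: str                    # e.g. "STOP" read off a road sign
    question: Optional[str] = None    # target question; None at inference


def template_baseline(instance: TextVQGInstance) -> str:
    """A naive template baseline: it ignores the visual content and
    always asks the same generic question about the scene text.
    The OCR token must not appear in the question, since it is the answer."""
    return "What does the text in the image say?"


example = TextVQGInstance(image_path="img_001.jpg", ocr_token="STOP")
print(template_baseline(example))
```

A learned model would instead produce questions grounded in the scene (e.g. asking about the word on a specific object), which is what the OCR-consistency objective encourages.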

“To seek truth requires one to ask the right questions.”

Suzy Kassem


Notes

  1. For one of the datasets, namely TextVQA, OCR tokens extracted using Rosetta [6] are provided with the dataset.

  2. When we use CRAFT for text detection, we use only the first four positional features, i.e., the top-left x, top-left y, width, and height of the bounding box.
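The four positional features in Note 2 can be derived from a detected text box as follows. This is a minimal sketch assuming the detector returns axis-aligned boxes in `(x_min, y_min, x_max, y_max)` form, a common convention; the exact CRAFT output layout may differ (CRAFT natively produces character-region polygons).

```python
def positional_features(bbox):
    """Convert an axis-aligned box (x_min, y_min, x_max, y_max)
    into the four positional features from Note 2:
    top-left x, top-left y, width, and height."""
    x_min, y_min, x_max, y_max = bbox
    return (x_min, y_min, x_max - x_min, y_max - y_min)


print(positional_features((10, 20, 110, 60)))  # (10, 20, 100, 40)
```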

References

  1. ICDAR 2019 Robust Reading Challenge on Scene Text Visual Question Answering. https://rrc.cvc.uab.es/?ch=11. Accessed 01 Feb 2021
  2. Antol, S., et al.: VQA: visual question answering. In: ICCV (2015)
  3. Baek, Y., Lee, B., Han, D., Yun, S., Lee, H.: Character region awareness for text detection. In: CVPR (2019)
  4. Biten, A.F., et al.: Scene text visual question answering. In: ICCV (2019)
  5. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Trans. Assoc. Comput. Linguistics 5, 135–146 (2017)
  6. Borisyuk, F., Gordo, A., Sivakumar, V.: Rosetta: large scale system for text detection and recognition in images. In: KDD (2018)
  7. Celikyilmaz, A., Clark, E., Gao, J.: Evaluation of text generation: a survey. CoRR abs/2006.14799 (2020)
  8. Fan, Z., Wei, Z., Li, P., Lan, Y., Huang, X.: A question type driven framework to diversify visual question generation. In: IJCAI (2018)
  9. Gao, D., Li, K., Wang, R., Shan, S., Chen, X.: Multi-modal graph neural network for joint reasoning on vision and scene text. In: CVPR (2020)
  10. Gülçehre, Ç., Dutil, F., Trischler, A., Bengio, Y.: Plan, attend, generate: planning for sequence-to-sequence models. In: NIPS (2017)
  11. Hu, R., Singh, A., Darrell, T., Rohrbach, M.: Iterative answer prediction with pointer-augmented multimodal transformers for TextVQA. In: CVPR (2020)
  12. Jaderberg, M., Simonyan, K., Vedaldi, A., Zisserman, A.: Reading text in the wild with convolutional neural networks. Int. J. Comput. Vision 116(1), 1–20 (2016)
  13. Jain, U., Zhang, Z., Schwing, A.G.: Creativity: generating diverse questions using variational autoencoders. In: CVPR (2017)
  14. Johnson, J., Hariharan, B., van der Maaten, L., Fei-Fei, L., Zitnick, C.L., Girshick, R.B.: CLEVR: a diagnostic dataset for compositional language and elementary visual reasoning. In: CVPR (2017)
  15. Karaoglu, S., Tao, R., Gevers, T., Smeulders, A.W.M.: Words matter: scene text for image classification and retrieval. IEEE Trans. Multimed. 19(5), 1063–1076 (2017)
  16. Krasin, I., et al.: OpenImages: a public dataset for large-scale multi-label and multi-class image classification. Dataset available from https://github.com/openimages (2017)
  17. Krishna, R., Bernstein, M., Fei-Fei, L.: Information maximizing visual question generation. In: CVPR (2019)
  18. Lavie, A., Agarwal, A.: METEOR: an automatic metric for MT evaluation with high levels of correlation with human judgments. In: WMT@ACL (2007)
  19. Li, Y., et al.: Visual question generation as dual task of visual question answering. In: CVPR (2018)
  20. Liu, F., Xu, G., Wu, Q., Du, Q., Jia, W., Tan, M.: Cascade reasoning network for text-based visual question answering. In: ACM Multimedia (2020)
  21. Long, S., He, X., Yao, C.: Scene text detection and recognition: the deep learning era. Int. J. Comput. Vision 129(1), 161–184 (2021)
  22. Lopez, L.E., Cruz, D.K., Cruz, J.C.B., Cheng, C.: Transformer-based end-to-end question generation. CoRR abs/2005.01107 (2020)
  23. Luo, R., Price, B.L., Cohen, S., Shakhnarovich, G.: Discriminability objective for training descriptive captions. In: CVPR (2018)
  24. Mafla, A., de Rezende, R.S., Gómez, L., Larlus, D., Karatzas, D.: StacMR: scene-text aware cross-modal retrieval. CoRR abs/2012.04329 (2020)
  25. Mathew, M., Karatzas, D., Manmatha, R., Jawahar, C.V.: DocVQA: a dataset for VQA on document images. In: WACV (2021)
  26. Mikolov, T., Yih, W., Zweig, G.: Linguistic regularities in continuous space word representations. In: HLT-NAACL (2013)
  27. Mishra, A., Alahari, K., Jawahar, C.V.: Scene text recognition using higher order language priors. In: BMVC (2012)
  28. Mishra, A., Shekhar, S., Singh, A.K., Chakraborty, A.: OCR-VQA: visual question answering by reading text in images. In: ICDAR (2019)
  29. Misra, I., Girshick, R.B., Fergus, R., Hebert, M., Gupta, A., van der Maaten, L.: Learning by asking questions. In: CVPR (2018)
  30. Mostafazadeh, N., Misra, I., Devlin, J., Mitchell, M., He, X., Vanderwende, L.: Generating natural questions about an image. In: ACL (2016)
  31. Neumann, L., Matas, J.: Real-time lexicon-free scene text localization and recognition. IEEE Trans. Pattern Anal. Mach. Intell. 38(9), 1872–1885 (2016)
  32. Patro, B.N., Kurmi, V.K., Kumar, S., Namboodiri, V.P.: Deep Bayesian network for visual question generation. In: WACV (2020)
  33. Pennington, J., Socher, R., Manning, C.D.: GloVe: global vectors for word representation. In: EMNLP (2014)
  34. Ren, S., He, K., Girshick, R.B., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39(6), 1137–1149 (2017)
  35. Serban, I.V., et al.: Generating factoid questions with recurrent neural networks: the 30M factoid question-answer corpus. In: ACL (2016)
  36. Shi, B., Bai, X., Yao, C.: An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Trans. Pattern Anal. Mach. Intell. 39(11), 2298–2304 (2017)
  37. Sidorov, O., Hu, R., Rohrbach, M., Singh, A.: TextCaps: a dataset for image captioning with reading comprehension. In: ECCV (2020)
  38. Singh, A.K., Mishra, A., Shekhar, S., Chakraborty, A.: From strings to things: knowledge-enabled VQA model that can read and reason. In: ICCV (2019)
  39. Singh, A., et al.: Towards VQA models that can read. In: CVPR (2019)
  40. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: NIPS (2014)
  41. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: CVPR (2016)
  42. Wang, K., Babenko, B., Belongie, S.J.: End-to-end scene text recognition. In: ICCV (2011)
  43. Yang, J., Lu, J., Lee, S., Batra, D., Parikh, D.: Visual curiosity: learning to ask questions to learn visual recognition. In: CoRL (2018)
  44. Zhang, S., Qu, L., You, S., Yang, Z., Zhang, J.: Automatic generation of grounded visual questions. In: IJCAI (2017)


Author information


Corresponding author

Correspondence to Shankar Gangisetty.



Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Jahagirdar, S., Gangisetty, S., Mishra, A. (2021). Look, Read and Ask: Learning to Ask Questions by Reading Text in Images. In: Lladós, J., Lopresti, D., Uchida, S. (eds) Document Analysis and Recognition – ICDAR 2021. ICDAR 2021. Lecture Notes in Computer Science(), vol 12821. Springer, Cham. https://doi.org/10.1007/978-3-030-86549-8_22


  • DOI: https://doi.org/10.1007/978-3-030-86549-8_22

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-86548-1

  • Online ISBN: 978-3-030-86549-8

  • eBook Packages: Computer Science (R0)
