Look, Read and Ask: Learning to Ask Questions by Reading Text in Images

  • Conference paper
  • First Online:
Document Analysis and Recognition – ICDAR 2021 (ICDAR 2021)

Abstract

We present a novel problem: text-based visual question generation, or TextVQG for short. Given the document image analysis community's growing interest in combining text understanding with conversational artificial intelligence, e.g., text-based visual question answering, TextVQG becomes an important task. TextVQG aims to generate a natural language question for a given input image and an automatically extracted text, also known as an OCR token, such that the OCR token is the answer to the generated question. TextVQG is an essential ability for a conversational agent, but it is challenging: it requires an in-depth understanding of the scene and the ability to semantically bridge the visual content with the text present in the image. To address TextVQG, we present an OCR-consistent visual question generation model that Looks into the visual content, Reads the scene text, and Asks a relevant and meaningful natural language question. We refer to our proposed model as OLRA. We extensively evaluate OLRA on two public benchmarks and compare it against baselines. OLRA automatically generates questions similar to those in public text-based visual question answering datasets that were curated manually. Moreover, it significantly outperforms baseline approaches on the performance measures popularly used in the text generation literature.
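To make the task interface concrete, the sketch below shows one TextVQG example as a data record, paired with a trivial template baseline of the kind the abstract compares against. This is an illustrative sketch only; the field names and the baseline are hypothetical and are not the authors' OLRA model, which conditions jointly on visual features and the OCR token.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TextVQGInstance:
    """One TextVQG example: an image, an OCR token automatically
    extracted from it, and (at training time) a target question
    whose answer is that OCR token."""
    image_path: str
    ocr_token: str                    # e.g. "STOP" read off a road sign
    question: Optional[str] = None    # target question; None at inference


def template_baseline(instance: TextVQGInstance) -> str:
    """A naive template baseline: it ignores the visual content and
    always asks the same generic question about the scene text.
    The OCR token must not appear in the question, since it is the answer."""
    return "What does the text in the image say?"


example = TextVQGInstance(image_path="img_001.jpg", ocr_token="STOP")
print(template_baseline(example))
```

A learned model would instead produce questions grounded in the scene (e.g. asking about the word on a specific object), which is what the OCR-consistency objective encourages.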

“To seek truth requires one to ask the right questions.”

Suzy Kassem


Notes

  1. For one of the datasets, namely TextVQA, OCR tokens extracted using Rosetta [6] are provided with the dataset.

  2. When we use CRAFT for text detection, we use only the first four positional features, i.e., the top-left x, top-left y, width, and height of the bounding box.
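The four positional features in Note 2 can be derived from a detected text box as follows. This is a minimal sketch assuming the detector returns axis-aligned boxes in `(x_min, y_min, x_max, y_max)` form, a common convention; the exact CRAFT output layout may differ (CRAFT natively produces character-region polygons).

```python
def positional_features(bbox):
    """Convert an axis-aligned box (x_min, y_min, x_max, y_max)
    into the four positional features from Note 2:
    top-left x, top-left y, width, and height."""
    x_min, y_min, x_max, y_max = bbox
    return (x_min, y_min, x_max - x_min, y_max - y_min)


print(positional_features((10, 20, 110, 60)))  # (10, 20, 100, 40)
```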

References

  1. ICDAR 2019 Robust Reading Challenge on Scene Text Visual Question Answering. https://rrc.cvc.uab.es/?ch=11. Accessed 01 Feb 2021
  2. Antol, S., et al.: VQA: visual question answering. In: ICCV (2015)
  3. Baek, Y., Lee, B., Han, D., Yun, S., Lee, H.: Character region awareness for text detection. In: CVPR (2019)
  4. Biten, A.F., et al.: Scene text visual question answering. In: ICCV (2019)
  5. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Trans. Assoc. Comput. Linguistics 5, 135–146 (2017)
  6. Borisyuk, F., Gordo, A., Sivakumar, V.: Rosetta: large scale system for text detection and recognition in images. In: KDD (2018)
  7. Celikyilmaz, A., Clark, E., Gao, J.: Evaluation of text generation: a survey. CoRR abs/2006.14799 (2020)
  8. Fan, Z., Wei, Z., Li, P., Lan, Y., Huang, X.: A question type driven framework to diversify visual question generation. In: IJCAI (2018)
  9. Gao, D., Li, K., Wang, R., Shan, S., Chen, X.: Multi-modal graph neural network for joint reasoning on vision and scene text. In: CVPR (2020)
  10. Gülçehre, Ç., Dutil, F., Trischler, A., Bengio, Y.: Plan, attend, generate: planning for sequence-to-sequence models. In: NIPS (2017)
  11. Hu, R., Singh, A., Darrell, T., Rohrbach, M.: Iterative answer prediction with pointer-augmented multimodal transformers for TextVQA. In: CVPR (2020)
  12. Jaderberg, M., Simonyan, K., Vedaldi, A., Zisserman, A.: Reading text in the wild with convolutional neural networks. Int. J. Comput. Vision 116(1), 1–20 (2016)
  13. Jain, U., Zhang, Z., Schwing, A.G.: Creativity: generating diverse questions using variational autoencoders. In: CVPR (2017)
  14. Johnson, J., Hariharan, B., van der Maaten, L., Fei-Fei, L., Zitnick, C.L., Girshick, R.B.: CLEVR: a diagnostic dataset for compositional language and elementary visual reasoning. In: CVPR (2017)
  15. Karaoglu, S., Tao, R., Gevers, T., Smeulders, A.W.M.: Words matter: scene text for image classification and retrieval. IEEE Trans. Multimed. 19(5), 1063–1076 (2017)
  16. Krasin, I., et al.: OpenImages: a public dataset for large-scale multi-label and multi-class image classification. Dataset available from https://github.com/openimages (2017)
  17. Krishna, R., Bernstein, M., Fei-Fei, L.: Information maximizing visual question generation. In: CVPR (2019)
  18. Lavie, A., Agarwal, A.: METEOR: an automatic metric for MT evaluation with high levels of correlation with human judgments. In: WMT@ACL (2007)
  19. Li, Y., et al.: Visual question generation as dual task of visual question answering. In: CVPR (2018)
  20. Liu, F., Xu, G., Wu, Q., Du, Q., Jia, W., Tan, M.: Cascade reasoning network for text-based visual question answering. In: ACM Multimedia (2020)
  21. Long, S., He, X., Yao, C.: Scene text detection and recognition: the deep learning era. Int. J. Comput. Vision 129(1), 161–184 (2021)
  22. Lopez, L.E., Cruz, D.K., Cruz, J.C.B., Cheng, C.: Transformer-based end-to-end question generation. CoRR abs/2005.01107 (2020)
  23. Luo, R., Price, B.L., Cohen, S., Shakhnarovich, G.: Discriminability objective for training descriptive captions. In: CVPR (2018)
  24. Mafla, A., de Rezende, R.S., Gómez, L., Larlus, D., Karatzas, D.: StacMR: scene-text aware cross-modal retrieval. CoRR abs/2012.04329 (2020)
  25. Mathew, M., Karatzas, D., Manmatha, R., Jawahar, C.V.: DocVQA: a dataset for VQA on document images. In: WACV (2021)
  26. Mikolov, T., Yih, W., Zweig, G.: Linguistic regularities in continuous space word representations. In: HLT-NAACL (2013)
  27. Mishra, A., Alahari, K., Jawahar, C.V.: Scene text recognition using higher order language priors. In: BMVC (2012)
  28. Mishra, A., Shekhar, S., Singh, A.K., Chakraborty, A.: OCR-VQA: visual question answering by reading text in images. In: ICDAR (2019)
  29. Misra, I., Girshick, R.B., Fergus, R., Hebert, M., Gupta, A., van der Maaten, L.: Learning by asking questions. In: CVPR (2018)
  30. Mostafazadeh, N., Misra, I., Devlin, J., Mitchell, M., He, X., Vanderwende, L.: Generating natural questions about an image. In: ACL (2016)
  31. Neumann, L., Matas, J.: Real-time lexicon-free scene text localization and recognition. IEEE Trans. Pattern Anal. Mach. Intell. 38(9), 1872–1885 (2016)
  32. Patro, B.N., Kurmi, V.K., Kumar, S., Namboodiri, V.P.: Deep Bayesian network for visual question generation. In: WACV (2020)
  33. Pennington, J., Socher, R., Manning, C.D.: GloVe: global vectors for word representation. In: EMNLP (2014)
  34. Ren, S., He, K., Girshick, R.B., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39(6), 1137–1149 (2017)
  35. Serban, I.V., et al.: Generating factoid questions with recurrent neural networks: the 30M factoid question-answer corpus. In: ACL (2016)
  36. Shi, B., Bai, X., Yao, C.: An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Trans. Pattern Anal. Mach. Intell. 39(11), 2298–2304 (2017)
  37. Sidorov, O., Hu, R., Rohrbach, M., Singh, A.: TextCaps: a dataset for image captioning with reading comprehension. In: ECCV (2020)
  38. Singh, A.K., Mishra, A., Shekhar, S., Chakraborty, A.: From strings to things: knowledge-enabled VQA model that can read and reason. In: ICCV (2019)
  39. Singh, A., et al.: Towards VQA models that can read. In: CVPR (2019)
  40. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: NIPS (2014)
  41. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: CVPR (2016)
  42. Wang, K., Babenko, B., Belongie, S.J.: End-to-end scene text recognition. In: ICCV (2011)
  43. Yang, J., Lu, J., Lee, S., Batra, D., Parikh, D.: Visual curiosity: learning to ask questions to learn visual recognition. In: CoRL (2018)
  44. Zhang, S., Qu, L., You, S., Yang, Z., Zhang, J.: Automatic generation of grounded visual questions. In: IJCAI (2017)


Author information


Corresponding author

Correspondence to Shankar Gangisetty.



Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Jahagirdar, S., Gangisetty, S., Mishra, A. (2021). Look, Read and Ask: Learning to Ask Questions by Reading Text in Images. In: Lladós, J., Lopresti, D., Uchida, S. (eds) Document Analysis and Recognition – ICDAR 2021. ICDAR 2021. Lecture Notes in Computer Science(), vol 12821. Springer, Cham. https://doi.org/10.1007/978-3-030-86549-8_22


  • DOI: https://doi.org/10.1007/978-3-030-86549-8_22

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-86548-1

  • Online ISBN: 978-3-030-86549-8

  • eBook Packages: Computer Science (R0)
