Abstract
We advance sketch research to scenes with FS-COCO, the first dataset of freehand scene sketches. With practical applications in mind, we collect sketches that convey scene content well yet can be sketched within a few minutes by a person of any sketching skill. Our dataset comprises 10,000 freehand scene vector sketches with per-point space-time information, drawn by 100 non-expert individuals, offering both object- and scene-level abstraction. Each sketch is augmented with its text description. Using our dataset, we study for the first time the problem of fine-grained image retrieval from freehand scene sketches and sketch captions. We draw insights on: (i) scene salience encoded in sketches via the temporal order of strokes; (ii) the performance of image retrieval from a scene sketch compared with retrieval from an image caption; (iii) the complementarity of information in sketches and image captions, and the potential benefit of combining the two modalities. In addition, we extend a popular LSTM-based vector sketch encoder to handle sketches of greater complexity than supported by previous work. Namely, we propose a hierarchical sketch decoder, which we leverage in a sketch-specific “pretext” task. Our dataset enables, for the first time, research on freehand scene sketch understanding and its practical applications. We release the dataset under the CC BY-NC 4.0 license: https://github.com/pinakinathc/fscoco.
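To make the hierarchical idea concrete, below is a minimal PyTorch sketch of a two-level encoder: a stroke-level LSTM summarizes the points of each stroke, and a sketch-level LSTM summarizes the resulting sequence of stroke embeddings. The module names, hidden sizes, and the 4-D per-point format (x, y, t, pen-state) are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch of a hierarchical vector-sketch encoder (assumptions noted above).
import torch
import torch.nn as nn

class HierarchicalSketchEncoder(nn.Module):
    def __init__(self, point_dim=4, stroke_hidden=128, sketch_hidden=256):
        super().__init__()
        # Level 1: encodes the point sequence of a single stroke.
        self.stroke_lstm = nn.LSTM(point_dim, stroke_hidden, batch_first=True)
        # Level 2: encodes the sequence of stroke embeddings into one sketch embedding.
        self.sketch_lstm = nn.LSTM(stroke_hidden, sketch_hidden, batch_first=True)

    def forward(self, strokes):
        # strokes: list of (num_points_i, point_dim) tensors, one tensor per stroke.
        stroke_embs = []
        for s in strokes:
            _, (h, _) = self.stroke_lstm(s.unsqueeze(0))  # (1, P, D) -> h: (1, 1, H)
            stroke_embs.append(h[-1])                     # (1, stroke_hidden)
        stroke_seq = torch.stack(stroke_embs, dim=1)      # (1, num_strokes, stroke_hidden)
        _, (h, _) = self.sketch_lstm(stroke_seq)
        return h[-1]                                      # (1, sketch_hidden)

# Usage: three strokes of varying length from one sketch.
encoder = HierarchicalSketchEncoder()
sketch = [torch.randn(n, 4) for n in (12, 30, 7)]
embedding = encoder(sketch)  # torch.Size([1, 256])
```

Splitting the sequence at stroke boundaries keeps each LSTM's input short, which is what lets such an encoder scale from single objects to full scenes with many strokes.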
Notes
- The performance of image captioning goes up to 170.5 when 100 generated captions are evaluated against the ground truth instead of 1.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Chowdhury, P.N., Sain, A., Bhunia, A.K., Xiang, T., Gryaditskaya, Y., Song, Y.Z. (2022). FS-COCO: Towards Understanding of Freehand Sketches of Common Objects in Context. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13668. Springer, Cham. https://doi.org/10.1007/978-3-031-20074-8_15
DOI: https://doi.org/10.1007/978-3-031-20074-8_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20073-1
Online ISBN: 978-3-031-20074-8
eBook Packages: Computer Science, Computer Science (R0)