Semantic Center Guided Windows Attention Fusion Framework for Food Recognition

  • Conference paper
Pattern Recognition and Computer Vision (PRCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13535)

Abstract

Food recognition has attracted considerable attention in computer vision because of its potential applications in health and business. The main challenge is that food has no fixed spatial structure or common semantic pattern. In this paper, we propose a semantic center guided windows attention fusion framework (SCG-WAFM) for food recognition. The proposed Windows Attention Fusion Module (WAFM) uses the Transformer's built-in self-attention mechanism to adaptively select the discriminative region, without requiring additional box annotations during training. WAFM fuses the window attention of Swin Transformer, crops the attended region from the raw image, and then scales that region up as the input to the next network, so that discriminative features are learned iteratively. In addition, the names of food categories carry important textual information, such as the major ingredients and cooking methods, which is easily accessible and helpful for food recognition. We therefore propose Semantic Center loss Guidance (SCG), which uses the context-sensitive semantic embeddings of food labels as category centers in feature space to guide the image features. We conduct extensive experiments on three popular food datasets, and the proposed method achieves state-of-the-art Top-1 accuracy, demonstrating the effectiveness of our approach.
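The two ideas in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the abstract gives neither the fusion rule nor the exact loss, so the mean-value threshold, the squared Euclidean distance, and all function names (attention_crop_box, box_to_pixels, semantic_center_loss) are assumptions made for illustration. The fused attention map is taken as already computed, and the label embeddings are assumed to have the same dimension as the image features.

```python
def attention_crop_box(attn, ratio=1.0):
    """Bounding box (top, left, bottom, right) of the salient cells of a
    fused attention map given as a list of rows.

    Assumption: a cell is salient when its value exceeds ratio * mean.
    The paper's actual selection rule is not stated in the abstract.
    """
    cells = [v for row in attn for v in row]
    thresh = ratio * sum(cells) / len(cells)
    rows = [i for i, row in enumerate(attn) if any(v > thresh for v in row)]
    cols = [j for j in range(len(attn[0]))
            if any(row[j] > thresh for row in attn)]
    if not rows:
        # Uniform map: nothing stands out, so keep the whole image.
        return 0, 0, len(attn), len(attn[0])
    return rows[0], cols[0], rows[-1] + 1, cols[-1] + 1


def box_to_pixels(box, grid_size, image_size):
    """Map a box on the attention grid to pixel coordinates.

    Assumption: square image whose side is a multiple of grid_size.
    The pixel crop would then be resized back to image_size and fed
    to the next network.
    """
    scale = image_size // grid_size
    return tuple(v * scale for v in box)


def semantic_center_loss(features, labels, centers):
    """Mean squared Euclidean distance between each image feature and
    the text embedding of its class label, used as a fixed center.

    Assumption: centers[k] is a precomputed embedding of the k-th food
    name; how the embeddings are produced is outside this sketch.
    """
    total = 0.0
    for feature, label in zip(features, labels):
        center = centers[label]
        total += sum((f - c) ** 2 for f, c in zip(feature, center))
    return total / len(features)
```

In this sketch, a 4x4 attention map whose large values sit in the central 2x2 block yields a box covering that block, which box_to_pixels maps onto the raw image. The loss decreases as image features move toward the embedding of their own label, which is the guiding effect the abstract describes.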

Author information

Corresponding author

Correspondence to Wenxiong Kang.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Zhou, Y., Chen, J., Zhang, X., Kang, W., Ming, Z. (2022). Semantic Center Guided Windows Attention Fusion Framework for Food Recognition. In: Yu, S., et al. Pattern Recognition and Computer Vision. PRCV 2022. Lecture Notes in Computer Science, vol 13535. Springer, Cham. https://doi.org/10.1007/978-3-031-18910-4_50

  • DOI: https://doi.org/10.1007/978-3-031-18910-4_50

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-18909-8

  • Online ISBN: 978-3-031-18910-4

  • eBook Packages: Computer Science, Computer Science (R0)
