Text-to-Feature Diffusion for Audio-Visual Few-Shot Learning

  • Conference paper
  • In: Pattern Recognition (DAGM GCPR 2023)

Abstract

Training deep learning models for video classification from audio-visual data commonly requires vast amounts of labeled training data collected via a costly process. A challenging and underexplored, yet much cheaper, setup is few-shot learning from video data. In particular, the inherently multi-modal nature of video data, with sound and visual information, has not been leveraged extensively for the few-shot video classification task. We therefore introduce a unified audio-visual few-shot video classification benchmark on three datasets, VGGSound-FSL, UCF-FSL, and ActivityNet-FSL, on which we adapt and compare ten methods. In addition, we propose AV-Diff, a text-to-feature diffusion framework which first fuses temporal and audio-visual features via cross-modal attention and then generates multi-modal features for the novel classes. We show that AV-Diff obtains state-of-the-art performance on our proposed benchmark for audio-visual (generalised) few-shot learning. Our benchmark paves the way for effective audio-visual classification when only limited labeled data is available. Code and data are available at https://github.com/ExplainableML/AVDIFF-GFSL.
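
To make the two ingredients named in the abstract concrete, the sketch below shows (a) cross-modal attention that fuses temporal audio and visual token sequences into a single feature vector, and (b) a DDPM-style noise-prediction model that learns to generate such fused features conditioned on a class-text embedding. This is a minimal illustrative sketch assuming PyTorch; all module names, dimensions, and the noise schedule are our own assumptions and do not reproduce the authors' AV-Diff implementation.

```python
# Minimal illustrative sketch (PyTorch) of the two ideas named in the abstract:
# (1) cross-modal attention fusing temporal audio/visual features, and
# (2) a text-conditioned diffusion (noise-prediction) model over fused features.
# All names, dimensions, and the schedule are assumptions, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalFusion(nn.Module):
    """Audio tokens attend to visual tokens and vice versa; pool to one feature."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # audio: (B, Ta, D), visual: (B, Tv, D)
        a, _ = self.a2v(audio, visual, visual)   # audio queries, visual keys/values
        v, _ = self.v2a(visual, audio, audio)    # visual queries, audio keys/values
        return torch.cat([a, v], dim=1).mean(dim=1)  # (B, D) fused feature


class FeatureDenoiser(nn.Module):
    """Predicts the noise added to a fused feature, given text embedding and timestep."""

    def __init__(self, dim: int = 512, text_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + text_dim + 1, 1024), nn.ReLU(),
            nn.Linear(1024, dim),
        )

    def forward(self, x_t, text_emb, t):
        # x_t: (B, D) noisy feature, text_emb: (B, text_dim), t: (B, 1) in [0, 1]
        return self.net(torch.cat([x_t, text_emb, t], dim=-1))


def diffusion_loss(denoiser, x0, text_emb, alphas_cumprod):
    """Standard DDPM noise-prediction loss, applied to feature vectors."""
    B, T = x0.size(0), alphas_cumprod.numel()
    t = torch.randint(0, T, (B,))
    a_bar = alphas_cumprod[t].unsqueeze(-1)                 # (B, 1)
    noise = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise  # forward (noising) process
    pred = denoiser(x_t, text_emb, t.float().unsqueeze(-1) / T)
    return F.mse_loss(pred, noise)


if __name__ == "__main__":
    fuse, den = CrossModalFusion(), FeatureDenoiser()
    betas = torch.linspace(1e-4, 2e-2, 1000)                # linear beta schedule
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
    x0 = fuse(torch.randn(4, 30, 512), torch.randn(4, 60, 512))
    loss = diffusion_loss(den, x0, torch.randn(4, 512), alphas_cumprod)
    print(float(loss))
```

Under this setup, features for a novel class would be sampled at test time by running the learned reverse process from Gaussian noise, conditioned on that class's text embedding, and a classifier would then be trained on the generated features alongside the few real shots.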



Acknowledgements

This work was supported by BMBF FKZ: 01IS18039A, DFG: SFB 1233 TP 17 - project number 276693517, by the ERC (853489 - DEXIM), and by EXC number 2064/1 - project number 390727645. The authors thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for supporting O.-B. Mercea and T. Hummel.

Author information

Correspondence to Otniel-Bogdan Mercea.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF, 401 KB)


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Mercea, OB., Hummel, T., Koepke, A.S., Akata, Z. (2024). Text-to-Feature Diffusion for Audio-Visual Few-Shot Learning. In: Köthe, U., Rother, C. (eds) Pattern Recognition. DAGM GCPR 2023. Lecture Notes in Computer Science, vol 14264. Springer, Cham. https://doi.org/10.1007/978-3-031-54605-1_32

  • DOI: https://doi.org/10.1007/978-3-031-54605-1_32

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-54604-4

  • Online ISBN: 978-3-031-54605-1

  • eBook Packages: Computer Science; Computer Science (R0)
