Computer Science > Computer Vision and Pattern Recognition

arXiv:2404.05726 (cs)

[Submitted on 8 Apr 2024 (v1), last revised 24 Apr 2024 (this version, v2)]

Title: MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding

Authors: Bo He, Hengduo Li, Young Kyun Jang, Menglin Jia, Xuefei Cao, Ashish Shah, Abhinav Shrivastava, Ser-Nam Lim

Abstract: With the success of large language models (LLMs), integrating vision models into LLMs to build vision-language foundation models has attracted growing interest. However, existing LLM-based large multimodal models (e.g., Video-LLaMA, VideoChat) can take in only a limited number of frames, which restricts them to short-video understanding. In this study, we focus on designing an efficient and effective model for long-term video understanding. Instead of trying to process more frames simultaneously, as most existing work does, we propose to process videos in an online manner and store past video information in a memory bank. This allows our model to reference historical video content for long-term analysis without exceeding LLMs' context length constraints or GPU memory limits. Our memory bank can be seamlessly integrated into current multimodal LLMs in an off-the-shelf manner. We conduct extensive experiments on various video understanding tasks, such as long-video understanding, video question answering, and video captioning, and our model achieves state-of-the-art performance across multiple datasets. Code available at this https URL.
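
The online processing idea in the abstract can be illustrated with a minimal sketch: frames arrive one at a time, and a fixed-capacity memory bank keeps a compressed history so that storage never grows with video length. Everything named here (`MemoryBank`, `process_video`, `capacity`) is hypothetical and not taken from the authors' code. The compression rule, averaging the most similar pair of adjacent entries, is an assumption chosen for illustration, and frame features are stood in for by plain lists of floats.

```python
import math
from typing import Iterable, List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class MemoryBank:
    """Fixed-capacity store of per-frame features (hypothetical helper)."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.slots: List[Vector] = []

    def add(self, feature: Vector) -> None:
        """Append one frame feature, then compress if the bank is over capacity."""
        self.slots.append(list(feature))
        if len(self.slots) > self.capacity:
            self._merge_most_similar_neighbours()

    def _merge_most_similar_neighbours(self) -> None:
        # Assumed rule: average the adjacent pair with the highest cosine
        # similarity, so temporal order is preserved and length shrinks by one.
        sims = [
            cosine(self.slots[i], self.slots[i + 1])
            for i in range(len(self.slots) - 1)
        ]
        i = max(range(len(sims)), key=sims.__getitem__)
        merged = [(x + y) / 2 for x, y in zip(self.slots[i], self.slots[i + 1])]
        self.slots[i : i + 2] = [merged]


def process_video(frames: Iterable[Vector], capacity: int) -> MemoryBank:
    """Consume frame features one at a time (online), never holding them all."""
    bank = MemoryBank(capacity)
    for feature in frames:
        bank.add(feature)
    return bank
```

With this rule, the bank's size is bounded by `capacity` regardless of how many frames are streamed through it, which is the property the abstract relies on to stay within context-length and GPU memory limits.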

Comments:	Accepted at CVPR 2024. Project Page this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2404.05726 [cs.CV]
	(or arXiv:2404.05726v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2404.05726

Submission history

From: Bo He
[v1] Mon, 8 Apr 2024 17:59:24 UTC (4,000 KB)
[v2] Wed, 24 Apr 2024 15:38:48 UTC (4,003 KB)
