Computer Science > Computer Vision and Pattern Recognition

arXiv:2404.13611 (cs)

[Submitted on 21 Apr 2024 (v1), last revised 1 Jun 2024 (this version, v2)]

Title: Video sentence grounding with temporally global textual knowledge

Authors: Cai Chen, Runzhong Zhang, Jianjun Gao, Kejun Wu, Kim-Hui Yap, Yi Wang

Abstract:Temporal sentence grounding involves the retrieval of a video moment with a natural language query. Many existing works directly incorporate the given video and temporally localized query for temporal grounding, overlooking the inherent domain gap between different modalities. In this paper, we utilize pseudo-query features containing extensive temporally global textual knowledge sourced from the same video-query pair, to enhance the bridging of domain gaps and attain a heightened level of similarity between multi-modal features. Specifically, we propose a Pseudo-query Intermediary Network (PIN) to achieve an improved alignment of visual and comprehensive pseudo-query features within the feature space through contrastive learning. Subsequently, we utilize learnable prompts to encapsulate the knowledge of pseudo-queries, propagating them into the textual encoder and multi-modal fusion module, further enhancing the feature alignment between visual and language for better temporal grounding. Extensive experiments conducted on the Charades-STA and ActivityNet-Captions datasets demonstrate the effectiveness of our method.
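The abstract states that PIN aligns visual and pseudo-query features through contrastive learning but does not give the loss. As a rough illustration only, the sketch below shows a generic symmetric InfoNCE objective over a batch of paired features. It is not the paper's implementation: the function names, the temperature value, the use of NumPy, and the random toy features are all assumptions made for this example.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_info_nce(visual, pseudo_query, temperature=0.07):
    """Generic symmetric InfoNCE loss (hypothetical stand-in, not the paper's loss).

    visual, pseudo_query: arrays of shape (batch, dim); row i of each array
    is assumed to come from the same video-query pair (the positive pair).
    All other rows in the batch act as negatives.
    """
    v = l2_normalize(visual)
    q = l2_normalize(pseudo_query)
    logits = v @ q.T / temperature            # (batch, batch) similarity matrix
    idx = np.arange(len(v))                   # positives sit on the diagonal
    loss_v2q = -log_softmax(logits, axis=1)[idx, idx].mean()  # visual -> text
    loss_q2v = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> visual
    return 0.5 * (loss_v2q + loss_q2v)


# Toy data: "aligned" pseudo-query features are noisy copies of the visual
# features, while "unrelated" ones are drawn independently.
rng = np.random.default_rng(0)
visual_feats = rng.normal(size=(8, 16))
aligned_feats = visual_feats + 0.01 * rng.normal(size=(8, 16))
unrelated_feats = rng.normal(size=(8, 16))

loss_aligned = symmetric_info_nce(visual_feats, aligned_feats)
loss_unrelated = symmetric_info_nce(visual_feats, unrelated_feats)
```

In this toy setup, `loss_aligned` should come out lower than `loss_unrelated`, which is the behaviour a contrastive objective rewards: matched visual and pseudo-query features are pulled together and mismatched ones pushed apart. How PIN builds its pseudo-queries, selects negatives, or weights this term is not stated in the abstract.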

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2404.13611 [cs.CV]
	(or arXiv:2404.13611v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2404.13611

Submission history

From: Chen Cai
[v1] Sun, 21 Apr 2024 10:41:04 UTC (1,692 KB)
[v2] Sat, 1 Jun 2024 06:56:16 UTC (4,270 KB)
