ASID-Caption: Towards Universal Video MLLMs with Attribute-Structured and Quality-Verified Instructions

Yunheng Li1 · Hengrui Zhang1 · Meng-Hao Guo3 · Wenzhao Gao2 · Shaoyong Jia2 · Shaohui Jiao2 · Qibin Hou1† · Ming-Ming Cheng1

1VCIP, Nankai University · 2ByteDance Inc. · 3Tsinghua University

†Corresponding author

Paper PDF · Project Page


✨ Overview

Existing video instruction datasets often describe each video with a single unstructured caption, which leads to incomplete descriptions and makes it hard to learn controllable, fine-grained understanding. Simply making captions longer tends to introduce more hallucinations without systematic verification.

Our key idea is to provide attribute-structured supervision and verify each attribute against audiovisual evidence, enabling more reliable fine-grained learning.
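As a rough illustration of what attribute-structured supervision means in practice, here is a minimal sketch of a caption record broken into named attributes. The field names (`subjects`, `actions`, `scene`, `camera`, `audio`, `verified`) are illustrative assumptions, not the dataset's actual schema:

```python
import json

# Hypothetical attribute-structured caption record.
# Field names are assumptions for illustration, not the real ASID-Caption schema.
record = {
    "video": "titanic.mp4",
    "attributes": {
        "subjects": "a man and a woman standing at the bow of a ship",
        "actions": "the woman spreads her arms while the man holds her from behind",
        "scene": "open ocean at sunset",
        "camera": "slow orbiting shot",
        "audio": "orchestral score over wind and ocean ambience",
    },
    # Each attribute is verified against audiovisual evidence before training.
    "verified": True,
}

# A flat caption can still be assembled from the verified attributes,
# while each attribute remains individually addressable for instruction tuning.
caption = " ".join(record["attributes"].values())
print(json.dumps(record, indent=2))
```

Structuring supervision this way is what makes per-attribute verification and attribute-conditioned prompting possible, instead of accepting or rejecting a long caption as a whole.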

🎬 Captioning Case of ASID-Caption

🚀 Getting Started

1. Clone the repository

First, clone the project and navigate into the directory:

git clone https://github.com/HVision-NKU/ASID-Caption.git
cd ASID-Caption

2. Set Up the Environment

Requires Python 3.11 (the flash-attention wheel below is built for cp311).

2.1 Inference

pip install torch==2.6.0 torchaudio==2.6.0 torchvision==0.21.0 --index-url https://download.pytorch.org/whl/cu124
pip install transformers==4.57.0 qwen-omni-utils accelerate
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.3/flash_attn-2.7.3+cu12torch2.6cxx11abiFALSE-cp311-cp311-linux_x86_64.whl
sudo apt update && sudo apt install -y ffmpeg

Single Video Inference

python demo_inference.py assets/titanic.mp4

Batch Video Inference

python batch_inference.py --video_dir /demo_test --model_path Model/ASID-Captioner-3B
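For reference, the batch step boils down to collecting the video files under `--video_dir` and writing one result per line. The sketch below shows that plumbing with a placeholder instead of the real model call; the function names and JSONL layout are assumptions, not the actual `batch_inference.py` internals:

```python
import json
import tempfile
from pathlib import Path

def collect_videos(video_dir, exts=(".mp4", ".avi", ".mov", ".mkv")):
    """Gather video files for batch inference (mirrors the --video_dir option)."""
    return sorted(p for p in Path(video_dir).iterdir() if p.suffix.lower() in exts)

def write_results(results, out_path):
    """Write one JSON record per line so partial runs remain usable."""
    with open(out_path, "w", encoding="utf-8") as f:
        for r in results:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")

# Demo with empty placeholder files; a real run would call the captioner here.
tmp = Path(tempfile.mkdtemp())
(tmp / "a.mp4").touch()
(tmp / "b.mov").touch()
(tmp / "notes.txt").touch()  # ignored: not a video extension

videos = collect_videos(tmp)
results = [{"video": v.name, "caption": "<pending>"} for v in videos]
write_results(results, tmp / "pred.jsonl")
print([v.name for v in videos])  # -> ['a.mp4', 'b.mov']
```

The per-line JSONL format also matches what the evaluation scripts below expect as a `--caption_file`.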

2.2 Training

cd ms-swift-3.9.3/
pip install -e .
pip install deepspeed==0.18.3 liger-kernel==0.6.4

Stage 1-2

bash train_qwen2.5-omi-stage1-2.sh

Stage 3

bash train_qwen2.5-omi-stage3.sh

📈 Benchmark Evaluation

Audiovisual Caption

  1. video-SALMONN2-testset:

    cd eval_scripts/video-SALMONN2-testset
    bash eval_video-SALMONN2-test.sh
  2. UGC-VideoCap:

    cd eval_scripts/UGC-VideoCap
    bash eval_UGC-VideoCap.sh

QA-based Audiovisual Caption

  1. Daily-Omni:

    cd eval_scripts/Daily-Omni/
    bash Daily-Omni_pipeline.sh
  2. WorldSense:

    cd eval_scripts/WorldSense/
    bash WorldSense_pipeline.sh

Visual-only Caption

  1. VDC:

    cd eval_scripts/VDC/
    bash VDC.sh
  2. VidCapBench-AE

    cd eval_scripts/VidCapBench-AE/
    bash VidCapBench.sh

Caption-based Temporal Grounding

Charades-STA

cd eval_scripts/Charades/
bash Charades.sh

Attribute-based Instruction Following

python eval_scripts/Attrbute/evaluation.py --caption_file pred.jsonl --prompt_file eval_scripts/Attrbute/prompts.jsonl
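To give an idea of what attribute-based instruction following measures, here is a toy scorer: the fraction of attributes requested by each prompt that actually appear in the predicted structured caption. This is an illustrative sketch only; the repo's `evaluation.py` may use an LLM judge or different matching criteria:

```python
def score_attribute_following(preds, prompts):
    """Toy metric: fraction of requested attributes present in predictions.

    Illustrative assumption: each prediction carries a dict of attribute
    fields, and each prompt lists the attribute names it asked for.
    """
    hits, total = 0, 0
    for pred, prompt in zip(preds, prompts):
        requested = prompt["attributes"]
        total += len(requested)
        hits += sum(1 for attr in requested if attr in pred["caption"])
    return hits / max(total, 1)

# One prediction covering subjects/actions/audio; the prompt asked for
# subjects/actions/scene, so 2 of 3 requested attributes are satisfied.
preds = [{"caption": {"subjects": "...", "actions": "...", "audio": "..."}}]
prompts = [{"attributes": ["subjects", "actions", "scene"]}]
print(score_attribute_following(preds, prompts))  # 2/3
```

A per-attribute score like this rewards models that follow the requested structure rather than emitting one long caption regardless of the instruction.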

🔥 Results

We provide detailed quantitative results on different benchmarks and settings as shown below.

Experimental Results

✒️ Citation

If you find our work helpful for your research, please consider giving a star ⭐ and citing our paper. We appreciate your support!

@article{li2026towards,
  title={Towards Universal Video MLLMs with Attribute-Structured and Quality-Verified Instructions},
  author={Li, Yunheng and Zhang, Hengrui and Guo, Meng-Hao and Gao, Wenzhao and Jia, Shaoyong and Jiao, Shaohui and Hou, Qibin and Cheng, Ming-Ming},
  journal={arXiv preprint arXiv:2602.13013},
  year={2026}
}
