Important
The currently available checkpoints are undertrained and lack visual-auditory alignment data due to source constraints, which may lead to unpredictable behavior in some cases.
2025-04-02: First release of Open-Omni-Nexus, a fully open-source implementation of a GPT-4o-like speech-to-speech video understanding model.
Since there isn't a fully open-sourced repository for a GPT-4o-like end-to-end omni model with complete training code and data, I've made this repository public in the hope that it will be useful to the community. The entire code logic is based on LLaVA.
Must Read
- Visual Instruction Tuning, for visual post-training
- LLaMA-Omni: Seamless Speech Interaction with Large Language Models, for speech/audio post-training
This codebase is tested on CUDA 11.8 and A800-80G.
conda create -n open_omni python=3.10 -y && conda activate open_omni
pip install torch==2.5.0 torchvision==0.20.0 torchaudio==2.5.0 --index-url https://download.pytorch.org/whl/cu118
pip install -e ".[train]"
pip install packaging && pip install ninja && pip install flash-attn==2.6.3 --no-build-isolation --no-cache-dir
pip install -r requirements.txt
Additionally, install fairseq for speech unit processing.
Possible environment issues:
- flash-attn fails to install: try installing the prebuilt wheel directly:
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.6.3/flash_attn-2.6.3+cu118torch2.4cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
- fairseq environment conflict:
pip install pip==24
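After installation, a quick sanity check can confirm that the core dependencies are importable; a minimal sketch, assuming the environment set up above:

# env_check.py -- minimal sanity check for the environment above (illustrative sketch)
import importlib

import torch

print(f"torch {torch.__version__}, CUDA build {torch.version.cuda}")
print(f"CUDA available: {torch.cuda.is_available()}")

for pkg in ("flash_attn", "fairseq"):
    try:
        mod = importlib.import_module(pkg)
        print(f"{pkg} {getattr(mod, '__version__', 'unknown')} imported OK")
    except ImportError as err:
        print(f"{pkg} missing or broken: {err}")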
Data download:
- Download LLaVA-NeXT-Data for visual instruction tuning.
- Download VoiceAssistant-400K for speech instruction tuning.
You can sample a portion of these datasets or collect more data based on your computational resources.
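If you decide to subsample, here is a minimal sketch of trimming an instruction-tuning JSON list (the file names below are placeholders; the texts/*.json layout is described in the next section):

# subsample.py -- keep a random subset of an instruction-tuning JSON list (sketch; paths are placeholders)
import json
import random

random.seed(0)

with open("open_omni/inputs/texts/voiceassistant.json") as f:
    samples = json.load(f)

# keep at most 100K samples, e.g. to fit a smaller compute budget
subset = random.sample(samples, k=min(100_000, len(samples)))

with open("open_omni/inputs/texts/voiceassistant_100k.json", "w") as f:
    json.dump(subset, f, ensure_ascii=False, indent=2)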
Data preprocess:
- Install CosyVoice or ChatTTS for speech synthesis and test speech generation. If you are interested in how the speech instructions are synthesized, refer to the scripts in preprocess/tts.
- Download mHuBERT and the K-means model to checkpoints/quantizer for speech unit generation. Refer to the scripts in preprocess/quantize for the speech unit generation process. We provide processed samples from VoiceAssistant for convenience; check it out.
Optional: to assist with visual-audio instruction tuning, we additionally convert user queries from LLaVA-NeXT into audio using CosyVoice. If you are interested in how these audio instructions are constructed, refer to the scripts in preprocess/tts.
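If you only want to prototype this step without setting up CosyVoice, a rough sketch of the same idea with ChatTTS follows (the loading and inference calls mirror the usage example later in this README; the query mapping and output paths are hypothetical, and preprocess/tts remains the reference implementation):

# tts_queries.py -- convert user query text to wav files with ChatTTS (illustrative sketch)
import os

import torch
import torchaudio
import ChatTTS

chat = ChatTTS.Chat()
chat.load(source="local", compile=True)  # same loading call as in the usage example below

# hypothetical mapping: sample id -> user query text extracted from LLaVA-NeXT
queries = {"000000240632": "What is shown in this image?"}

out_dir = "open_omni/inputs/speech/interinst"
os.makedirs(out_dir, exist_ok=True)

for sample_id, text in queries.items():
    wav = torch.from_numpy(chat.infer(text))  # numpy waveform, as in the demo below
    if wav.dim() == 1:
        wav = wav.unsqueeze(0)                # torchaudio expects (channels, samples)
    torchaudio.save(os.path.join(out_dir, f"{sample_id}_0.wav"), wav, 24000)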
Data sample
{
"id": "000000240632",
"image": "000000240632.jpg",
"conversations": [
{
"from": "human",
"value": "<image>\n"
},
{
"from": "human",
"value": "<speech>\n"
},
{
"from": "gpt",
"value": "Hi, I am Open-Omni, the video show ...",
"tgt_units": [497, 300, 63, ...]
}
],
"speech": [
"000000240632_0.wav",
"000000240632_1.wav"
]
},
The final data is organized in the following format:
open_omni/inputs
├── images/ # images
└── llava-next/
├── ...
└── xxxx.jpg
├── speech/
├── voiceassistant/
├── ...
└── xxxx.wav
└── interinst/
├── ...
└── xxxx.wav
└── texts/
├── llava_next_audio.json
├── llava_next_audio_units.json
├── voiceassistant.json
└── voiceassistant_units.json
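Before training, it can help to verify that the files referenced by the annotation JSONs actually exist under open_omni/inputs; a minimal sketch (field names follow the data sample above, subdirectories follow the tree; adjust the paths to your own layout):

# check_data.py -- verify that referenced images and speech files exist (illustrative sketch)
import json
import os

ROOT = "open_omni/inputs"

with open(os.path.join(ROOT, "texts", "voiceassistant_units.json")) as f:
    samples = json.load(f)

missing = []
for sample in samples:
    if "image" in sample:
        path = os.path.join(ROOT, "images", "llava-next", sample["image"])
        if not os.path.exists(path):
            missing.append(path)
    for wav in sample.get("speech", []):
        path = os.path.join(ROOT, "speech", "voiceassistant", wav)
        if not os.path.exists(path):
            missing.append(path)

print(f"{len(missing)} missing files across {len(samples)} samples")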
- Supported LLM Backbone
- Qwen2-7B-Instruct
- Llama-3.1-8B-Instruct
- LongVA-7B, for quick speech adaptation.
- Llama-3.1-8B-Omni, for quick visual adaptation.
- Supported Vision Encoder
- Supported Speech Encoder
- Misc
- HiFi-GAN vocoder with config
Place them in the open_omni/checkpoints directory.
open_omni/checkpoints
├── Qwen2-7B-Instruct
├── Meta-Llama-3.1-8B-Instruct
├── clip-vit-large-patch14-336
├── whisper/large-v3.pt
└── vocoder
├── config.json
└── g_00500000
Important
THIS IS A TRIAL: train OpenOmni end-to-end in a single process starting from an LLM by creating visual-audio-speech data and then running scripts/finetune_openomni_os.sh.
Or enhance existing models with the ability to see, hear, and speak by:
1. Visual instruction tuning
We provide the default visual instruction tuning pipeline for LLMs.
Here, we provide a pipeline for visual instruction tuning on Qwen2-7B-Instruct or Llama-3.1-8B using the datasets blip_laion_cc_sbu_558k, LLaVA-NeXT-Data, and ShareGPTVideo. If you wish to use other LLMs or instruction tuning data, feel free to follow the LLaVA-NeXT pipeline and adapt it to other models.
cd open_omni
bash scripts/lvlm_pretrain.sh
bash scripts/lvlm_finetune.sh
bash scripts/lvlm_dpo.sh
Alternatively, you can use an off-the-shelf speech model like Llama-3.1-8B-Omni and enhance it with visual understanding by running
bash scripts/finetune_visual.sh
2. Audio/Speech instruction tuning
Similarly, you can take an LLM or an off-the-shelf visual model like LongVA-7B and enhance it with auditory understanding by running
bash scripts/finetune_auditory.sh
To assist those with limited computational resources, we also provide an off-the-shelf checkpoint. Check it out at
We can combine step 1 and step 2 to perform visual-audio instruction tuning simultaneously.
To enhance the model's visual-audio understanding, we offer a script to fine-tune it on a synthetic dataset in which the LLaVA-NeXT user queries are converted to speech with CosyVoice. This aims to improve visual-audio alignment performance. (This process takes ~140 hours on 4 A800 GPUs.)
NOTE: We find that this process is more prone to collapse than audio instruction tuning alone, so we provide a model just for further study.
bash scripts/finetune_visionaudio.sh
For those with limited computational resources, we also provide a ready-to-use checkpoint (17,500 steps). You can access it here
Try the visual-audio base model with:
python -m local_demo.baseline_audio_cli --video_path local_demo/assets/water.mp4 --question_audio "local_demo/wav/water.mp4.wav"
3. Speech generator tuning
For speech generation, we adopt the tuning strategy from LLaMA-Omni, utilizing the connectionist temporal classification (CTC) loss to align the hidden states of the LLM with discrete speech units extracted by the HuBERT and K-means models.
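Conceptually, this objective is a standard CTC loss over unit logits projected from the LLM hidden states. Below is a minimal sketch with torch.nn.CTCLoss; the hidden size, the 1000-unit K-means vocabulary, and the extra blank index are assumptions for illustration, not the exact training code:

# ctc_units.py -- sketch of aligning LLM hidden states to discrete speech units with CTC
import torch
import torch.nn as nn

hidden_size, unit_vocab_size = 3584, 1000                # assumed sizes; blank id = unit_vocab_size
unit_head = nn.Linear(hidden_size, unit_vocab_size + 1)  # projects hidden states to unit logits (+1 for blank)
ctc_loss = nn.CTCLoss(blank=unit_vocab_size, zero_infinity=True)

# toy batch: 2 sequences of 120 decoder positions and 40 target units each
hidden_states = torch.randn(2, 120, hidden_size)
tgt_units = torch.randint(0, unit_vocab_size, (2, 40))

log_probs = unit_head(hidden_states).log_softmax(-1).transpose(0, 1)  # (T, batch, vocab+1)
input_lengths = torch.full((2,), 120, dtype=torch.long)
target_lengths = torch.full((2,), 40, dtype=torch.long)

loss = ctc_loss(log_probs, tgt_units, input_lengths, target_lengths)
loss.backward()
print(f"CTC loss: {loss.item():.3f}")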
bash scripts/finetune_speechgen.sh
As a result, you can flexibly enhance your model to progressively approach an omni model by following any of the previous post-training steps.
We demonstrate a usage example for our OpenOmni-7B-Qwen2-Omni model, which is fine-tuned from LongVA using VoiceAssistant (100K).
import os
import json
from PIL import Image
import numpy as np
import torchaudio
import torch
from decord import VideoReader, cpu
import whisper
import soundfile as sf
# fix seed
torch.manual_seed(0)
from fairseq import utils as fairseq_utils
from fairseq.models.text_to_speech.vocoder import CodeHiFiGANVocoder
from open_omni.model.builder import load_pretrained_model
from open_omni.mm_utils import tokenizer_image_speech_tokens, process_images, ctc_postprocess
from open_omni.constants import IMAGE_TOKEN_INDEX, SPEECH_TOKEN_INDEX
import warnings
warnings.filterwarnings("ignore")
# config OpenOmni
model_path = "checkpoints/OpenOmni-7B-Qwen2-Omni"
video_path = "local_demo/assets/water.mp4"
audio_path = "local_demo/wav/water.mp4.wav"
max_frames_num = 16 # you can increase this to several thousand as long as your GPU memory can handle it :)
gen_kwargs = {"do_sample": True, "temperature": 0.5, "top_p": None, "num_beams": 1, "use_cache": True, "max_new_tokens": 1024}
tokenizer, model, image_processor, _ = load_pretrained_model(model_path, None, "llava_s2s_qwen", device_map="cuda:0") # for llama -> llava_s2s_llama
# config vocoder
with open("checkpoints/vocoder/config.json") as f:
vocoder_cfg = json.load(f)
vocoder = CodeHiFiGANVocoder("checkpoints/vocoder/g_00500000", vocoder_cfg).cuda()
# query input
query = "Give a detailed caption of the video as if I am blind."
query = None # comment out this line to let ChatTTS convert the text query above to audio
#video input
prompt = "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<image><|im_end|>\n<|im_start|>user\n<speech>\n<|im_end|>\n<|im_start|>assistant\n"
input_ids = tokenizer_image_speech_tokens(prompt, tokenizer, IMAGE_TOKEN_INDEX, SPEECH_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(model.device)
vr = VideoReader(video_path, ctx=cpu(0))
total_frame_num = len(vr)
uniform_sampled_frames = np.linspace(0, total_frame_num - 1, max_frames_num, dtype=int)
frame_idx = uniform_sampled_frames.tolist()
frames = vr.get_batch(frame_idx).asnumpy()
video_tensor = image_processor.preprocess(frames, return_tensors="pt")["pixel_values"].to(model.device, dtype=torch.float16)
#audio input
# process speech for input question
if query is not None:
import ChatTTS
chat = ChatTTS.Chat()
chat.load(source='local', compile=True)
audio_path = "./local_demo/wav/" + "infer.wav"
if os.path.exists(audio_path): os.remove(audio_path) # refresh
if not os.path.exists(audio_path):
wav = chat.infer(query)
try:
torchaudio.save(audio_path, torch.from_numpy(wav).unsqueeze(0), 24000)
except:
torchaudio.save(audio_path, torch.from_numpy(wav), 24000)
print(f"Human: {query}")
else:
print("Human: <audio>")
speech = whisper.load_audio(audio_path)
speech = whisper.pad_or_trim(speech)
speech = whisper.log_mel_spectrogram(speech, n_mels=128).permute(1, 0).to(device=model.device, dtype=torch.float16)
speech_length = torch.LongTensor([speech.shape[0]]).to(model.device)
with torch.inference_mode():
output_ids, output_units = model.generate(input_ids, images=[video_tensor], modalities=["video"], speeches=speech.unsqueeze(0), speech_lengths=speech_length, **gen_kwargs)
outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(f"Agent: {outputs}")
output_units = ctc_postprocess(output_units, blank=model.config.unit_vocab_size)
output_units = [(list(map(int, output_units.strip().split())))]
print(f"Units: {output_units}")
x = {"code": torch.LongTensor(output_units[0]).view(1,-1)}
x = fairseq_utils.move_to_cuda(x)
wav = vocoder(x, True)
output_file_path = "local_demo/wav/output.wav"
sf.write(
output_file_path,
wav.detach().cpu().numpy(),
16000
)
print(f"The generated wav saved to {output_file_path}")- Launch a controller.
python -m local_demo.controller --host 0.0.0.0 --port 10000
- Launch a gradio web server.
python -m local_demo.gradio_web_server --controller http://localhost:10000 --port 8000 --model-list-mode reload --vocoder checkpoints/vocoder/g_00500000 --vocoder-cfg checkpoints/vocoder/config.json
NOTE: for Llama models, change template_name from qwen_1_5 to llava_llama_3 at line 115 in local_demo/gradio_web_server.py.
- Launch a model worker.
python -m local_demo.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path checkpoints/OpenOmni-7B-Qwen2-Omni --model-name llava_s2s_qwen
- Visit http://localhost:8000/
gradio_demo.mp4
- To collect high-quality visual-audio-speech data
- To support streaming
- To support multilingual and more voice control
- To support more LLM/Vision Encoder/Speech Encoder/Speech Coder
- visual instruction tuning
- audio/speech instruction tuning
- speech generator tuning
If you find our repository helpful, please consider citing our new work.
@inproceedings{omnimmi,
title={OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts},
author={Wang, Yuxuan and Wang, Yueqian and Chen, Bo and Wu, Tong and Zhao, Dongyan and Zheng, Zilong},
booktitle={CVPR},
year={2025}
}