
Open Omni Nexus

Important

The currently available checkpoints are undertrained and lack visual-auditory alignment data due to source constraints, which may lead to unpredictable behavior in some cases.

Updates

  • 2025-04-02 First release of Open-Omni-Nexus, a fully open-source implementation of a GPT-4o-like speech-to-speech video understanding model.

Introduction

Since there isn't a fully open-sourced repository for a GPT-4o-like end-to-end omni model with complete training code and data, I've made this repository public in the hope that it will be useful to the community. The entire code logic is based on LLaVA.

Prior Knowledge

Must Read

Training

Installation

This codebase is tested on CUDA 11.8 and A800-80G.

conda create -n open_omni python=3.10 -y && conda activate open_omni
pip install torch==2.5.0 torchvision==0.20.0 torchaudio==2.5.0 --index-url https://download.pytorch.org/whl/cu118
pip install -e ".[train]"
pip install packaging &&  pip install ninja && pip install flash-attn==2.6.3 --no-build-isolation --no-cache-dir
pip install -r requirements.txt

Additionally, install fairseq for speech unit processing.

Possible environment issues:
  • flash-attn fails to install: try installing a prebuilt wheel directly: pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.6.3/flash_attn-2.6.3+cu118torch2.4cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
  • fairseq dependency conflict: downgrade pip with pip install pip==24
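
After installation, a quick sanity check (a minimal sketch, assuming the packages installed above) can confirm that the CUDA build of PyTorch, flash-attn, and fairseq import correctly:

# Minimal environment sanity check (assumes the installation steps above).
import torch

print(torch.__version__, torch.version.cuda)   # expect 2.5.0 / 11.8
print(torch.cuda.is_available())               # expect True on the A800 node

import flash_attn                              # raises ImportError if the wheel/build failed
import fairseq                                 # needed later for speech units and the vocoder
print(flash_attn.__version__, fairseq.__version__)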

Data Preparation

Data download:

You can sample a portion of these datasets or collect more data based on your computational resources.

Data preprocess:

  • Install CosyVoice or ChatTTS for speech synthesis and test speech generation. If you are interested in how the speech instructions are synthesized, refer to the scripts in preprocess/tts.
  • Download the mHuBERT and K-means models to checkpoints/quantizer for speech unit generation. Refer to the scripts in preprocess/quantize for the speech unit generation process (a rough sketch follows this list). For convenience, we provide processed samples from VoiceAssistant; check them at Dataset.
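
As a rough illustration of the unit-generation step (the checkpoint filenames and the feature layer below are assumptions; follow the preprocess/quantize scripts for the exact setup), mHuBERT features can be extracted with fairseq and mapped to discrete units with a joblib-serialized k-means model:

# Hedged sketch of speech-unit extraction (checkpoint names and layer index are assumptions).
import joblib
import torch
import torchaudio
from fairseq import checkpoint_utils

models, _, _ = checkpoint_utils.load_model_ensemble_and_task(["checkpoints/quantizer/mhubert_base.pt"])
hubert = models[0].eval().cuda()
kmeans = joblib.load("checkpoints/quantizer/kmeans_1000.bin")   # sklearn k-means with .predict()

wav, sr = torchaudio.load("open_omni/inputs/speech/voiceassistant/xxxx.wav")
wav = torchaudio.functional.resample(wav, sr, 16000).mean(dim=0, keepdim=True).cuda()

with torch.inference_mode():
    # Features from an intermediate transformer layer (layer 11 is a common choice for mHuBERT units).
    feats, _ = hubert.extract_features(source=wav, padding_mask=None, output_layer=11)

units = kmeans.predict(feats.squeeze(0).cpu().numpy()).tolist()  # e.g. [497, 300, 63, ...]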

Optional: in addition, to assist with visual-audio instruction tuning, we convert user queries from LLaVA-NeXT into audio using CosyVoice. If you are interested in how the audio instructions are constructed, refer to the scripts in preprocess/tts.
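
The repository's preprocess/tts scripts use CosyVoice for this step; as an illustration only, the sketch below converts text queries to wav files with ChatTTS, whose usage also appears in the CLI example further down. The paths and the query dict are assumptions.

# Illustrative only: batch-convert text queries to speech with ChatTTS
# (the repo uses CosyVoice here; paths and the query dict are assumptions).
import os
import torch
import torchaudio
import ChatTTS

chat = ChatTTS.Chat()
chat.load(source='local', compile=True)

queries = {"000000240632_0": "What is shown in the image?"}   # hypothetical id -> query text
out_dir = "open_omni/inputs/speech/interinst"
os.makedirs(out_dir, exist_ok=True)

for qid, text in queries.items():
    wav = chat.infer(text)                                    # waveform sampled at 24 kHz
    out_path = os.path.join(out_dir, f"{qid}.wav")
    try:
        torchaudio.save(out_path, torch.from_numpy(wav).unsqueeze(0), 24000)
    except Exception:                                         # ChatTTS may already return 2-D audio
        torchaudio.save(out_path, torch.from_numpy(wav), 24000)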

Data sample

    {
        "id": "000000240632",
        "image": "000000240632.jpg",
        "conversations": [
            {
                "from": "human",
                "value": "<image>\n"
            },
            {
                "from": "human",
                "value": "<speech>\n"
            },
            {
                "from": "gpt",
                "value": "Hi, I am Open-Omni, the video show ...",
                "tgt_units": [497, 300, 63, ...]

            },
        ],
        "speech": [
            "000000240632_0.wav",
            "000000240632_1.wav"
        ]
    },
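
A quick check like the sketch below (the annotation filename is taken from the layout in the next block; treat it as an assumption) can verify that converted samples are well-formed, e.g. that gpt turns carry tgt_units and that speech turns have matching wav entries:

# Sanity-check converted annotations (filename is an assumption; adjust to your setup).
import json

with open("open_omni/inputs/texts/voiceassistant_units.json") as f:
    data = json.load(f)

for sample in data:
    speech_turns = [t for t in sample["conversations"] if t["from"] == "human" and "<speech>" in t["value"]]
    assert len(speech_turns) <= len(sample.get("speech", [])), f"missing wav files for {sample['id']}"
    for turn in sample["conversations"]:
        if turn["from"] == "gpt":
            assert "tgt_units" in turn, f"missing tgt_units in {sample['id']}"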

The final data is organized in the following format:

open_omni/inputs  
    ├── images/ # images
      └── llava-next/
        ├── ...
        └── xxxx.jpg
    ├── speech/
      ├── voiceassistant/
        ├── ...
        └── xxxx.wav
      └── interinst/
        ├── ...
        └── xxxx.wav
    └── texts/
      ├── llava_next_audio.json
      ├── llava_next_audio_units.json
      ├── voiceassistant.json
      └── voiceassistant_units.json

Pretrained Backbone Preparation

Download the following pretrained backbones and place them in the open_omni/checkpoints directory:

open_omni/checkpoints  
    ├── Qwen2-7B-Instruct
    ├── Meta-Llama-3.1-8B-Instruct
    ├── clip-vit-large-patch14-336
    ├── whisper/large-v3.pt
    └── vocoder
        ├── config.json
        └── g_00500000
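
As a hedged sketch for fetching these backbones (the Hugging Face repo IDs and the whisper download root are assumptions; the unit-vocoder files config.json and g_00500000 are distributed separately and still need to be downloaded by hand):

# Sketch: fetch the Hugging Face backbones and the Whisper encoder (repo ids are assumptions;
# the unit HiFi-GAN vocoder files g_00500000/config.json must be obtained separately).
from huggingface_hub import snapshot_download
import whisper

for repo_id, local_dir in [
    ("Qwen/Qwen2-7B-Instruct", "open_omni/checkpoints/Qwen2-7B-Instruct"),
    ("meta-llama/Meta-Llama-3.1-8B-Instruct", "open_omni/checkpoints/Meta-Llama-3.1-8B-Instruct"),
    ("openai/clip-vit-large-patch14-336", "open_omni/checkpoints/clip-vit-large-patch14-336"),
]:
    snapshot_download(repo_id=repo_id, local_dir=local_dir)

# openai-whisper downloads and caches large-v3.pt under download_root
whisper.load_model("large-v3", download_root="open_omni/checkpoints/whisper")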

Start Training

Important

THIS IS A TRIAL: train OpenOmni end-to-end in a single process starting from the LLM by creating visual-audio-speech data and then running scripts/finetune_openomni_os.sh.

Or enhance existing models with the ability to see, hear, and speak by:

1. Visual instruction tuning

We provide the default visual instruction tuning pipeline for LLMs.

If you wish to use other LLMs or instruction tuning data, feel free to follow the LLaVA-NeXT pipeline. Here, we provide a pipeline to do visual instruction tuning on Qwen2-7B-Instruct or Llama-3.1-8B using the datasets blip_laion_cc_sbu_558k, LLaVA-NeXT-Data, and ShareGPTVideo. Feel free to adapt it to other models.

cd open_omni
bash scripts/lvlm_pretrain.sh
bash scripts/lvlm_finetune.sh
bash scripts/lvlm_dpo.sh

Alternatively, you can use an off-the-shelf speech model like Llama-3.1-8B-Omni and enhance it with visual understanding by running

bash scripts/finetune_visual.sh

2. Audio/Speech instruction tuning

Similarly, you can use an LLM or an off-the-shelf visual model like LongVA-7B and enhance it with auditory understanding by running

bash scripts/finetune_auditory.sh

To assist those with limited computational resources, we also provide an off-the-shelf checkpoint. Check it out at Model

We can combine step 1 and step 2 to perform visual-audio instruction tuning simultaneously.

To enhance the model's visual-audio understanding capabilities, we offer a script to fine-tune it on a synthetic Dataset that converts the queries of LLaVA-NeXT to speech with CosyVoice. This aims to improve visual-audio alignment performance. (This process takes ~140 hours on 4 A800 GPUs.)

NOTE: We find that this process is more prone to collapse than audio instruction tuning alone, so the provided model is intended for further study only.

bash scripts/finetune_visionaudio.sh

For those with limited computational resources, we also provide a ready-to-use checkpoint (17,500 steps). You can access it at Model.

Try the visual-audio base model through:

python -m local_demo.baseline_audio_cli --video_path local_demo/assets/water.mp4 --question_audio "local_demo/wav/water.mp4.wav"

3. Speech generator tuning

For speech generation, we adopt the tuning strategy from LLaMA-Omni, utilizing the connectionist temporal classification (CTC) loss to align the hidden states of the LLM with discrete speech units extracted by the HuBERT and K-means models.

bash scripts/finetune_speechgen.sh
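
For intuition, here is a minimal sketch of that CTC objective, assuming a simple linear head from LLM hidden states to a discrete unit vocabulary (the dimensions, the head, and the blank index are illustrative; the actual speech generator in this repo may add upsampling or extra layers):

# Minimal CTC sketch for speech-generator tuning (dimensions, the linear head, and the
# blank index are illustrative assumptions; the repo's speech generator may differ).
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden_size, unit_vocab_size = 4096, 1000                  # hypothetical LLM width / unit vocabulary
unit_head = nn.Linear(hidden_size, unit_vocab_size + 1)    # +1 for the CTC blank

def speech_gen_ctc_loss(llm_hidden, hidden_lens, tgt_units, unit_lens):
    # llm_hidden: (B, T, H) hidden states of the answer tokens
    # tgt_units:  (B, U) discrete units (the tgt_units field in the data samples)
    logits = unit_head(llm_hidden)                              # (B, T, V+1)
    log_probs = F.log_softmax(logits, dim=-1).transpose(0, 1)   # (T, B, V+1): CTC is time-major
    # Use the last index as the blank, mirroring ctc_postprocess(blank=unit_vocab_size) below
    return F.ctc_loss(log_probs, tgt_units, hidden_lens, unit_lens,
                      blank=unit_vocab_size, zero_infinity=True)

# Toy forward pass
B, T, U = 2, 50, 30
loss = speech_gen_ctc_loss(torch.randn(B, T, hidden_size),
                           torch.full((B,), T, dtype=torch.long),
                           torch.randint(0, unit_vocab_size, (B, U)),
                           torch.full((B,), U, dtype=torch.long))
print(loss.item())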

As a result, you can flexibly enhance your model to progressively approach an omni model by following one of the previous post-training steps.

Usage

CLI Inference

We demonstrate a usage example for our OpenOmni-7B-Qwen2-Omni model, which is fine-tuned from LongVA using VoiceAssistant (100K).

import os
import json
from PIL import Image
import numpy as np
import torchaudio
import torch
from decord import VideoReader, cpu
import whisper
import soundfile as sf
# fix seed
torch.manual_seed(0)

from fairseq import utils as fairseq_utils
from fairseq.models.text_to_speech.vocoder import CodeHiFiGANVocoder

from open_omni.model.builder import load_pretrained_model
from open_omni.mm_utils import tokenizer_image_speech_tokens, process_images, ctc_postprocess
from open_omni.constants import IMAGE_TOKEN_INDEX, SPEECH_TOKEN_INDEX

import warnings
warnings.filterwarnings("ignore")

# config OpenOmni
model_path = "checkpoints/OpenOmni-7B-Qwen2-Omni"
video_path = "local_demo/assets/water.mp4"
audio_path = "local_demo/wav/water.mp4.wav"
max_frames_num = 16 # you can increase this to several thousand frames, as long as your GPU memory can handle it :)
gen_kwargs = {"do_sample": True, "temperature": 0.5, "top_p": None, "num_beams": 1, "use_cache": True, "max_new_tokens": 1024}
tokenizer, model, image_processor, _ = load_pretrained_model(model_path, None, "llava_s2s_qwen", device_map="cuda:0") # for llama -> llava_s2s_llama

# config vocoder
with open("checkpoints/vocoder/config.json") as f:
    vocoder_cfg = json.load(f)
vocoder = CodeHiFiGANVocoder("checkpoints/vocoder/g_00500000", vocoder_cfg).cuda()

# query input
query = "Give a detailed caption of the video as if I am blind."
query = None # comment out this line to synthesize the query with ChatTTS instead of using the pre-recorded audio

#video input
prompt = "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<image><|im_end|>\n<|im_start|>user\n<speech>\n<|im_end|>\n<|im_start|>assistant\n"
input_ids = tokenizer_image_speech_tokens(prompt, tokenizer, IMAGE_TOKEN_INDEX, SPEECH_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(model.device)
vr = VideoReader(video_path, ctx=cpu(0))
total_frame_num = len(vr)
uniform_sampled_frames = np.linspace(0, total_frame_num - 1, max_frames_num, dtype=int)
frame_idx = uniform_sampled_frames.tolist()
frames = vr.get_batch(frame_idx).asnumpy()
video_tensor = image_processor.preprocess(frames, return_tensors="pt")["pixel_values"].to(model.device, dtype=torch.float16)

#audio input
# process speech for input question
if query is not None:
    import ChatTTS
    chat = ChatTTS.Chat()
    chat.load(source='local', compile=True)
    audio_path = "./local_demo/wav/" + "infer.wav"
    if os.path.exists(audio_path): os.remove(audio_path) # refresh
    if not os.path.exists(audio_path):
        wav = chat.infer(query)
        try:
            torchaudio.save(audio_path, torch.from_numpy(wav).unsqueeze(0), 24000)
        except Exception:
            torchaudio.save(audio_path, torch.from_numpy(wav), 24000)
    print(f"Human: {query}")
  
else:
    print("Human: <audio>")
  
speech = whisper.load_audio(audio_path)
speech = whisper.pad_or_trim(speech)
speech = whisper.log_mel_spectrogram(speech, n_mels=128).permute(1, 0).to(device=model.device, dtype=torch.float16)
speech_length = torch.LongTensor([speech.shape[0]]).to(model.device)

with torch.inference_mode():
    output_ids, output_units = model.generate(input_ids, images=[video_tensor],  modalities=["video"], speeches=speech.unsqueeze(0), speech_lengths=speech_length, **gen_kwargs)
outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(f"Agent: {outputs}")

output_units = ctc_postprocess(output_units, blank=model.config.unit_vocab_size)
output_units = [(list(map(int, output_units.strip().split())))]
print(f"Units: {output_units}")
x = {"code": torch.LongTensor(output_units[0]).view(1,-1)}
x = fairseq_utils.move_to_cuda(x)
wav = vocoder(x, True)
output_file_path = "local_demo/wav/output.wav"
sf.write(
    output_file_path,
    wav.detach().cpu().numpy(),
    16000
)
print(f"The generated wav saved to {output_file_path}")

Gradio Demo

  1. Launch a controller.
python -m local_demo.controller --host 0.0.0.0 --port 10000
  2. Launch a gradio web server.
python -m local_demo.gradio_web_server --controller http://localhost:10000 --port 8000 --model-list-mode reload --vocoder checkpoints/vocoder/g_00500000 --vocoder-cfg checkpoints/vocoder/config.json

NOTE: for llama models, change template_name from qwen_1_5 to llava_llama_3 at line 115 of local_demo/gradio_web_server.py

  3. Launch a model worker.
python -m local_demo.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path checkpoints/OpenOmni-7B-Qwen2-Omni --model-name llava_s2s_qwen
  4. Visit http://localhost:8000/
gradio_demo.mp4

Roadmap

  • To collect high-quality visual-audio-speech data
  • To support streaming
  • To support multilingual and more voice control
  • To support more LLM/Vision Encoder/Speech Encoder/Speech Coder

Acknowledgement

Citation

If you find our repository helpful, please consider citing our new work.

@article{omnimmi,
    title={OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts},
    author={Wang, Yuxuan and Wang, Yueqian and Chen, Bo and Wu, Tong and Zhao, Dongyan and Zheng, Zilong},
    journal={CVPR},
    year={2025}
}