video-embeddings

A proof of concept for what should be SOTA video embeddings

In principle, you should be able to combine an automatic speech recognition (ASR) model like Whisper, an image captioning model like MiniGPT-4, and just about any GPT-3.5-quality LLM to get near-human-level understanding of videos. The formula would be something like this (a rough code sketch follows the list):

  1. Download video
  2. Run Whisper to transcribe what's being said in the video
  3. Run MiniGPT-4 image captioning on frames extracted from the video every 15 seconds or so
  4. Interleave the transcript segments and image captions into a single, time-ordered text
  5. Use an LLM to summarize and understand what's going on in the video
  6. (optional) Encode the LLM's summary with any text embedding model to enable tasks like video classification
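A minimal sketch of steps 2–5, assuming the openai-whisper and opencv-python packages, the OpenAI Python client for the summarization step, and a placeholder `caption_frame` function standing in for MiniGPT-4 (or any other image captioner):

```python
# Sketch of steps 2-5: transcribe, caption frames, interleave, summarize.
# Assumes: openai-whisper, opencv-python, and the openai client are installed,
# and that caption_frame() is wired up to MiniGPT-4 (or any image captioner).
import cv2
import whisper
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def caption_frame(frame) -> str:
    """Placeholder: send the frame to MiniGPT-4 (or BLIP-2, etc.) and return a caption."""
    raise NotImplementedError


def describe_video(path: str, frame_interval_s: float = 15.0) -> str:
    # Step 2: transcribe the audio with Whisper.
    asr = whisper.load_model("base")
    segments = asr.transcribe(path)["segments"]  # each has start, end, text

    # Step 3: caption one frame every ~15 seconds.
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    captions, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % int(fps * frame_interval_s) == 0:
            captions.append((idx / fps, f"[frame] {caption_frame(frame)}"))
        idx += 1
    cap.release()

    # Step 4: interleave transcript segments and captions by timestamp.
    events = [(s["start"], f"[speech] {s['text'].strip()}") for s in segments] + captions
    interleaved = "\n".join(text for _, text in sorted(events))

    # Step 5: ask an LLM to summarize what's going on in the video.
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Summarize what happens in this video."},
            {"role": "user", "content": interleaved},
        ],
    )
    return resp.choices[0].message.content
```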

I see no reason why the resulting embeddings would not be state of the art, specifically for the purpose of understanding video content.
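Step 6 is then a one-liner against any text embedding model. As one possible sketch (the model choice and the zero-shot label prompts here are arbitrary, not part of the repo), the summary can be embedded and compared to candidate labels by cosine similarity:

```python
# Sketch of step 6: embed the LLM summary and classify by cosine similarity.
# Assumes the same OpenAI client as above; any text embedding model would do.
import numpy as np
from openai import OpenAI

client = OpenAI()


def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(resp.data[0].embedding)


def classify(summary: str, labels: list[str]) -> str:
    v = embed(summary)
    scores = {
        label: float(np.dot(v, e) / (np.linalg.norm(v) * np.linalg.norm(e)))
        for label, e in ((l, embed(f"a video about {l}")) for l in labels)
    }
    return max(scores, key=scores.get)


# e.g. classify(describe_video("video.mp4"), ["cooking", "sports", "news"])
```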
