In principle, you should be able to combine an automatic speech recognition (ASR) model like Whisper, an image-captioning model like MiniGPT-4, and just about any GPT-3.5-quality LLM to get near-human-level understanding of videos. The formula would be something like:
- Download video
- Run Whisper to transcribe what's being said in the video
- Run MiniGPT-4 image captioning on frames extracted from the video every 15 seconds or so
- Interleave the transcript segments and image captions into a single timestamped text
- Use an LLM to summarize and understand what's going on in the video
- (Optional) Encode the LLM's summary with any text-embedding model to enable downstream tasks like video classification
I see no reason why the resulting encodings wouldn't be state of the art, at least for the purpose of understanding videos.
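Here's a minimal sketch of what the pipeline might look like in Python, assuming `openai-whisper` and OpenCV are installed and ffmpeg is on the path. `caption_frame()` and `summarize_with_llm()` are hypothetical stand-ins for the MiniGPT-4 captioner and a GPT-3.5-quality chat model, since neither exposes a one-line API:

```python
import cv2
import whisper


def caption_frame(frame) -> str:
    # Placeholder: call MiniGPT-4 (or any image-captioning model) on the frame here.
    raise NotImplementedError


def summarize_with_llm(prompt: str) -> str:
    # Placeholder: send the prompt to any GPT-3.5-quality chat model.
    raise NotImplementedError


def transcribe_audio(video_path: str) -> list:
    """Whisper transcription with per-segment timestamps."""
    model = whisper.load_model("base")
    result = model.transcribe(video_path)
    return result["segments"]  # each segment has "start", "end", "text"


def caption_frames(video_path: str, every_s: float = 15.0) -> list:
    """Grab one frame every `every_s` seconds and caption it."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(1, int(fps * every_s))
    captions, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            captions.append((idx / fps, caption_frame(frame)))
        idx += 1
    cap.release()
    return captions


def build_prompt(segments, captions) -> str:
    """Interleave speech and visual descriptions, ordered by timestamp."""
    events = [(s["start"], f'[speech] {s["text"].strip()}') for s in segments]
    events += [(t, f"[frame]  {c}") for t, c in captions]
    events.sort(key=lambda e: e[0])
    return "\n".join(f"{t:7.1f}s {line}" for t, line in events)


# Transcribe, caption, interleave, and summarize an already-downloaded file.
video = "video.mp4"
prompt = build_prompt(transcribe_audio(video), caption_frames(video))
summary = summarize_with_llm("Summarize what is going on in this video:\n\n" + prompt)
```

From there, the summary text could be run through any off-the-shelf text-embedding model to produce the vector used for classification or retrieval.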