In principle, you should be able to combine an automatic speech recognition (ASR) model like Whisper, an image-captioning model like MiniGPT-4, and just about any GPT-3.5-quality LLM to get near-human-level understanding of videos. The formula would be something like:
- Download video
- Run Whisper to transcribe what's being said in the video
- Run MiniGPT-4 image captioning on frames extracted from the video every 15 seconds or so
- Interleave the transcript segments and image captions into a single timestamped text
- Use an LLM to summarize and understand what's going on in the video
- (Optional) Encode the LLM's summary with any text-embedding model to enable downstream tasks like video classification
I see no reason why the resulting encodings wouldn't be state of the art, at least for the purpose of understanding videos.
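Here's a minimal sketch of what the pipeline might look like in Python, assuming `openai-whisper` and OpenCV are installed and ffmpeg is on the path. `caption_frame()` and `summarize_with_llm()` are hypothetical stand-ins for the MiniGPT-4 captioner and a GPT-3.5-quality chat model, since neither exposes a one-line API:

```python
import cv2
import whisper


def caption_frame(frame) -> str:
    # Placeholder: call MiniGPT-4 (or any image-captioning model) on the frame here.
    raise NotImplementedError


def summarize_with_llm(prompt: str) -> str:
    # Placeholder: send the prompt to any GPT-3.5-quality chat model.
    raise NotImplementedError


def transcribe_audio(video_path: str) -> list:
    """Whisper transcription with per-segment timestamps."""
    model = whisper.load_model("base")
    result = model.transcribe(video_path)
    return result["segments"]  # each segment has "start", "end", "text"


def caption_frames(video_path: str, every_s: float = 15.0) -> list:
    """Grab one frame every `every_s` seconds and caption it."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(1, int(fps * every_s))
    captions, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            captions.append((idx / fps, caption_frame(frame)))
        idx += 1
    cap.release()
    return captions


def build_prompt(segments, captions) -> str:
    """Interleave speech and visual descriptions, ordered by timestamp."""
    events = [(s["start"], f'[speech] {s["text"].strip()}') for s in segments]
    events += [(t, f"[frame]  {c}") for t, c in captions]
    events.sort(key=lambda e: e[0])
    return "\n".join(f"{t:7.1f}s {line}" for t, line in events)


# Transcribe, caption, interleave, and summarize an already-downloaded file.
video = "video.mp4"
prompt = build_prompt(transcribe_audio(video), caption_frames(video))
summary = summarize_with_llm("Summarize what is going on in this video:\n\n" + prompt)
```

From there, the summary text could be run through any off-the-shelf text-embedding model to produce the vector used for classification or retrieval.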