Ethan He asserts that video agent models derive their intelligence primarily from large language models (LLMs), not from training on video data. He argues that the next frontier for interactive, real-time world models lies in advancing LLMs, possibly through Interaction Models. In his view, the next Sora-like breakthrough will come not from a better video model but from an improved LLM. He made this claim during a Latent Space session while leading development of xAI's Grok Imagine. This perspective shifts the focus for video agents from video-data training to language-based reasoning.
Summary by ByteBrief