
NVIDIA released Cosmos 3, a multimodal model that unifies language, image, video, audio, and action in a Mixture-of-Transformers architecture. It comes in two base sizes: Nano, with an 8B reasoner tower and an 8B generator tower, and Super, with a 32B reasoner tower and a 32B generator tower. The Super line includes finetunes for Text2Image and Image2Video, described as the new state-of-the-art open-weight models for image and video generation, placing just below Nano Banana 2. The release came at Computex in Taiwan. The architecture pairs an autoregressive reasoner with a diffusion generator, so one model handles both multimodal reasoning and generation. This lets developers build applications on a single model with unified vision and language understanding.
Why Video Agent models are next, Ethan He, xAI Grok Imagine Lead
Summary by ByteBrief