
A three-billion-parameter model from researchers in China, Hong Kong, and Singapore listens continuously to an audio stream and decides every 0.4 seconds whether to output <silent> or <response>. It handles translation, transcription, and real-time reactions in a single system. The model scores 58.15 on MMAU, outperforming Qwen2.5-Omni-3B, and matches larger 7B models in English-Chinese translation. Training data was built from scene-based events and audio clips generated with tools such as AudioX and ElevenLabs.
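The per-chunk decision loop can be pictured with a rough sketch like the one below. Only the 0.4-second chunk length and the <silent> / <response> outputs come from the summary; the function names, the 16 kHz sample rate in the usage note, and the energy threshold are illustrative assumptions, and `energy_policy` is a toy stand-in for the model, not how the model actually decides.

```python
from typing import Callable, Iterator, List, Sequence

CHUNK_SECONDS = 0.4  # decision interval reported in the summary


def chunk_audio(samples: Sequence[float], sample_rate: int) -> Iterator[Sequence[float]]:
    """Yield consecutive 0.4-second slices of a mono sample sequence."""
    chunk_len = int(round(CHUNK_SECONDS * sample_rate))
    for start in range(0, len(samples), chunk_len):
        yield samples[start:start + chunk_len]


def energy_policy(chunk: Sequence[float], threshold: float = 0.01) -> str:
    """Hypothetical stand-in for the model: respond when the chunk is loud.

    The real system uses a 3B-parameter model here; this threshold rule
    is only an assumption made so the loop can run end to end.
    """
    if not chunk:
        return "<silent>"
    mean_energy = sum(x * x for x in chunk) / len(chunk)
    return "<response>" if mean_energy > threshold else "<silent>"


def run_stream(
    samples: Sequence[float],
    sample_rate: int,
    policy: Callable[[Sequence[float]], str] = energy_policy,
) -> List[str]:
    """Emit one decision token per 0.4-second chunk of the stream."""
    return [policy(chunk) for chunk in chunk_audio(samples, sample_rate)]
```

With an assumed 16 kHz stream, each chunk holds 6,400 samples, so `run_stream(samples, 16000)` returns one token for every 6,400 samples of input, plus one for any shorter final chunk.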
Summary by ByteBrief