
Mixture-of-experts (MoE) models make it possible to run large AI models on consumer GPUs with 16GB of VRAM. Instead of using all 14 billion parameters, an MoE model activates only a subset of them for each token it processes. This reportedly reduces memory demand by up to 70% compared with traditional dense models, so users can run 14B models locally without needing 24GB or 32GB of VRAM. The savings stem from expert routing, a mechanism that dynamically selects which parameter subset to use, and the result is broader access to local AI for developers and creators.
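To make the routing idea concrete, here is a minimal sketch of top-k expert routing in plain Python. It is an illustration under stated assumptions, not the implementation of any particular model: the expert count, the `top_k` value, the toy "experts" (simple scaling functions), and the router scores are all hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, top_k=2):
    """Pick the top_k highest-scoring experts and renormalize their weights."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}

def moe_layer(token, experts, router_logits, top_k=2):
    """Run only the selected experts and mix their outputs by router weight."""
    weights = route_token(router_logits, top_k)
    out = [0.0] * len(token)
    for idx, w in weights.items():
        y = experts[idx](token)  # experts that were not chosen are never called
        out = [o + w * v for o, v in zip(out, y)]
    return out, sorted(weights)

# Hypothetical toy setup: 8 experts, each scaling its input by a different factor.
NUM_EXPERTS = 8
experts = [lambda x, s=s: [s * v for v in x] for s in range(1, NUM_EXPERTS + 1)]

token = [1.0, 2.0]
# Assumed router scores for this one token; a real router computes them from the token.
router_logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 0.2]

output, active = moe_layer(token, experts, router_logits, top_k=2)
print(active)  # indices of the experts that actually ran
print(output)
```

In this toy run only 2 of the 8 experts execute for the token. The sketch models the sparsity of computation only; how the weights of inactive experts are stored or offloaded is outside what it shows.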
Introducing Mellum2: A 12B Mixture-of-Experts Model by JetBrains
Summary by ByteBrief