
MiniMax released MiniMax Sparse Attention, a two-branch sparse attention method trained on a 109B-parameter Mixture-of-Experts model using 3T tokens. It splits attention into an Index Branch and a Main Branch to reduce the quadratic cost of softmax attention. The method powers MiniMax-M3, a production model with an open-sourced inference kernel.
Summary by ByteBrief
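
The summary does not say how the two branches interact, so the following is only an illustrative sketch of one common way a two-branch sparse attention can work, not MiniMax's actual design or kernel. It assumes the Index Branch scores keys with small, cheap projections and keeps the top-scoring ones, and that the Main Branch then runs ordinary softmax attention over only those keys. The function name, the `idx_q`/`idx_k` inputs, and the `top_k` budget are all hypothetical.

```python
import numpy as np


def two_branch_sparse_attention(q, k, v, idx_q, idx_k, top_k):
    """Illustrative two-branch sparse attention (assumed design, not MiniMax's).

    q, k:         (n, d) main-branch queries and keys
    v:            (n, dv) values
    idx_q, idx_k: (n, di) low-dimensional index-branch projections (hypothetical)
    top_k:        number of keys each query may attend to
    """
    n, d = q.shape

    # Index branch (assumed): cheap relevance scores, masked to be causal.
    index_scores = idx_q @ idx_k.T
    causal = np.tril(np.ones((n, n), dtype=bool))
    index_scores = np.where(causal, index_scores, -np.inf)

    out = np.zeros_like(v)
    for i in range(n):
        # Keep at most top_k of the i + 1 keys visible to query i.
        budget = min(top_k, i + 1)
        selected = np.argsort(index_scores[i])[-budget:]

        # Main branch: softmax attention restricted to the selected keys.
        logits = (k[selected] @ q[i]) / np.sqrt(d)
        weights = np.exp(logits - logits.max())
        weights /= weights.sum()
        out[i] = weights @ v[selected]
    return out
```

With `top_k` equal to the sequence length, this reduces to dense causal attention. With a fixed `top_k`, the softmax work per query no longer grows with sequence length. This toy version still computes a full index score matrix, so any real saving would depend on the index branch being much cheaper than the main one, which is an assumption here.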