AI · MarkTechPost · about 7 hours ago

MiniMax releases MSA sparse attention for 109B MoE model

8 min read

MiniMax has released MiniMax Sparse Attention (MSA), a two-branch sparse attention method trained on a 109B-parameter Mixture-of-Experts (MoE) model using 3T tokens. MSA splits attention into an Index Branch and a Main Branch to reduce the quadratic cost of softmax attention. The method powers MiniMax-M3, a production model with an open-sourced inference kernel.
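
The summary does not say how the two branches interact, so the following is only a rough sketch of one common index-then-attend pattern, not MiniMax's actual design. It assumes the index branch scores every key position with a cheap low-dimensional dot product and keeps the top-k positions, and that the main branch then runs ordinary softmax attention over just those positions. The function name `two_branch_attention`, the `index_q` and `index_keys` inputs, and the `top_k` parameter are all hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def two_branch_attention(q, keys, values, index_q, index_keys, top_k):
    """Hypothetical index-then-attend sketch for a single query vector.

    Assumption (not from the source): the index branch picks positions,
    and the main branch attends only to the picked positions.
    """
    # Index branch (assumed): cheap scores over all positions, keep the top_k.
    index_scores = [dot(index_q, ik) for ik in index_keys]
    ranked = sorted(range(len(keys)), key=lambda i: index_scores[i], reverse=True)
    selected = sorted(ranked[:top_k])

    # Main branch (assumed): scaled softmax attention over selected positions only.
    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([dot(q, keys[i]) * scale for i in selected])

    # Weighted sum of the selected value vectors.
    dim = len(values[0])
    out = [sum(w * values[i][d] for w, i in zip(weights, selected)) for d in range(dim)]
    return out, selected
```

Under these assumptions, the main branch does softmax work proportional to `top_k` per query rather than to the full sequence length, which is where the saving over dense attention would come from. The index branch still visits every position, but with smaller vectors.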

#msa #minimax #m3
