
DFlash drafts entire blocks of tokens in parallel rather than one at a time, achieving up to 15x higher throughput on NVIDIA Blackwell GPUs. The method keeps output lossless: a small draft model proposes a block of future tokens, and a large target model verifies all of them at once. This addresses the serial bottleneck that leaves modern GPUs underused during autoregressive generation.
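The draft-then-verify loop described above can be sketched in a few lines. This is a minimal illustration of generic greedy speculative decoding, not DFlash's actual drafter or acceptance rule. The names `verify_block`, `generate`, `toy_target_next`, and `toy_draft_block` are hypothetical, and the toy "models" are plain functions over integer tokens. A real system would score the whole block in one target forward pass; the loop here only makes the acceptance logic explicit.

```python
from typing import Callable, List

Token = int
# Hypothetical stand-ins: the target returns its greedy next token for a prefix,
# the drafter returns a whole block of proposed tokens at once.
TargetNext = Callable[[List[Token]], Token]
DraftBlock = Callable[[List[Token], int], List[Token]]


def verify_block(prefix: List[Token], block: List[Token],
                 target_next: TargetNext) -> List[Token]:
    """Keep drafted tokens while they match the target's greedy choice."""
    accepted: List[Token] = []
    for tok in block:
        expected = target_next(prefix + accepted)
        if tok != expected:
            # First mismatch: take the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
    # Whole block accepted: the target contributes one extra token.
    accepted.append(target_next(prefix + accepted))
    return accepted


def generate(prompt: List[Token], draft_block: DraftBlock,
             target_next: TargetNext, block_size: int,
             max_new_tokens: int) -> List[Token]:
    """Alternate drafting and verification until enough tokens exist."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        block = draft_block(out, block_size)
        out.extend(verify_block(out, block, target_next))
    return out[:len(prompt) + max_new_tokens]


def toy_target_next(prefix: List[Token]) -> Token:
    # Toy target: counts upward modulo 10.
    return (prefix[-1] + 1) % 10


def toy_draft_block(prefix: List[Token], k: int) -> List[Token]:
    # Toy drafter: right for the first two positions, then guesses 0.
    block, last = [], prefix[-1]
    for i in range(k):
        last = (last + 1) % 10 if i < 2 else 0
        block.append(last)
    return block


result = generate([0], toy_draft_block, toy_target_next,
                  block_size=4, max_new_tokens=8)
```

Because every emitted token is either confirmed or supplied by the target, the result matches what the target alone would produce by greedy decoding. The drafter's quality only changes how many tokens each round yields.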
Summary by ByteBrief