GateGPT achieves 56,000 tokens per second for a Transformer with KV cache on an FPGA running at 80 MHz. The design demonstrates high-throughput inference using a custom hardware accelerator. This implementation targets efficient deployment of large language models on reconfigurable logic.
Tap to vote and see what everyone thinks.
Summary by ByteBrief
Google's DiffusionGemma generates 256 tokens in parallel