The authors replaced a hand-written matmul-add pair with nn.Linear(bias=True) and stacked three layers with activations to form an MLP block. They used an NVIDIA A100-SXM4-80GB GPU to run the scripts. The post builds on Part 1's profiler trace analysis, covering CPU dispatch and torch.compile internals.
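As a rough illustration of the refactor described above, the sketch below compares a hand-written matmul-add pair with an nn.Linear(bias=True) layer carrying the same weights, then stacks three such layers into a small MLP block. The tensor sizes, the choice of GELU as the activation, and the absence of an activation after the last layer are assumptions made for the example; the post's actual dimensions and activation function are not stated here. The code falls back to CPU when no GPU is available.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes, chosen only for illustration.
in_features, hidden, out_features = 16, 32, 8
x = torch.randn(4, in_features)

# Hand-written matmul-add pair: y = x @ W^T + b
W = torch.randn(out_features, in_features)
b = torch.randn(out_features)
y_manual = torch.matmul(x, W.t()) + b

# The same computation expressed as nn.Linear with a bias term.
linear = nn.Linear(in_features, out_features, bias=True)
with torch.no_grad():
    linear.weight.copy_(W)  # nn.Linear stores weight as (out_features, in_features)
    linear.bias.copy_(b)
y_linear = linear(x)

# Both paths should agree up to floating-point rounding.
matches = torch.allclose(y_manual, y_linear, atol=1e-5)

# Three linear layers with activations in between form the MLP block.
# GELU is an assumed activation, not one confirmed by the post.
mlp = nn.Sequential(
    nn.Linear(in_features, hidden),
    nn.GELU(),
    nn.Linear(hidden, hidden),
    nn.GELU(),
    nn.Linear(hidden, out_features),
)

device = "cuda" if torch.cuda.is_available() else "cpu"
mlp = mlp.to(device)
out = mlp(x.to(device))
mlp_out_shape = tuple(out.shape)
```

Copying the weights into the layer is only there to make the equivalence check meaningful; in normal use nn.Linear initializes its own parameters.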