
The AA-Briefcase benchmark tests AI models on multi-week projects with thousands of fragmented files. The top model, Claude Fable 5, fully solves just 3 percent of the 91 tasks. On 31 tasks, no model exceeds 50 percent. Per-task costs range from $0.04 for DeepSeek V4 Flash to over $31 for Claude Fable 5.
Summary by ByteBrief