AI · The Decoder · about 2 hours ago

AI models fail 97% of real knowledge work tasks

The AA-Briefcase benchmark tests AI on multi-week projects with thousands of fragmented files. Top model Claude Fable 5 fully solves just 3 percent of 91 tasks. On 31 tasks, no model exceeds 50 percent. Per-task costs range from $0.04 for DeepSeek V4 Flash to over $31 for Claude Fable 5.

#ai #benchmark #artificial analysis
