1 story in the last 7 days
The latest ai-benchmarks news, distilled by AI into sharp ~100-word summaries. ByteBrief tracks ai-benchmarks across dozens of tech sources and brings you only what matters, updated hourly. Tap any story for the full brief, or open the original source.

OpenAI's GPT-5.5 from April, using the Codex harness, scored 24.0% on the new Agents' Last Exam benchmark, beating Anthropic's Mythos-class Claude Fable 5. The benchmark, created by UC Berkeley's RDI and over 300 experts, measures AI performance on long-horizon professional workflows.
Summaries by ByteBrief