
New AI benchmarks prioritize consistency over memorization. The tests evaluate how well models maintain logical coherence across long sequences, and the results show that models perform poorly when asked to follow multi-step instructions. The benchmarks include 100 task chains with 500+ steps. This shift helps developers identify models that avoid hallucination. The evaluation framework is used by OpenAI and Anthropic.
Summary by ByteBrief