OpenAI released a 750-task benchmark to evaluate AI performance on real-world life science research. Its top model, GPT-Rosalind, passed only 36.1% of the tasks and failed the remaining 63.9%. The results suggest that AI struggles with complex experimental design when inputs are given in natural language rather than LaTeX.
Summary by ByteBrief