OpenAI released LifeSciBench, a benchmark of 750 expert-written tasks spanning seven biological domains and seven scientific workflows. Each task includes a prompt, artifacts, and a rubric averaging about 25 criteria. Tasks average four reasoning steps, and 53% require at least one artifact. In total, the benchmark contains 1,062 artifacts and 19,020 rubric criteria. Models pass only about one in three tasks.
Summary by ByteBrief
Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models