In a landmark result for AI reasoning, Claude 5 has scored 87.3% on the GPQA Diamond benchmark, a test designed to evaluate genuine scientific reasoning across biology, chemistry, and physics.
This score exceeds the previous record of 79.2% by 8.1 percentage points, reaching a level that researchers had estimated was two to three years away. The improvement represents roughly four years of prior progress condensed into a single model update.
What is GPQA Diamond?
The GPQA (Graduate-Level Google-Proof Q&A) Diamond benchmark consists of questions that typically require 2-3 hours for a PhD-level expert to answer correctly. These questions are designed to be resistant to simple information retrieval, requiring genuine scientific reasoning and domain expertise.
Extended Thinking Mode: The Key to Breakthrough
The breakthrough came primarily through inference-time optimization rather than simply scaling up model size. Claude 5's Extended Thinking mode achieved 87.3% compared to 72.1% in standard mode, a 15.2-point improvement from the same underlying model.
This suggests that giving AI systems more time to "think through" complex problems, rather than generating immediate responses, can dramatically improve reasoning capabilities on challenging tasks.
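To put the quoted scores in perspective, the short Python sketch below works through the arithmetic. It uses only the three figures reported in this article; the helper names `point_gain` and `error_reduction` are illustrative and do not come from Anthropic. Besides the gain in percentage points, it shows how much of the remaining error rate was eliminated, which is often a more telling measure near the top of a benchmark.

```python
# Scores quoted in the article (percent accuracy on GPQA Diamond).
PREVIOUS_RECORD = 79.2
STANDARD_MODE = 72.1
EXTENDED_THINKING = 87.3


def point_gain(new: float, old: float) -> float:
    """Absolute gain in percentage points, rounded to one decimal."""
    return round(new - old, 1)


def error_reduction(new: float, old: float) -> float:
    """Share of the old error rate that was eliminated, in percent."""
    old_error = 100.0 - old
    new_error = 100.0 - new
    return round(100.0 * (old_error - new_error) / old_error, 1)


print(point_gain(EXTENDED_THINKING, PREVIOUS_RECORD))      # 8.1
print(point_gain(EXTENDED_THINKING, STANDARD_MODE))        # 15.2
print(error_reduction(EXTENDED_THINKING, STANDARD_MODE))   # 54.5
print(error_reduction(EXTENDED_THINKING, PREVIOUS_RECORD)) # 38.9
```

On these numbers, Extended Thinking removes a little over half of the errors the same model makes in standard mode, and roughly 39% of the errors implied by the previous record.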
"This breakthrough confirms our thesis: reasoning is learnable, and scale alone was never the path forward."
— Anthropic's Chief Scientist
Implications for AI Development
The results have significant implications for the field of AI development:
- Compute allocation matters: How computational resources are used during inference may be as important as training scale
- Reasoning as a skill: Advanced reasoning appears to be something that can be explicitly developed and improved
- Practical applications: Scientific research, complex analysis, and technical problem-solving could see substantial improvements
What This Means for Users
For Claude users, this advancement translates to significantly more reliable answers on complex technical questions, better performance on multi-step reasoning tasks, and improved accuracy in specialized domains like science, mathematics, and engineering.
The Extended Thinking mode is now available to Claude Pro subscribers and API users, with Anthropic noting that users can expect the most dramatic improvements on questions that require careful analysis rather than quick factual recall.
Looking Forward
Anthropic has indicated that this represents just the beginning of their exploration into inference-time reasoning optimization. Future updates may include user-configurable thinking time, specialized reasoning modes for different domains, and further improvements to the Extended Thinking architecture.
Source: Claude5.ai