Reasoning Stress Tests
Standardised benchmarks for multi-step reasoning, code correctness, long-context retention, and tool-use accuracy across frontier models.
Multi-Step Reasoning
MATH + MMLU-Pro combined score (%)
Code Correctness
HumanEval pass@1 (%)
Long-Context Retention
RULER 128K score (%) — models with 128K+ context only
Tool-Use Accuracy
Berkeley Function-Calling Leaderboard (%)
Benchmark Sources
MATH + MMLU-Pro (Combined)
A composite score averaging accuracy on the MATH dataset (competition-level mathematics problems) and MMLU-Pro (a harder, roughly 12,000-question extension of MMLU with ten answer choices per question). Tests multi-step logical deduction, algebraic reasoning, and expert-domain knowledge across 14 subject categories.
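As an illustration, the sketch below computes the composite as an unweighted mean of the two benchmark accuracies; the equal weighting follows the "averaging" wording above, and the sample scores are hypothetical.

```python
def combined_score(math_acc: float, mmlu_pro_acc: float) -> float:
    """Average MATH and MMLU-Pro accuracy (both in %) into one composite score.
    Assumes an unweighted mean, as implied by the description above."""
    return (math_acc + mmlu_pro_acc) / 2

# Hypothetical example: 82.4% on MATH and 76.1% on MMLU-Pro.
print(combined_score(82.4, 76.1))  # 79.25
```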
HumanEval pass@1
OpenAI's benchmark of 164 handwritten Python programming problems. Models are asked to complete a function from its docstring. pass@1 measures the fraction of problems solved correctly on the first attempt, without sampling multiple completions. A strict, real-world measure of code generation quality.
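For reference, the unbiased pass@k estimator from the HumanEval paper is sketched below; with a single greedy completion per problem it reduces to the plain fraction of problems solved, which is what pass@1 reports here. The per-problem outcomes shown are illustrative only.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021):
    n = completions sampled per problem, c = completions passing all tests,
    k = evaluation budget. pass@k = 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With one completion per problem (n=1, k=1), pass@1 is simply the fraction
# of the 164 problems whose completion passes its unit tests.
results = [True, False, True, True]   # hypothetical per-problem outcomes
print(sum(results) / len(results))    # 0.75 -> reported as 75.0%
```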
RULER 128K
The RULER long-context benchmark evaluated at the 128K-token level. Tests long-range information retrieval, multi-hop reasoning over long documents, and robustness to distractor text injected into the context. Only models with a native context window of 128K tokens or more are included.
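A minimal sketch of the retrieval-style probe this kind of benchmark builds on is shown below: a "needle" fact is hidden at a chosen depth inside long filler text, and the model is credited only if its answer contains the fact. The prompt wording and the `generate` call are hypothetical placeholders, not the official RULER harness.

```python
def make_retrieval_probe(filler_sentences, needle, depth=0.5):
    """Hide a 'needle' fact at a relative depth inside long filler text,
    then ask the model to retrieve it."""
    pos = int(len(filler_sentences) * depth)
    haystack = filler_sentences[:pos] + [needle] + filler_sentences[pos:]
    context = " ".join(haystack)
    question = "What is the special magic number mentioned in the text?"
    return f"{context}\n\n{question}"

def score_retrieval(model_answer: str, expected: str) -> bool:
    """Credit the model only if the expected value appears in its answer."""
    return expected in model_answer

filler = [f"Filler sentence number {i}." for i in range(10_000)]
needle = "The special magic number is 7431."
prompt = make_retrieval_probe(filler, needle, depth=0.25)
# answer = generate(prompt)                 # hypothetical model call
# print(score_retrieval(answer, "7431"))
```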
Berkeley Function-Calling Leaderboard
Evaluates a model's ability to correctly invoke external APIs and tools from natural-language instructions, covering simple, multiple, parallel, and irrelevance-detection tool-call scenarios. Tests both the accuracy of tool selection and the correctness of argument extraction, which is critical for agentic workloads.
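As a rough illustration of this kind of scoring, the sketch below checks a predicted tool call against a ground-truth call, crediting tool selection and argument extraction separately. The function name, argument schema, and exact-match rule are assumptions for illustration, not the leaderboard's actual grader.

```python
import json

def score_tool_call(predicted: dict, expected: dict) -> dict:
    """Compare a predicted tool call to the ground truth: the tool name must
    match exactly, and the argument dict must match exactly (missing or extra
    arguments count as an argument error)."""
    name_ok = predicted.get("name") == expected["name"]
    args_ok = predicted.get("arguments", {}) == expected["arguments"]
    return {"tool_selection": name_ok, "arguments": args_ok,
            "correct": name_ok and args_ok}

# Hypothetical example: the model emits a JSON tool call for a weather API.
expected = {"name": "get_weather",
            "arguments": {"city": "Berlin", "unit": "celsius"}}
predicted = json.loads(
    '{"name": "get_weather", "arguments": {"city": "Berlin", "unit": "celsius"}}'
)
print(score_tool_call(predicted, expected))
# {'tool_selection': True, 'arguments': True, 'correct': True}
```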
Scores from published papers and public leaderboards. Last updated Q1 2026. Some results may vary across test conditions, prompt formats, and API versions. Models are ranked within each benchmark independently.