Reasoning Stress Tests
Standardised benchmarks for multi-step reasoning, code correctness, long-context retention, and tool-use accuracy across frontier models.
Multi-Step Reasoning
MATH + MMLU-Pro combined score (%)
Code Correctness
HumanEval pass@1 (%)
Long-Context Retention
RULER 128K score (%) — models with 128K+ context only
Tool-Use Accuracy
Berkeley Function-Calling Leaderboard (%)
Benchmark Sources
MATH + MMLU-Pro (Combined)
A composite score averaging accuracy on the MATH dataset (competition-level mathematics problems) and MMLU-Pro (a harder, roughly 12,000-question extension of MMLU with ten answer choices per question). Tests multi-step logical deduction, algebraic reasoning, and expert-domain knowledge across 14 subject categories.
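As an illustration, the sketch below computes the composite as an unweighted mean of the two benchmark accuracies; the equal weighting follows the "averaging" wording above, and the sample scores are hypothetical.

```python
def combined_score(math_acc: float, mmlu_pro_acc: float) -> float:
    """Average MATH and MMLU-Pro accuracy (both in %) into one composite score.
    Assumes an unweighted mean, as implied by the description above."""
    return (math_acc + mmlu_pro_acc) / 2

# Hypothetical example: 82.4% on MATH and 76.1% on MMLU-Pro.
print(combined_score(82.4, 76.1))  # 79.25
```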
HumanEval pass@1
OpenAI's benchmark of 164 handwritten Python programming problems. Models are asked to complete a function from its docstring. pass@1 measures the fraction of problems solved correctly on the first attempt, without sampling multiple completions. A strict, real-world measure of code generation quality.
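For reference, the unbiased pass@k estimator from the HumanEval paper is sketched below; with a single greedy completion per problem it reduces to the plain fraction of problems solved, which is what pass@1 reports here. The per-problem outcomes shown are illustrative only.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021):
    n = completions sampled per problem, c = completions passing all tests,
    k = evaluation budget. pass@k = 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With one completion per problem (n=1, k=1), pass@1 is simply the fraction
# of the 164 problems whose completion passes its unit tests.
results = [True, False, True, True]   # hypothetical per-problem outcomes
print(sum(results) / len(results))    # 0.75 -> reported as 75.0%
```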
RULER 128K
The RULER long-context benchmark evaluated at the 128K-token level. Tests long-range information retrieval, multi-hop reasoning over long documents, and robustness to distractor text injected into the context. Only models with a native context window of 128K tokens or more are included.
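A minimal sketch of the retrieval-style probe this kind of benchmark builds on is shown below: a "needle" fact is hidden at a chosen depth inside long filler text, and the model is credited only if its answer contains the fact. The prompt wording and the `generate` call are hypothetical placeholders, not the official RULER harness.

```python
def make_retrieval_probe(filler_sentences, needle, depth=0.5):
    """Hide a 'needle' fact at a relative depth inside long filler text,
    then ask the model to retrieve it."""
    pos = int(len(filler_sentences) * depth)
    haystack = filler_sentences[:pos] + [needle] + filler_sentences[pos:]
    context = " ".join(haystack)
    question = "What is the special magic number mentioned in the text?"
    return f"{context}\n\n{question}"

def score_retrieval(model_answer: str, expected: str) -> bool:
    """Credit the model only if the expected value appears in its answer."""
    return expected in model_answer

filler = [f"Filler sentence number {i}." for i in range(10_000)]
needle = "The special magic number is 7431."
prompt = make_retrieval_probe(filler, needle, depth=0.25)
# answer = generate(prompt)                 # hypothetical model call
# print(score_retrieval(answer, "7431"))
```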
Berkeley Function-Calling Leaderboard
Evaluates a model's ability to correctly invoke external APIs and tools from natural-language instructions, covering simple, multiple, parallel, and irrelevance-detection tool-call scenarios. Tests both the accuracy of tool selection and the correctness of argument extraction, which is critical for agentic workloads.
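As a rough illustration of this kind of scoring, the sketch below checks a predicted tool call against a ground-truth call, crediting tool selection and argument extraction separately. The function name, argument schema, and exact-match rule are assumptions for illustration, not the leaderboard's actual grader.

```python
import json

def score_tool_call(predicted: dict, expected: dict) -> dict:
    """Compare a predicted tool call to the ground truth: the tool name must
    match exactly, and the argument dict must match exactly (missing or extra
    arguments count as an argument error)."""
    name_ok = predicted.get("name") == expected["name"]
    args_ok = predicted.get("arguments", {}) == expected["arguments"]
    return {"tool_selection": name_ok, "arguments": args_ok,
            "correct": name_ok and args_ok}

# Hypothetical example: the model emits a JSON tool call for a weather API.
expected = {"name": "get_weather",
            "arguments": {"city": "Berlin", "unit": "celsius"}}
predicted = json.loads(
    '{"name": "get_weather", "arguments": {"city": "Berlin", "unit": "celsius"}}'
)
print(score_tool_call(predicted, expected))
# {'tool_selection': True, 'arguments': True, 'correct': True}
```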
Scores from published papers and public leaderboards. Last updated Q1 2026. Some results may vary across test conditions, prompt formats, and API versions. Models are ranked within each benchmark independently.