Benchmark
Category: Evaluation

A standardized test used to measure and compare AI model capabilities across specific tasks.
Full Explanation
Common benchmarks include:

- MMLU (Massive Multitask Language Understanding): college-level knowledge across many subjects
- HumanEval: coding ability
- MATH-500: competition mathematics
- MT-Bench: multi-turn conversation
- HellaSwag: commonsense reasoning

Benchmarks enable objective comparisons, but they can be "gamed" by training on similar data. Real-world performance often differs from benchmark scores.
For example, GPT-4o scores 86% on MMLU, Claude Opus 4 scores 88%, and Gemini 2.5 Pro scores 90%.
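Conceptually, most multiple-choice benchmarks reduce to scoring a model's answers against an answer key. The Python sketch below illustrates the idea under simplified assumptions: the two sample questions and the model_answer() stub are hypothetical placeholders for a real dataset and a real model call, and production harnesses additionally handle few-shot prompting, answer extraction, and per-subject aggregation.

```python
# Minimal sketch of multiple-choice benchmark scoring (MMLU-style).
# The questions and the model_answer() stub are hypothetical; a real
# evaluation harness would load the official dataset and query an
# actual model, then extract the chosen letter from its output.

questions = [
    {
        "prompt": "Which planet is closest to the Sun?\n"
                  "A) Venus\nB) Mercury\nC) Earth\nD) Mars",
        "answer": "B",
    },
    {
        "prompt": "2 + 2 = ?\nA) 3\nB) 4\nC) 5\nD) 22",
        "answer": "B",
    },
]

def model_answer(prompt: str) -> str:
    """Placeholder for a call to the model under test; returns a letter A-D."""
    return "B"  # stub so the sketch runs end to end

# Accuracy = fraction of questions where the model's letter matches the key.
correct = sum(model_answer(q["prompt"]) == q["answer"] for q in questions)
score = correct / len(questions)
print(f"Benchmark accuracy: {score:.1%}")
```

The reported headline numbers (like the MMLU percentages above) are essentially this accuracy computed over thousands of questions, which is also why training on similar data can inflate them without improving real-world behavior.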
Related Terms
Large Language Model (LLM): A type of AI model trained on vast amounts of text data that can generate, summarize, translate, and reason about language.
Reasoning Model: An AI model specifically optimized for multi-step logical reasoning, math, and complex problem-solving, typically by using chain-of-thought at inference time.