Benchmark
Category: Evaluation

A standardized test used to measure and compare AI model capabilities across specific tasks.
Full Explanation
Common benchmarks include:

- MMLU (Massive Multitask Language Understanding): college-level knowledge across many subjects
- HumanEval: coding ability
- MATH-500: competition mathematics
- MT-Bench: multi-turn conversation
- HellaSwag: commonsense reasoning

Benchmarks enable objective comparisons, but they can be "gamed" by training on similar data. Real-world performance often differs from benchmark scores.
For example, GPT-4o scores 86% on MMLU, Claude Opus 4 scores 88%, and Gemini 2.5 Pro scores 90%.
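Conceptually, most multiple-choice benchmarks reduce to scoring a model's answers against an answer key. The Python sketch below illustrates the idea under simplified assumptions: the two sample questions and the model_answer() stub are hypothetical placeholders for a real dataset and a real model call, and production harnesses additionally handle few-shot prompting, answer extraction, and per-subject aggregation.

```python
# Minimal sketch of multiple-choice benchmark scoring (MMLU-style).
# The questions and the model_answer() stub are hypothetical; a real
# evaluation harness would load the official dataset and query an
# actual model, then extract the chosen letter from its output.

questions = [
    {
        "prompt": "Which planet is closest to the Sun?\n"
                  "A) Venus\nB) Mercury\nC) Earth\nD) Mars",
        "answer": "B",
    },
    {
        "prompt": "2 + 2 = ?\nA) 3\nB) 4\nC) 5\nD) 22",
        "answer": "B",
    },
]

def model_answer(prompt: str) -> str:
    """Placeholder for a call to the model under test; returns a letter A-D."""
    return "B"  # stub so the sketch runs end to end

# Accuracy = fraction of questions where the model's letter matches the key.
correct = sum(model_answer(q["prompt"]) == q["answer"] for q in questions)
score = correct / len(questions)
print(f"Benchmark accuracy: {score:.1%}")
```

The reported headline numbers (like the MMLU percentages above) are essentially this accuracy computed over thousands of questions, which is also why training on similar data can inflate them without improving real-world behavior.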
Related Terms
Large Language Model (LLM): A type of AI model trained on vast amounts of text data that can generate, summarize, translate, and reason about language.
Reasoning Model: An AI model specifically optimized for multi-step logical reasoning, math, and complex problem-solving, typically by using chain-of-thought at inference time.