We Compare AI

RLHF

Training
Simple Definition

Reinforcement Learning from Human Feedback — a training technique used to align AI models with human preferences and values.

Full Explanation

RLHF typically proceeds in three stages:

1. Supervised fine-tuning of the base model on human demonstration data.
2. Training a reward model on human preference comparisons (which of two outputs is better?).
3. Using reinforcement learning (commonly PPO) to optimize the language model toward outputs the reward model scores highly.

This is how ChatGPT, Claude, and Gemini are aligned to be helpful and safe, rather than merely predicting the statistically most likely next text.
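The preference-comparison step is commonly trained with a Bradley-Terry style pairwise loss: the reward model is penalized when it fails to score the human-preferred output above the rejected one. Below is a minimal pure-Python sketch; `toy_reward` is a hypothetical stand-in for a learned reward model, used only for illustration:

```python
import math

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).
    Small when the reward model already ranks the human-preferred
    output higher; large when the ranking is reversed."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

def toy_reward(output: str) -> float:
    # Hypothetical stand-in for a learned reward model; scores by
    # length purely for demonstration purposes.
    return len(output) / 10.0

chosen = "a detailed, helpful answer"
rejected = "meh"

loss = preference_loss(toy_reward(chosen), toy_reward(rejected))
```

Minimizing this loss over many human comparison pairs is what teaches the reward model to mirror human preferences; the RL stage then maximizes that learned reward.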

Last verified: 2026-03-30