RLHF
Reinforcement Learning from Human Feedback — a training technique used to align AI models with human preferences and values.
Full Explanation
RLHF involves three stages:
1) Fine-tune the base model on demonstration data.
2) Train a reward model on human preference comparisons (which output is better?).
3) Use reinforcement learning to optimize the LLM to produce outputs the reward model scores highly.
This is how models such as ChatGPT, Claude, and Gemini are aligned to be helpful and safe, rather than merely predicting the statistically most likely text.
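The second and third stages above can be sketched in miniature. In this toy illustration, "responses" are just feature vectors, the reward model is a linear scorer trained with the Bradley-Terry preference loss, and a simple best-of-n selection stands in for the full reinforcement learning step (PPO in practice). All names and the setup here are illustrative assumptions, not any real RLHF library:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: each candidate "response" is a 3-d feature vector, and the
# (hidden) human preference favors a larger first feature. Everything
# here is illustrative, not a real RLHF implementation.

def reward(w, x):
    """Reward model: a linear score over response features."""
    return float(w @ x)

def train_reward_model(pairs, dim, lr=0.1, steps=200):
    """Stage 2: fit w so preferred responses score higher.

    Uses the Bradley-Terry model: p(chosen beats rejected) =
    sigmoid(r_chosen - r_rejected), trained by gradient ascent
    on the log-likelihood of the human comparisons.
    """
    w = np.zeros(dim)
    for _ in range(steps):
        for chosen, rejected in pairs:
            diff = reward(w, chosen) - reward(w, rejected)
            p = 1.0 / (1.0 + np.exp(-diff))
            w += lr * (1.0 - p) * (chosen - rejected)
    return w

# Simulated human comparisons: the first vector of each pair is preferred.
pairs = []
for _ in range(50):
    a, b = rng.normal(size=3), rng.normal(size=3)
    pairs.append((a, b) if a[0] >= b[0] else (b, a))

w = train_reward_model(pairs, dim=3)

# Stage 3 stand-in: instead of PPO, sample n candidate responses and
# return the one the learned reward model scores highest (best-of-n).
def best_of_n(n=8):
    candidates = rng.normal(size=(n, 3))
    return candidates[int(np.argmax(candidates @ w))]

picked = best_of_n()
```

After training, `w` puts most of its weight on the first feature (the one the simulated human preferred), so `best_of_n` tends to pick responses humans would rate highly. Real RLHF replaces the linear scorer with a neural reward model over text and the best-of-n step with policy-gradient optimization of the LLM itself.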
Related Terms
Fine-Tuning: Further training a pre-trained AI model on a smaller, task-specific dataset to specialize its behavior.
AI Alignment: The challenge of ensuring AI systems behave in accordance with human values, intentions, and societal well-being.
Constitutional AI: Anthropic's technique for training Claude to be helpful, harmless, and honest by having AI models critique and revise their own outputs based on a set of principles.