Alignment
The challenge of ensuring AI systems behave in accordance with human values, intentions, and societal well-being.
Full Explanation
AI alignment is one of the central problems in AI safety. A misaligned AI pursues goals that differ from its creators' intentions. This can arise through specification errors (we gave it the wrong goal), through capability overhang (the system is more capable than we realized, so flaws in its goal have larger consequences), or through mesa-optimization (the system develops its own sub-goals during training that diverge from the training objective). Current alignment techniques include RLHF, Constitutional AI, and scalable oversight.
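RLHF is typically implemented in two stages: a reward model is first trained on human preference comparisons, and the language model is then optimized against that reward model. The sketch below shows the standard pairwise (Bradley-Terry) loss used in the first stage; the tensor names and example scores are illustrative, not taken from any particular codebase.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_rewards: torch.Tensor,
                      rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Pairwise (Bradley-Terry) loss for training an RLHF reward model.

    chosen_rewards / rejected_rewards are the scalar scores the reward
    model assigns to the human-preferred and dispreferred responses.
    Minimizing this loss pushes the model to score preferred responses
    higher than dispreferred ones.
    """
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Illustrative usage with made-up scores for two preference pairs.
chosen = torch.tensor([1.2, 0.4])
rejected = torch.tensor([0.3, 0.9])
print(reward_model_loss(chosen, rejected).item())
```

The loss is low when the preferred response already outscores the dispreferred one, so gradient descent on it teaches the reward model to rank outputs the way human labelers do.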
Related Terms
Reinforcement Learning from Human Feedback (RLHF): a training technique used to align AI models with human preferences and values.
Constitutional AI: Anthropic's technique for training Claude to be helpful, harmless, and honest by having AI models critique and revise their own outputs based on a set of principles (see the sketch after this list).
AI Safety: the field of research and practice focused on building AI systems that are safe, reliable, controllable, and aligned with human values.
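The following is a minimal sketch of the critique-and-revise loop that Constitutional AI uses, under stated assumptions: generate() is a hypothetical stand-in for a call to some language model, and the single principle shown is illustrative, not Anthropic's actual constitution.

```python
# Hypothetical critique-and-revise loop in the style of Constitutional AI.
# `generate` is a placeholder for a language-model call; the principle
# below is an illustrative example, not Anthropic's published list.

PRINCIPLES = [
    "Choose the response that is most helpful, honest, and harmless.",
]

def generate(prompt: str) -> str:
    """Hypothetical helper: send `prompt` to a language model, return text."""
    raise NotImplementedError

def critique_and_revise(question: str, draft: str) -> str:
    revised = draft
    for principle in PRINCIPLES:
        # Ask the model to critique its own output against the principle.
        critique = generate(
            f"Question: {question}\nResponse: {revised}\n"
            f"Critique this response against the principle: {principle}"
        )
        # Ask the model to rewrite the output to address the critique.
        revised = generate(
            f"Question: {question}\nResponse: {revised}\n"
            f"Critique: {critique}\nRewrite the response to address the critique."
        )
    # The revised outputs become training data for supervised fine-tuning.
    return revised
```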