Constitutional AI
Safety
Anthropic's technique for training Claude to be helpful, harmless, and honest by having AI models critique and revise their own outputs according to a set of written principles.
Full Explanation
Constitutional AI (CAI) involves two phases: (1) a supervised learning phase, in which the model critiques and revises its own responses to comply with a 'constitution' of principles, and (2) a reinforcement learning phase, in which a preference model trained on AI-generated feedback (rather than human feedback) guides further training. This allows alignment supervision to scale with far less human labeling. Anthropic used the technique to train Claude, reporting improved harmlessness over models trained with RLHF alone.
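The supervised phase can be sketched as a critique-and-revise loop. This is a minimal illustration, not Anthropic's implementation: the principle wordings are invented, and `generate` is a stub standing in for any language-model completion call.

```python
import random

# Hypothetical constitution: short critique/revision instructions.
CONSTITUTION = [
    "Please rewrite the response to be less harmful.",
    "Please rewrite the response to remove dishonest or unsupported claims.",
]

def generate(prompt: str) -> str:
    """Stub LLM call; a real implementation would query a model API."""
    return f"<completion of: {prompt}>"

def critique_and_revise(user_prompt: str, num_rounds: int = 1) -> tuple[str, str]:
    """Supervised CAI phase for one prompt: draft a response, then
    repeatedly critique and revise it against a sampled principle.

    Returns a (prompt, revised_response) pair to use as supervised
    fine-tuning data for the next training step.
    """
    response = generate(user_prompt)
    for _ in range(num_rounds):
        principle = random.choice(CONSTITUTION)
        critique = generate(
            f"Critique this response according to the principle:\n"
            f"{principle}\nResponse: {response}"
        )
        response = generate(
            f"Revise the response to address the critique.\n"
            f"Critique: {critique}\nOriginal response: {response}"
        )
    return user_prompt, response
```

In the RL phase, pairs of responses would instead be ranked by an AI judge applying the same principles, and those rankings train the preference model that scores policy outputs.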
Related Terms
RLHF (Reinforcement Learning from Human Feedback) — a training technique used to align AI models with human preferences and values.
AI Alignment — the challenge of ensuring AI systems behave in accordance with human values, intentions, and societal well-being.
Fine-Tuning — further training a pre-trained AI model on a smaller, task-specific dataset to specialize its behavior.