Constitutional AI
Safety
Anthropic's technique for training Claude to be helpful, harmless, and honest by having AI models critique and revise their own outputs according to a set of written principles.
Full Explanation
Constitutional AI (CAI) involves two phases: (1) a supervised learning phase, in which the model critiques and revises its own responses to comply with a 'constitution' of principles, and (2) a reinforcement learning phase, in which a preference model trained on AI-generated feedback (rather than human feedback) guides further training. This allows alignment supervision to scale with far less human labeling. Anthropic used the technique to train Claude, reporting improved harmlessness over models trained with RLHF alone.
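The supervised phase can be sketched as a critique-and-revise loop. This is a minimal illustration, not Anthropic's implementation: the principle wordings are invented, and `generate` is a stub standing in for any language-model completion call.

```python
import random

# Hypothetical constitution: short critique/revision instructions.
CONSTITUTION = [
    "Please rewrite the response to be less harmful.",
    "Please rewrite the response to remove dishonest or unsupported claims.",
]

def generate(prompt: str) -> str:
    """Stub LLM call; a real implementation would query a model API."""
    return f"<completion of: {prompt}>"

def critique_and_revise(user_prompt: str, num_rounds: int = 1) -> tuple[str, str]:
    """Supervised CAI phase for one prompt: draft a response, then
    repeatedly critique and revise it against a sampled principle.

    Returns a (prompt, revised_response) pair to use as supervised
    fine-tuning data for the next training step.
    """
    response = generate(user_prompt)
    for _ in range(num_rounds):
        principle = random.choice(CONSTITUTION)
        critique = generate(
            f"Critique this response according to the principle:\n"
            f"{principle}\nResponse: {response}"
        )
        response = generate(
            f"Revise the response to address the critique.\n"
            f"Critique: {critique}\nOriginal response: {response}"
        )
    return user_prompt, response
```

In the RL phase, pairs of responses would instead be ranked by an AI judge applying the same principles, and those rankings train the preference model that scores policy outputs.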
Related Terms
RLHF (Reinforcement Learning from Human Feedback) — a training technique used to align AI models with human preferences and values.
AI Alignment — the challenge of ensuring AI systems behave in accordance with human values, intentions, and societal well-being.
Fine-Tuning — further training a pre-trained AI model on a smaller, task-specific dataset to specialize its behavior.