Constitutional AI Alignment
Epistemological status: existential AI safety suggestions as a set of bullet points

TL;DR: We need to be clear about what behavior we’re trying to train when we align AIs. Aligned behavior is not the default for an LLM whose behavior is distilled from ours, nor for an RL-trained maximizer. Anthropic published an early constitution, and recently a large update in Claude’s Constitution (a.k.a. “Soul Doc”). That includes not just what the AI should do, but explanations of how and why many decisions ...