Claude is Now Alignment-Pretrained
Anthropic are now actively using the approach to alignment often called “Alignment Pretraining” or “Safety Pretraining”: using Stochastic Gradient Descent on a large body of natural or synthetic documents that show the AI assistant doing the right thing in morally challenging situations. They tried this out, found that it works well and generalizes well, and they’re now using it. I’m absolutely delighted. I’ve been repeatedly advocating this approach on LessWrong and the Alignment Forum for a couple of...