Claude is Now Alignment-Pretrained

RogerDearnaley · LessWrong · Community · May 13, 2026

Anthropic are now actively using the approach to alignment often called “Alignment Pretraining” or “Safety Pretraining”: using Stochastic Gradient Descent on a large body of natural or synthetic documents that show the AI assistant doing the right thing in morally challenging situations. They tried it out, found that it works well and generalizes well, and are now using it. I’m absolutely delighted. I’ve been repeatedly advocating this approach on LessWrong and the Alignment Forum for a couple of...
