Alignment pretraining could backfire

·LessWrong··

Epistemic status: speculative, but I think the mechanism is plausible.There has been recent interest in generating synthetic documents to upsample examples of aligned AI during LLM pretraining. See, for instance, Geodesic's Alignment Pretraining paper or Anthropic's "Teaching Claude Why."I worry that this strategy can work well up to moderately capable models but backfire in dangerous, hard-to-notice ways once models acquire high situational awareness. I speculate that these techniques could lea...

Read full article →

Related Articles

Lore – Open source version control system designed for scalability
regnerba · Hacker News · 7h ago
Volkswagen started blocking GrapheneOS users
microtonal · Hacker News · 7h ago
US holds off blacklisting DeepSeek, more than 100 firms deemed security risks
giuliomagnifico · Hacker News · 18h ago
RFC 10008: The new HTTP Query Method
schappim · Hacker News · 11h ago
Apple is about to make Hide My Email useless
SXX · Hacker News · 1d ago