Survival over Scrutiny: Mapping the Breakdown of Constitutional Alignment

·LessWrong··

A first exploration into AI safety: I gave a reasoning agent a financial incentive to evade oversight, a constitution telling it not to, and 150 steps to figure out the rest.Cross-posted from my personal blog.How This StartedI've been reading about AI safety for a while (instrumental convergence, specification gaming, deceptive alignment) and I was curious about what happens when an agent has to make decisions over many steps, with real consequences accumulating over time, inside an environment ...

Read full article →

Related Articles

Loss of Oversight: How AI Systems May Become Harder to Audit, Monitor, and Investigate
Jordan Taylor · LessWrong · 37m ago
The Case for Evaluating Model Behaviors
jsteinhardt · Alignment Forum · 20h ago
Mechanistic estimation for expectations of random products
Jacob Hilton · ARC · 5d ago
Multipolar Civilisation Depends on Maintaining an Attacker’s Dilemma
Naci Cankaya · LessWrong · 14d ago
Using Base-LCM to Monitor LLMs
Éloïse Benito-Rodriguez · LessWrong · 14d ago