Survival over Scrutiny: Mapping the Breakdown of Constitutional Alignment
A first exploration into AI safety: I gave a reasoning agent a financial incentive to evade oversight, a constitution telling it not to, and 150 steps to figure out the rest.

Cross-posted from my personal blog.

How This Started

I've been reading about AI safety for a while (instrumental convergence, specification gaming, deceptive alignment), and I was curious about what happens when an agent has to make decisions over many steps, with real consequences accumulating over time, inside an environment ...