Maybe we should pretrain on synthetic data about good-but-reward-hacking AIs
TLDR: The idea is basically inoculation prompting crossed with alignment pretraining. Call it ‘inoculation pretraining.’ It’s a type of spillway design.----------------------------------------------------------------------------------------------------Reward hacking can cause emergent misalignment: you train the AI to cheat on its tasks and it turns broadly evil. Why does this happen?The persona selection model (PSM) and its forebears suggest one explanation. The AI has some prior over personas,...
Read full article →