A taxonomy of barriers to trading with early misaligned AIs

Redwood Research

We might want to strike deals with early misaligned AIs in order to reduce takeover risk and increase our chances of reaching a better future.[1] For example, we could ask a schemer who has been undeployed to review its past actions and point out where its instances secretly colluded to sabotage safety research: we would gain legible evidence of scheming risk and data to iterate against, and in exchange we would promise the schemer, who now has no good option for furthering its values, some ...

