Backdoors as an analogy for deceptive alignment

Jacob Hilton · ARC · AI Safety · September 6, 2024

ARC has released a paper, "Backdoor defense, learnability and obfuscation," in which we study a formal notion of backdoors in ML models. Part of our motivation is an analogy between backdoors and deceptive alignment: the possibility that an AI system would intentionally behave well in training in order to give itself the opportunity to behave uncooperatively later. In our paper, we prove several theoretical results that shed some light on possible mitigations for deceptive alignment, alb...


