Incriminating misaligned AI models via distillation
Suppose we have a dangerous misaligned AI that can fool alignment audits, and distill it into a student model. Two things can happen:

1. Misalignment fails to transfer to the student. If so, we get a fairly capable benign model.
2. Misalignment transfers to the student. The student might also be worse than the teacher at hiding its misalignment (e.g., due to being less capable). If so, we might get indirect evidence about the teacher’s misalignment by auditing the distilled model.

In this post, we will d...