The distillation double bind: Distilling misaligned models either transfers misalignment or it doesn't

·LessWrong··

Suppose we have a dangerous misaligned AI that can fool alignment audits, and distill it into a student model. Two things can happen:Misalignment doesn’t transfer to the student. If so, we get a fairly capable benign model, which we can use to perform tasks that we wouldn’t want a misaligned AI to perform.Misalignment transfers to the student. The student might also be worse than the teacher at hiding its misalignment (e.g., because it is less capable). If so, auditing the distilled model might ...

Read full article →

Related Articles

Swiss parliament lifts ban on new nuclear power plants
leonidasrup · Hacker News · 9h ago
Lore – Open source version control system designed for scalability
regnerba · Hacker News · 1d ago
AMD silently removes memory encryption from consumer Ryzen CPUs
lompad · Hacker News · 15h ago
Volkswagen started blocking GrapheneOS users
microtonal · Hacker News · 1d ago
DeepSeek Introduces Vision
RIshabh235 · Hacker News · 17h ago