How to reduce capability degradation from off-model SFT

Dylan Xu·LessWrong·Community·June 8, 2026

Off-model SFT (supervised fine-tuning on labels produced by a different model) could be an important approach for controlling AI behavior. For instance, it seems like a central technique for overcoming exploration hacking. Unfortunately, off-model SFT often substantially degrades capabilities, which makes it less useful as a control technique. However, prior research on off-model SFT suggests that it might not remove capabilities but merely suppress them. If this is true, we might be able to modify how we do off-model SFT to ac...


