Why does off-model SFT degrade capabilities?

SebastianP·LessWrong·Community·May 21, 2026

Off-model SFT (supervised fine-tuning on outputs generated by a different model) might be an important method for controlling AI behavior. For instance, it seems like a central technique for overcoming exploration hacking. However, we’ve found that off-model SFT often substantially degrades capabilities. We ran experiments to understand why. We tentatively believe that it’s because off-model SFT forces the model into an unfamiliar reasoning style that it’s bad at usi...
