How to reduce capability degradation from off-model SFT
Off-model SFT (supervised fine-tuning on labels generated by a different model) could be an important approach for controlling AI behavior. For instance, it seems to be a central technique for overcoming exploration hacking. Unfortunately, off-model SFT often substantially degrades capabilities, which makes it less useful as a control technique. However, prior research on off-model SFT suggests that it might not remove capabilities but merely suppress them. If this is true, we might be able to modify how we do off-model SFT to ac...