Why Do Naive SFT Filters For Safety Properties Fail?

Josh Engels · LessWrong · Community · June 14, 2026

This is the fourth in a series of informal research updates from the Google DeepMind Language Model Interpretability team, covering interpretability and adjacent areas. The third post can be found here.

Since SFT is the cause of many safety-relevant properties, a natural strategy is to filter out of the SFT data any rollouts that have undesirable properties. However, as we show in this section (and in forthcoming MATS work), SFT data filtering frequently works surprisingly poorly. In this post, we investigate hy...
