Why Do Naive SFT Filters For Safety Properties Fail?

LessWrong

This is the fourth in a series of informal research updates from the Google DeepMind Language Model Interpretability team, covering interpretability and adjacent areas. The third post can be found here.

Since SFT is the cause of many safety-relevant properties, a natural strategy is to filter rollouts with undesirable properties out of the SFT data. However, as we show in this section (and in forthcoming MATS work), SFT data filtering often works surprisingly poorly. In this post, we investigate hy...

