How useful is cross-domain generalization for training LLM monitors?

LessWrong

We study how well single-token classification training generalizes and find that:

- Control monitor performance can be improved by training on adjacent classification tasks (which is useful if we don't have high-quality in-domain data).
- We do not find strong benefits from classification-specialized models: training on instruction following after classification training not only keeps the uplift from classification training, but also mitigates some generalization failures that arise from training ...
