Predicting Rare LLM Failures with 30× Fewer Rollouts

LessWrong

TL;DR: We estimate how often Qwen 3 4B exhibits rare harmful behaviors with 30× fewer rollouts than naive sampling, using a new method that interpolates between the model and a less-safe variant in logit space.

Authors: Francisco Pernice (MIT), Santiago Aranguri (Goodfire)

Introduction

A harmful behavior that occurs once in a million rollouts will rarely surface during pre-deployment testing, yet will almost inevitably appear after release. Labs usually have another resource available: less safety-...
