The Pragmatic Interpretability Trap

·LessWrong··

TLDR: Pragmatic interp sounds great in the sense that you get to keep interp tools while actually moving safety metrics, but looking a bit closer it's kinda a trap. You pay interp's overhead but get judged against black-box baselines that don't, so the work that survives is whatever cleared that bar, not whatever produced understanding. The two scoreboards (understanding vs intervention) don't actually run side by side; the metric one eats the other, and stuff like NLAs end up failing both, you ...

Read full article →

Related Articles

“Beyond the limit”: Satellites and mirrors in space pose threat to the night sky
Breadmaker · Hacker News · 1d ago
GPT-5.5 Codex reasoning-token clustering may be leading to degraded performance
maille · Hacker News · 19h ago
Potential session/cache leakage between workspace instances or consumer accounts
chatmasta · Hacker News · 1d ago
EV Batteries Are Defying Expectations After Miles
apparent · Hacker News · 10h ago
Show HN: KiCad in the Browser
ViktorEE · Hacker News · 5h ago