The Pragmatic Interpretability Trap

LessWrong

TLDR: Pragmatic interp sounds great: you get to keep interp tools while actually moving safety metrics. On a closer look, though, it's something of a trap. You pay interp's overhead but get judged against black-box baselines that don't pay it, so the work that survives is whatever cleared that bar, not whatever produced understanding. The two scoreboards (understanding vs. intervention) don't actually run side by side; the metric scoreboard eats the other, and things like NLAs end up failing both, you ...
