The Pragmatic Interpretability Trap
TLDR: Pragmatic interp sounds great in the sense that you get to keep interp tools while actually moving safety metrics, but looking a bit closer it's kinda a trap. You pay interp's overhead but get judged against black-box baselines that don't, so the work that survives is whatever cleared that bar, not whatever produced understanding. The two scoreboards (understanding vs intervention) don't actually run side by side; the metric one eats the other, and stuff like NLAs end up failing both, you ...
Read full article →