The Pragmatic Interpretability Trap

Yogesh Prabhu·LessWrong·Community·May 11, 2026

TLDR: Pragmatic interp sounds great in the sense that you get to keep interp tools while actually moving safety metrics, but looking a bit closer it's kinda a trap. You pay interp's overhead but get judged against black-box baselines that don't, so the work that survives is whatever cleared that bar, not whatever produced understanding. The two scoreboards (understanding vs intervention) don't actually run side by side; the metric one eats the other, and stuff like NLAs end up failing both, you ...

Read full article →

The Pragmatic Interpretability Trap

Related Articles