Eval-Awareness Steering Detects the Test, Not the Sabotage

sahilraut · LessWrong · Community · June 25, 2026

Produced as part of independent research.

Huge thanks to Apollo Research for open-sourcing the deception-detection harness, which proved foundational to this work. Prior work by Devbunova (2026), the Apollo/Goldowsky-Dill probing line, and Tice et al. on noise injection shaped the design throughout.

Summary

I test whether the internal "I'm being evaluated" direction in an open-weight model causally drives sandbagging (deliberate underperformance) or merely correlates with it. I reuse Apol...


