Eval-Awareness Steering Detects the Test, Not the Sabotage
Produced as part of independent research.

Huge thanks to Apollo Research (org) for open-sourcing the deception-detection harness which proved to be foundational in this work. Prior work by Devbunova (2026), the Apollo/Goldowsky-Dill probing line, and Tice et al. on noise injection shaped the design throughout.

Summary

I test whether the internal "I'm being evaluated" direction in an open-weight model causally drives sandbagging (deliberate underperformance) or merely correlates with it. I reuse Apol...